WebQL Turns the Web Into A Giant Database 84

An anonymous reader says "This article was posted on ZDNet by Bill Machrone about a new type of query language for aggregating information from the Web." Somewhat light on the details, but definitely something to think about.
  • web-quel or web-Q-L?
    --

    A mind is a terrible thing to taste.

  • System Requirements:
    6GB HD suggested (10MB required to install)

    Must use 5.99 GB of virtual memory I guess.
  • Wait a second... so I'm supposed to believe that this goes onto the web and takes information off of it from other computers? Whoa... that's a great idea! Someone should patent it if Al Gore hasn't already. Who would have known that something like this could happen to the Internet.
  • I've done my share of "screen scraping"... that new buzzword where I grab the HTML and apply various forms of on-the-fly text processing in an attempt to grab the meat of whatever content is being presented.

    Remember, remote content is not under your control. It will change (often), is very likely not to have a nice structure, and is even more likely to contain mismatched tags and other errors.

    OK, it's in its infancy, but IMHO if/when XHTML is widely adopted, a special query language or tool will largely be irrelevant, because most of what is alleged in that brief article could be done in the magical wonderful world of XML.
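    For flavour, here is a minimal sketch of the kind of screen scraping described above (assuming the LWP::Simple module, and a made-up URL and markup); the regex is welded to the page's current layout and breaks the moment that layout changes:

    use strict;
    use LWP::Simple;    # simple HTTP fetching

    # Hypothetical page; the pattern assumes headlines look like
    # <h2 class="headline">...</h2> and fails as soon as the markup shifts.
    my $html = get('http://example.com/news.html')
        or die "couldn't fetch page\n";

    while ($html =~ m{<h2 class="headline">\s*(.*?)\s*</h2>}gis) {
        print "$1\n";    # crude: any nested tags come along for the ride
    }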

  • Seriously guys,

    A company comes out with a product that is at least fairly cool. Even if it does nothing that it says it will, it at least will turn some heads and get some attention. So at its worst, this is AMAZING PR for Linux in general. Can't you just picture it now:

    MS Droid: It's good, we don't have anything like it, let's buy the company.
    Tech Dude: Uhh, Sir, this runs on Linux. We *CAN'T* acquire them.

    Beyond that, this reaffirms the idea that Linux is a valid operating system for the enterprise, and MySQL a valid database solution. Even though I know this, and you probably know this, what matters is that the VPs and VCs know this. My company is forced to use Oracle because our clients "don't trust PostgreSQL nor MySQL."

    So the way I see it, we win!

    --Alex the Fishman, (very proud fishman)
  • I followed the ZD links at the bottom to this page [zdnet.com] and it waxes lyrical about the ability to pull every phone number off a web site. Replace phone number with email address, and it seems a bit worrying. Couple that with the SQL programmability, and we could be looking at something that can auto-harvest de-anti-SPAMmed addresses, perhaps?
  • WebQL Turns the Web into a Giant Database
    Isn't the Web really a giant database without 'WebQL'??? If the web isn't already a database of sorts, then what is it???

  • For once, innovation in Internet software! Instead of somebody cranking out yet another Internet file sharing service or portal, a company is finally doing something creative. This is one of those "why-didn't-I-think-of-that" ideas. If this really works, they could make a fortune.

    What's great is that this program could, if scaled down into a personal edition, make the Internet much more accessible and usable for novices. An acquaintance once asked me to show him how to do basically the same thing: he wanted a macro to automatically import his stock portfolio information from a web page into a spreadsheet. He was kind of spoiled by a simple TRS-80 BASIC program he had used in the heyday of computing, which automatically dialed up Compuserve (or some other online service, I don't recall which) and downloaded and parsed stock information from one of the Compuserve forums. I had to tell him that there's no simple way to parse HTML like that, but it would have been much better if I could have pointed him to a personal version of WebQL to use for exactly this. Just imagine if Web sites could make ready-made WebQL scripts for their portals for users to download and use with their favorite spreadsheet or database.

    Not to mention what I could do with such a fun toy. :) If they make a personal edition, and it works as advertised, I'll buy a copy for sure.

  • The idea is this: you want to link everyone's databases together. But linking databases is a sensitive (security-wise, for one) issue, and you have to have agreements on a one-by-one basis.

    The breakthrough is that you notice almost everyone (those significant enough) has a web front end to their database. Now if you can just go via that web front end, you don't have to go direct to the database and can bypass all the above issues!

    The first company that I know of that does this is an Israeli company: Orsus.com [orsus.com]. Since then, OnePage.com [onepage.com] also does it.

    What they (at least Orsus) did was build a language (based on XML) that instructs a web spidering engine that has the ability to parse HTML and JavaScript. A GUI IDE (no less) is used by the layperson to write the XML-based code.

  • ...this could be related to a recent news article on Dr. Robert Kahn.

    An excerpt - I'm too tired to find the link...

    THE COPYRIGHT CONUNDRUM

    Another problem is with copyrights and other protections of intellectual property. As we have learned from the recent Napster battles, this can be a real sticky wicket on the Net, where users can so easily and freely trade files regardless of any such protection. Because the current system is intentionally oblivious to what's in those Internet packets being transferred, there's no easy way to protect copyrighted data.

    With all that in mind, Kahn decided it was necessary to develop a new framework for dealing with all the information on the Internet that would act as a layer above the existing infrastructure but deal with the "what" as much as the "where." So through the 90s, while most of the world was just discovering the Internet, Kahn was working on how to reinvent it.

    His new system is called the "Handle System." Instead of identifying the place a file is going to or coming from, it assigns an identifier called a "handle" to the information itself, called "digital objects." A digital object is anything that can be stored on a computer: a web page, a music file, a video file, a book or chapter of a book, your dental x-rays - you name it. Similar to the way a host name is resolved to an IP address, the handle will be resolved into information the computer needs to know about the object. Only since the information is now about the object, the location of the object is just one of the bits of information that is important. The handle record will also tell the computer things like what kind of file the object is, how often it will be updated and how the object is allowed to be used - whether there are any copyright or privacy protections. The record can also have any industry-specific information about the object, like a book's International Standard Book Number (ISBN) code.

    There are two other crucial things about these records: First, each handle can have multiple records associated with it - allowing multiple copies of the same information to be stored on different servers or for different systems. Second, the handle record is updated by the owner of the information - something in stark contrast to the host-name data, which is updated by central repository companies like Network Solutions. This will make things like changing the location of the file much more seamless, rather than waiting days for a new IP address to be updated in the DNS server.

    Wouldn't merging the querying features with the above "Handle System" seem a wise thing to do? Maybe that's what it already does...

  • Never use select * from... for production queries if you can help it. It's bad style. If you change your schema to include more columns, you can wind up returning more data to the front end than you need to. This causes errors in display at worst and wastes bandwidth at best.
    If you have different developers working on the front end and the database, this will really make them hate each other. It also makes the query optimizer work harder than it needs to (the amount of CPU wasted this way is totally insignificant, but it's bad form anyway).
    Also, if you're going to run select * from internet without a where clause, be prepared for an extremely long-running query.
    --Shoeboy
  • 10MB for the program... other HD space required for storing data
  • by Shoeboy ( 16224 ) on Sunday November 26, 2000 @07:14PM (#599522) Homepage
    ROTFLMAO!!!!!!!
    Good god! An "Al Gore invented the Internet" joke combined with a "stupid patent idea" joke! The originality of the average slashbot never ceases to amaze me! You should send some of your jokes to Illiad so he can put them in User Friendly!
    --Shoeboy
  • SELECT * FROM Internet WHERE SubjectOfPic = "Natalie Portman" AND Grits = "Hot Pouring"
  • by 1010011010 ( 53039 ) on Sunday November 26, 2000 @07:17PM (#599524) Homepage
    How will this be different from Google's back-end query interface? I ask because I can't imagine someone making a "screen-scraping" search engine that returns bits of data and not just a link. They will probably get sued by the owners of the purloined content. Plus, parsing HTML to extract one little field of data is tricky, and highly dependent on the layout of the page. I've written a number of things to do just that, from Amazon, IMDB, Borders, finance.Yahoo.com, etc., for my own purposes. I wrote them in both C and Perl. It's a job keeping the filters updated to accommodate the changes in page layout style, regardless of language. Good luck to them and all, but until we have an XML + XSL web, with standard DTDs for the XML, forget it.


    ________________________________________
  • cache? download? internet? i must go consult dictionary.com and i'll get back to you.
  • MS Droid: It's good, we don't have anything like it, let's buy the company.
    Tech Dude: Uhh, Sir, this runs on Linux. We *CAN'T* acquire them.

    Unfortunately companies whose products run on Linux can be bought in exactly the same way as other companies. Just ask Cobalt.

    Those products can be ported to other operating systems too, even Windows. (Shocking, but not every port is a game port to Linux)

  • Correction. You may GNU/FreeASK for forgiveness.
  • WQL - wackle
    SQL - squeal
  • My company is forced to use Oracle because our clients "don't trust PostgreSQL nor MySQL." Hmmm. Why? PostgreSQL is a robust, transactional RDBMS that is unfortunately missing the PHB name recognition so vital today. MySQL, on the other hand, is JustAnotherFileSystemAPI. Don't need transactions now, you say? ACID? Who needs that?
  • The appropriate response to an ASK is a NASK.
  • I don't know about you guys, but without going into detail on the product, this sounds a heap like NQL which itself is a great tool that I have used.

    I dunno, just seems to do the same stuff...

  • by _|()|\| ( 159991 ) on Sunday November 26, 2000 @07:44PM (#599532)
    I grab the HTML and apply various forms of on-the-fly text processing

    I downloaded the WebQL Business Edition manual. Here's an abbreviated version of the first example query:

    select
    text("(\(206\)\s+\d{3}-\d{4})","","","T")
    from
    http://foo/bar.html
    where
    approach=sequence("1","10","1","XX")
    The select clause accepts a variety of functions, of which text() seems to be the most useful. You can see that the first argument is a regex designed to match phone numbers. The from clause is a URL. The where clause primarily takes the approach "descriptor," which can crawl or guess new URLs.

    So basically, it doesn't do anything a Perl script can't. It just presents a simpler interface.
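    For comparison, here is a rough Perl equivalent of that first manual example (ignoring the URL-sequencing in the where clause), assuming LWP::Simple and using the same placeholder URL:

    use strict;
    use LWP::Simple;

    # Same (206) phone-number pattern used by the WebQL text() call above.
    my $page = get('http://foo/bar.html') or die "fetch failed\n";
    while ($page =~ /(\(206\)\s+\d{3}-\d{4})/g) {
        print "$1\n";
    }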

  • After browsing for several minutes (an eternity in /. time) I have found almost zero information about it. Just "When you see it, you'll want it" crap. It appears, from what I can scrape together, that there must be some spider that populates a database and you run queries against it. Does this sound familiar? If you've ever been on the Internet, it had better. It doesn't seem to be any more than a personal search engine; after all, I can't believe that you'd just type in something and it'd magically pull up websites you didn't already tell it to search, either directly or indirectly through links.

    What this reminds me of is those wolf programs from a couple of years back. They made FTPWolf, WebWolf, MP3Wolf, WarezWolf and some others. They were just little client spiders that scoured webpages for keywords and followed links. They started with search engines and followed the results to other pages and followed those pages, and on and on. Nothing that hasn't been done before.

    The part that would make it useful, and which they claim to do, is comparative searches, e.g. show me all the latest P4 prices by vendor, or what's the difference between these 3 drills. They mention on the products page that anyone familiar with Perl and HTML can use it in no time. I would think anyone familiar with Perl and the LWP or Net::FTP modules could create this system in no time. They say there are some wizards, but when have wizards been able to really do what a user wants?

  • Sorry, I forgot to also let you guys know you can find out more about NQL (Network Query Language ) here [networkquerylanguage.com]
  • I agree that this represents another step towards the acceptance of Linux as an enterprise operating system, but I feel that you were a little too superficial in your analysis of open source database solutions versus commercial solutions.

    I don't have much experience with PostgreSQL, but MySQL, while an excellent database solution, simply is not fully comparable to Oracle. I don't doubt that many companies use an expensive proprietary solution where an open source one would suffice, but there are applications that simply require a commercial system such as Oracle. One of the difficulties is that MySQL simply doesn't scale as well. For smaller solutions, it may be ideal, but for a massive database that must accommodate hundreds of simultaneous users, MySQL's locking limitations become both apparent and debilitating.

    Personally, I look forward to the time when open source software solutions will make commercial database applications obsolete, but unfortunately, that time hasn't come yet.
  • select
    text("(\(206\)\s+\d{3}-\d{4})","","","T")
    from
    http://foo/bar.html
    where
    approach=sequence("1","10","1","XX")


    I wouldn't exactly call that a simple interface! ;-)

  • Oh, ye gods, if only I had some mod points! And if I could use them all to mod this post up to its rightful level! Curse you, fate! CURSE YOU!!!!
  • Just imagine if Web sites could make ready-made WebQL scripts for their portals for users to download and use with their favorite spreadsheet or database.

    You mean exactly like RSS? It's XML-based and numerous web sites (including Slashdot and Freshmeat) use it. That's how Slashdot's Slashboxes work. You could set it up like so: a user creates an account @ foo.com. The user creates a portfolio (say, when buying stocks, or to get news updates on those companies). The website gives you the option of getting automatically regenerated RSS summaries of your portfolio and the latest prices (as well as RSS of market indices, currency/commodities, etc.). You can use any scripting language with an XML parser (i.e., basically any) to process the XML and present it to the user. And all this without needing to buy a $499 software product, or use 6GB of local disk space.

    From what I've heard, WebQL doesn't sound that innovative (if it even works at all). But since there was no technical info in the article, it's hard to tell.
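    As a rough illustration of the RSS route (a sketch only, assuming the XML::RSS and LWP::Simple modules and a made-up feed URL), pulling items out of a feed is a few lines of Perl:

    use strict;
    use LWP::Simple;
    use XML::RSS;

    # Hypothetical portfolio/news feed published by the site as RSS.
    my $feed = get('http://foo.example.com/portfolio.rss')
        or die "couldn't fetch feed\n";

    my $rss = XML::RSS->new;
    $rss->parse($feed);                        # parse the RSS document

    foreach my $item (@{ $rss->{items} }) {    # each item has title/link/description
        print "$item->{title}\t$item->{link}\n";
    }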

  • by Ellen Spertus ( 31819 ) on Sunday November 26, 2000 @08:00PM (#599539) Homepage

    For my thesis [mills.edu], I created a Web query system called ParaSite. The best introduction is the paper Squeal: A Structured Query Language for the Web [www9.org], which I presented at the World-Wide Web Conference [www9.org]. Anybody is welcome to use my code, algorithms, or ideas.

    See also WebSQL [toronto.edu] and W3QL [technion.ac.il], which also come from academia.

  • Here is a link to the ZD review of the Biz version.

    meow [zdnet.com]

    Diffs between this and Google, for instance, abound. Central is the fact that it's not limited to URLs.

    "Version 1.0 of WebQL uses a wizard to simplify writing queries, but only users with SQL experience will be able to create useful queries. (Ordinary-language queries will be supported in future versions.) The wizard lets you select whether to return text, URLS, table rows or columns, or any combination thereof. You can then specify to search for text, regular expressions, or table cells, and you can add refinements such as case sensitivity and the number of matches returned per page."

    I will buy this when it supports ordinary language queries.

    Through its access to directories will this thing allow you to bypass registrations on all sites? Pay sites?

    How about an image search? (Since people don't name their files informatively all the time..)

  • Too much like Microsoft.NET?
  • I can't imagine someone making a "screen-scraping" search engine that returns bits of data and not just a link. They will probably get sued by the owners of the purloined content.

    I don't see any indication that Caesius intends to start such a search engine. WebQL is just a web crawler.

    If someone did, the primary defense would be fair use. Some search engines already display an abstract in the search results. On the other hand, I think eBay won a case against (or bullied into submission) a site that crawled their auctions. U.S. courts don't seem to like deep linking, let alone data extraction. Something about the God-given right to banner ad impressions. Next thing you know, a U.S. Marshall will break down my door because I'm using the Internet Junkbuster proxy [junkbusters.com]. I did post anonymously, right?

  • I wouldn't exactly call that a simple interface! ;-)

    Like they say about Perl: it makes the easy things easy, and the hard things possible. The average SQL user could probably learn WebQL syntax, although regexes can be complicated. (I didn't realize how complicated until I read Mastering Regular Expressions [oreilly.com].) On the other hand, writing a web crawler in Perl may be beyond his reach.

    That said, it wouldn't take much work to cook up a little language like this to wrap a Perl web crawler. I certainly wouldn't pay $500 for this proprietary package.

  • I'm sorry, but I just can't seem to get excited. I think time would be better spent trying to improve the existing tools, rather than introducing another set of bugs and security holes into the mix.
  • by Alien54 ( 180860 ) on Sunday November 26, 2000 @08:27PM (#599545) Journal
    with all of the variant site structures, never mind security issues, pay sites, and things like Microsoft constantly rebuilding/breaking its website, it is hard to see how it would give better results than any meta search across the common search engine sites with prebuilt indexing, etc.

    especially with the web running at well over a billion pages by now. Just think of the time to query a billion pages all around the planet, never mind on a small business line with, say, a DSL line (forget modem!)

    but then I don't get the big bucks for this either....

  • Erhm, perhaps you might want to visit the product info [webql.com], where it mentions:
    Server Component System Requirements: Linux
    The client runs on Windows, but the server is for linux.
  • by Anonymous Coward
    Posting goatse.cx links makes baby Anne Marie cry.
  • Not really. Actually, it doesn't have anything at all to do with .NET.
  • For my thesis, I created a Web query system called ParaSite.

    Thanks for the links. This looks more compelling than the WebQL product.

    The first example in the WebQL manual demonstrates a regex to extract phone numbers. One of your first examples appears to implement the Google technique. I find the latter more interesting.


  • Thunderstone [thunderstone.com] Texis... web spider, regex scraper and SQL-compliant RDBMS. Oh, and a _tad_ more maturity and market experience. Of course, the price tag is a bit steep...

    WebQL seems kinda like a friendlier name for an already-established market.

    PDHoss


    ======================================
  • Um, how about Unix? Basically, it was created by the open source community, though it wasn't called that back then.
  • Thanks for setting a bad example so that the rest of us can learn from you how not to behave.
  • SELECT sex.image, text.description
    FROM web_images AS sex, web_text AS text
    WHERE sex.primary_key = text.primary_key
    AND UPPER(text.description) LIKE '%NATALIE%'
    AND UPPER(text.description) LIKE '%PORTMAN%'
    AND UPPER(text.description) NOT LIKE '%GRITS%'

    Woo-hoo! Our sweet mother of Akamai accelerated download, don't fail me now!
  • If it were as simple as that, anti-spammed e-mail addresses could already be harvested using a Perl script.

    The phone number example works because all the phone numbers in the test data were formatted either identically or very similarly. Obfuscated e-mail addresses, though, are designed to be unpredictable. Regular expressions work wonderfully for finding text that matches a pattern, but there has to be a pattern. If you look at all the different means of de-spamming an address displayed on Slashdot, it's fairly apparent that no real pattern is being used.

    Look at these examples, and try to come up with a pattern that they all match and an algorithm that will turn them all into valid e-mail addresses.
    • atomic_pixie@com.hotmail
    • atomic_pixie at hotmail dot com
    • atomic_pixie NO hotmail SPAM com
    • com@atomic_pixie.hotmail - a@b.c becomes b@c.a
    It's pretty apparent that no computer algorithm is going to have much luck with this, and I wasn't even being especially creative. Actually, I wasn't being creative at all, since I just ripped those off from other people's addresses. :)
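    To make that concrete, here is roughly what a harvester's pattern looks like (a simplified address regex, nowhere near RFC-complete): it picks up a plain address cleanly, but the obfuscated forms above are either missed entirely or yield useless strings.

    use strict;

    # Simplified email pattern, typical of a harvester.
    my $email_re = qr/[\w.+-]+\@[\w-]+(?:\.[\w-]+)+/;

    for my $line ('mail me at atomic_pixie@hotmail.com',
                  'atomic_pixie at hotmail dot com',
                  'atomic_pixie NO hotmail SPAM com',
                  'com@atomic_pixie.hotmail - a@b.c becomes b@c.a') {
        if (my ($addr) = $line =~ /($email_re)/) {
            print "harvested: $addr\n";    # only the first line yields a usable address
        } else {
            print "no match:  $line\n";
        }
    }
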
  • Unix wasn't created by the OSS "community". Feel free to read up [everything2.com] on a cursory history of it.

    It's worth mentioning that BSD is a descendant, and Linux is a clone, of Unix. The UNIX source was available, but by no means could its license fit the Open Source definition. MANY people had illegal copies of the copyrighted code, however, which was likely one of the primary inspirations for Free/Open Source software later on.

    (end comment) */ }

  • Isn't the Web really a giant database without 'WebQL'???

    No.

    A proper database would have better indexing and control of data integrity. The web is mostly crap. If the web is a database, its DBA should be shot.
  • by fluxrad ( 125130 ) on Sunday November 26, 2000 @09:28PM (#599557)
    >drop table internet;
    OK, 135454265363565609860398636678346496
    rows affected.

    "oh fuck"


    FluX
    After 16 years, MTV has finally completed its deevolution into the shiny things network
  • sql == sequal. (or see-qwill for the phonically inclined).


    FluX
    After 16 years, MTV has finally completed its deevolution into the shiny things network
  • Define "Open Source definition" :) And AFAIK, legal version of the source code were initially given to universities almost for free (the commercial source licenses cot an arm and a leg though) which was one reason why it spread so quickly.
  • by Elgon ( 234306 )
    Interesting one, this. I note that it is based on MySQL, a lovely, wonderful, useful toy if ever there was one - may the contributors to it have fast ping times, high data transfer rates and few system crashes.

    I am curious to know exactly how it sorts through the data, though: does it refer to some kind of externally held central database server via the 'net which is continually updated? I fail to see how else such a system could be usefully maintained to an acceptable standard of accuracy.

    Elgon
  • Try this: http://www.opensource.org/osd.html [opensource.org]

    (end comment) */ }

  • Plus, parsing HTML to extract one little field of data is tricky, and highly dependant on the layout of the page.

    Besides the layout of the page, there is always the possibility of (big?) mistakes in the HTML code. I work for a big German newspaper company and know for a fact that even sites with big money have a lot of mistakes in their HTML pages (which are somehow corrected by Netscape/IE).

    So you are damn right. Until we have a standard for all websites, which will take a lot of time, forget it!!!

    --

  • by hemul ( 16309 ) on Sunday November 26, 2000 @10:53PM (#599563)
    WebQL looks like an interesting hack, but have a look at the semantic web project for people trying to do it properly.

    The Semantic Web Page [semanticweb.org] is a good starting point.
    TBL's personal notes [w3.org] is another one. Probably the best one, actually.

    "The Semantic Web" was a term coined by Tim Berners-Lee (we all know who that is, don't we?) to describe a www-like global knowledge base, which, when combined with some simple logic, forms a really interesting KR system. His thesis is that early hypertext systems died of too much structure limiting scalability, and current KR systems (like CYC) have largely failed for similar reasons. The Semantic Web is an attempt to do KR in a web-like way.

    This really could be the next major leap in the evolution of the web. Do yourself a favour and check it out. And it's not based on hacks for screen-scraping HTML, it's based on real KR infrastructure.

  • For simple techniques (without learning or any kind of intelligence), such as regular expressions for extracting or labelling content from web pages, you can't expect good coverage across pages written with all kinds of templates and with so many types of errors.

    Right now I'm writing a Java program to extract links from Google search results (easy, don't shoot! Academic use only). What I'm using is OROMatcher [savarese.org], one of the best regular expression packages for Java. I'd say it's still mission impossible to get 100% recall and be error-free even for this simple task.

    The formal name of such a program (labelling and extracting content) is a "wrapper". Probably the only way to improve the efficiency of a wrapper is to apply machine learning techniques. A well-trained wrapper program with a good learning algorithm could be smart enough to adapt to HTML coding formats with small variances. A good example is in this paper [washington.edu].
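    For what it's worth, the Perl flavour of the same kind of wrapper (a sketch, with a made-up saved-results file) is just as brittle; this already misses single-quoted, unquoted, and script-generated links unless the pattern keeps growing:

    use strict;

    # Read a saved results page and pull out links with a regex -
    # the same class of wrapper as the Java/ORO version, and just as fragile.
    open my $fh, '<', 'results.html' or die "can't open results.html: $!\n";
    my $html = do { local $/; <$fh> };
    close $fh;

    # Only handles double-quoted href attributes.
    while ($html =~ /<a\s[^>]*href="([^"]+)"/gi) {
        print "$1\n";
    }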
  • Yeah, good point, but I'm too lazy to type: select lastname, firstname, email, title, job, age, location, country, etc from ...

  • just think... if the commercial licenses had been cheaper perhaps we wouldn't be seeing all these windows boxen today, as more things would have been developed on and for unix. then again, maybe not. *shrug*

    eudas
  • Select * would be meaningless in this context, so you couldn't use it anyway. The data at the URLs is not in columns, so you have to select a comma-separated list of functions which operate on the data. Dur.
  • actually a linux version exists and it will be available soon

    heh :)
  • Some people seem to have trouble grasping the abilities of WebQL, and that is OK. Not everyone wants to see or know where they are going, nor do they want help to get there.
  • WebQL is a data extraction/data aggregation/web crawling/data mining tool.
  • ...You might wind up on the end of a Caesius and Desist order.

    See the unbroken link here [zdnet.com] for proof that this is on-topic and funny.

  • The WebQL site info reads:

    • Market Research
    • Aggregate Information of any Kind
    • Develop Targeted Contact Lists

    Sounds like nothing but a spam e-mail address collector to me.

    ...and, it's not free. So forget it.

  • I agree this is an excellent idea. Personally, I enjoy working both with on-line stuff and with databases (although my grades in both DB courses I've taken while at school were among my lowest).

    However, a proprietary piece of software - sold for $450 - is not the best way to surface an excellent idea. What we need is a protocol: a common query language for searching the web that will be easily supported by today's available search engines. Something like this would enable programmers to easily interface their programs with web search engines (which I guess is a good thing).

    Also, if their manual is correct, no inserts, updates or deletes are allowed. A carefully drafted protocol like the one mentioned above should support all these, e.g. for adding documents into search engines, removing deleted web sites, coping with new URLs and so on.

    Imagine:

    delete *
    from Yahoo
    where errcode = 404

    update Yahoo
    set url = redirected_url
    where redirection = True

  • Put the pages through a normalisation stage first - e.g. the HTML Tidy utility at
    http://www.w3.org/People/Raggett/tidy/
    Unless what you are searching for is broken HTML, your life will be improved by this step...

    BTW, using a regular expression matcher to pull out information from HTML is not the smartest idea. You should use a parser to do the job. I can see why you would do what you've done - e.g. the HTML doesn't parse, and you don't want to guess all the tricks that MS/NS use to fix luser code - but still, you're better off passing the HTML through a tidying step, then using a proper parser. It's not like you can't get HTML parser code for free these days. Since you use Java, look at javax.swing.text.html.parser.

    -Baz
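    In the same spirit, on the Perl side (a sketch, assuming the HTML::TreeBuilder module from the HTML-Tree distribution and a hypothetical tidied file), a real parser instead of a regex:

    use strict;
    use HTML::TreeBuilder;    # builds a parse tree even from fairly sloppy HTML

    # Hypothetical page that has already been through HTML Tidy.
    my $tree = HTML::TreeBuilder->new_from_file('tidied.html');

    # Walk the tree instead of pattern-matching the raw text:
    # here, every link's URL and visible text.
    for my $a ($tree->look_down(_tag => 'a')) {
        printf "%s\t%s\n", $a->attr('href') || '', $a->as_text;
    }

    $tree->delete;            # free the tree when done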
  • IMHO, 99% of all Web/Internet users post their plain email address at one time or another. That provides sufficient volume for the spammers. I highly doubt that they care about the obfuscated addresses, especially because a person who obfuscates her email address is much more likely (in terms of probability) to be a person who reports spammers to their ISPs repeatedly.

    --

  • and this is different from RDF how? an RDF inference engine (together with agreed metadata conventions, like those being worked on by dublin core [purl.org]) would provide the basis for queries against metadata on the web, and that, in my opinion, is what's important for the future evolution of the web.

    if you haven't looked into RDF and the importance of metadata on the web, there's no time like the present [w3.org].

    it wouldn't hurt to read weaving the web [w3.org], by tim berners-lee [w3.org], the inventor of the world-wide web, either. he has chapter 1 online.

  • I am appalled by the large number of posts I have read already bashing this thing. Did you guys just read the news article? If that is all you did, shame on you. Go to the site and download the manual: http://www.webql.com/webqlmanual.zip (sorry, I don't create clickable links, cut and paste it in).

    Anyway, this is a nice idea. I once wanted to gain an edge on eBay back when I was addicted, so I wrote a program to let me query eBay. With my program I can query all ended auctions, find out which items were in demand by the number of bids, and which items sold the most; using such knowledge, I can try to find such items and sell them on eBay. Using such a program, you can also query all ended auctions, find out which auctions are not in demand, and then see if there is anything you could use from those auctions.

    What I am pondering, though, is whether someone will soon make an open source implementation, and if so, will that be fair? I mean, if I started a company with a neat idea, and 3 months later someone cranked out an open source version of my product, I'd be heartbroken. Ah well... :)

  • What we really need is a standard protocol for passing hashes to a server and receiving a hash in reply. Then all sorts of different platforms and programming languages could create a wrapper to that interface. So in Perl you could call
    %returnhash = sendhash('domain.com','port',%hash);
    or something like that.

    I've written an interface to CyberCash and to the Tucows OpenSRS system, and they implement this in entirely different ways that both require installing their own Perl libraries and learning their own syntax. They could easily have been implemented in one standard way, though. There just doesn't seem to be one for everybody to use.

    Basically this seems to be what MS is trying to come up with in .NET. It doesn't seem particularly difficult, though. Really, that's all we need: to be able to pass a hash to a domain:port and to get a hash back (probably there'd be a standard status field... 200, 404, etc., but the rest would be whatever fields are defined by their API). Services like MapQuest could return a JPEG when you pass it an address in your hash. Services like Slashdot could return an article (or all articles, or whatever). Services like E*Trade could return stock prices. We can get all this information already, but with this standard interface there'd be no parsing HTML or crazy hacks like WebQL!

    Does anyone know if there's anything like this already? And if so, why nobody uses it? And if not, ideas on how to get it out there?
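    As a very rough sketch of the idea (everything here is hypothetical: the sendhash() helper, the host, the port, and the field names), one way to fake it today is an HTTP POST of form data via LWP::UserAgent, with the reply read back as key=value lines:

    use strict;
    use LWP::UserAgent;

    # Hypothetical helper: POST a hash as form data, get a hash back encoded
    # as "key=value" lines. A real protocol would need types, nesting and a
    # status convention, but this is the general shape.
    sub sendhash {
        my ($host, $port, %hash) = @_;
        my $ua  = LWP::UserAgent->new;
        my $res = $ua->post("http://$host:$port/", \%hash);
        die "request failed: ", $res->status_line, "\n" unless $res->is_success;
        return map { split /=/, $_, 2 } split /\n/, $res->content;
    }

    # Usage, mirroring the %returnhash = sendhash(...) call above.
    my %returnhash = sendhash('domain.com', 8080, symbol => 'RHAT', action => 'quote');
    print "$_ => $returnhash{$_}\n" for keys %returnhash;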

  • Actually, I don't think you need standard DTDs, as long as each site sticks with its own DTD long enough to make it worthwhile. At that point, you can build a custom XSL for each site to transform their content to conform to your DTD, and you can parse that.


    --
  • I'm sure you're right, but the original poster was concerned about anti-spammed e-mail addresses.
  • This is really not a new idea. webMethods, Inc. [webmethods.com] submitted the Web Interface Definition Language [w3.org] to the W3C [w3c.org] back in 1997.

    There's a chapter or two written by Charles Allen about WIDL in the XML Handbook (Goldfarb, et al).

    But it's a technology that is dated now -- webMethods has moved on to B2B [b2b.com], and anyone who is jumping up and down about screen scraping in 2000 is just a little bit behind the times.

    --brian

  • WebQL is an extremely powerful data extraction crawler for the masses; how can you even compare WebQL to webMethods?
  • Don't you feel slutted now
  • "quietbit", you are all over this page writing little snide comments followed by ":)" , saying that WebQL is "obviously" so good and everything else is "obviously" shit, and that you edited their webpage, and so on.

    Funny how you were silent when the guy who wrote an open source Web SQL thing posted his work earlier.
  • I bet you use UNIX. Here's my proof:

    Even if you could delete one row per picosecond, this would take more than 10^14 years to accomplish.
    Therefore your OS would still have to be up so you could see the result.

  • I'm noticing some recurring confusion in the comments about this product.
    1. Isn't this the same as a search engine? No, a search engine stores indexing information in a central repository and makes it available to users via a web interface. This product is more of a web client library, like Perl's LWP, intended to make HTTP requests to non-cooperating sites and organize the resulting hodge-podge of HTML into usable data.
    2. Isn't "screen scraping" kind of lame when we could use XML/RDF/semantic markup/other buzzword? No. The problem is that you, the data gatherer, do not control the web sites from which you're gathering data. You might wish that greedy.com made their data available in a friendly format, but what incentive do they have to do this? They're interested in getting eyeballs glued to their site, not in feeding your big data harvester. In fact, when high bandwidth becomes widely available, maybe commercial sites will start delivering whole pages as dynamically generated GIFs. There's a fundamental conflict of interest between the commercial web publisher and the data consumer. The publisher wants to dilute his teaspoon of info in a bucket of glitz and junk. The user wants to refine the messy end-product and extract the data, whether he uses his brain or software to do it.
  • I won't bash open source, but open source is not always equivalent to or better than commercial products.

    People will come to their own conclusions, and it appears many people posting here have ignorantly predetermined their stance on WebQL. So many are so far above the use of WebQL, their vision possibly narrowed by the specific, widely accepted applications that can do only a portion of what WebQL is capable of. It is amazing how people don't want to change.

    That's fine with me; don't change and you will fall behind.

    IMHO, WebQL is THE end-all data extraction/web crawling/data mining tool for the masses.

    You don't have to use WebQL, just like Windows users don't have to use Linux.
  • A while back, I posted an article on an alternative to Tim Berners-Lee's Semantic Web [geocities.com] based on the aspect of Bertrand Russell's work that Russell thought was his most under-rated achievement: Relation Arithmetic.

    Here is the intro:

    The future of the Internet is in what I call "rational programming", derived from a revival of Bertrand Russell [gac.edu]'s Relation Arithmetic [geocities.com]. Rational programming is a classically applicable branch of relation arithmetic's subtheory of quantum software (as opposed to the hardware-oriented technology of quantum computing [rdrop.com]). By classically applicable I mean it applies to conventional computing systems -- not just quantum information systems. Rational programming will subsume what Tim Berners-Lee [ruku.com] calls the semantic web [w3.org]. The basic problem Tim (and just about everyone back through Bertrand Russell) fails to perceive is that logic is irrational. John McCarthy's [stanford.edu] signature line says it all about this kind of approach: "He who refuses to do arithmetic is doomed to talk nonsense."

  • Alright I will reply to this.

    I have a question: how are you supposed to use this? It's as difficult as NQL.

    You don't have to be a programmer to grab a script from the script repository [webql.com] and plug it into WebQL.

    How is it that WebSQL is so good and so not used? WebQL has been used by several companies to gather more information than could be handled. If WebSQL or any of the other virtual "Web query systems" are so good and so free, why aren't they accepted as a solution? Maybe they are better because they are free. It must be the price tag... So it has nothing to do with which is the better technology; it has to do with which one is cheaper. We're on the same wavelength now.
  • and see also : http://www.cs.wisc.edu/niagara/
  • Unix was created at Bell Labs, for sale, by people who at first did it more for sport than as work. The source was readily available, very cheaply for academic institutions. Those are the facts. What's your problem?
  • Compared with a Perl script, the interface is simple. :)
  • (This comment is late, so you may not even read it, but...)

    You can't be serious. Your blanket statements concerning your stance on WebQL either mean you worked on the project or you're really, really sheltered from regular expressions.

    WebQL is nothing more than a regexp wrapper. How do I know this? I've been working to implement the WebQL language in PHP this week. Here [207.244.81.196]. It took me three days to replicate about half of the language. I would have assumed that this "amazing" product was a bit tougher to implement.

    Saying that WebQL is the "end all data mining tool" is seriously short-sighted on your part. The WebQL language is limited, hacked together, and totally reliant on the data being in HTML format. If you were serious about not "falling behind", you'd embrace a more abstract tool, one that deals with metadata. But of course, I forgot; you're a troll.

    ______________
