WebQL Turns the Web Into A Giant Database
An anonymous reader says "This article was posted on ZDNet by Bill Machrone about a new type of query language for aggregating information from the Web." Somewhat light on the details, but definitely something to think about.
but how do you pronounce it? (Score:1)
--
A mind is a terrible thing to taste.
System Requirements (Score:1)
6GB HD suggested (10MB required to install)
Must use 5.99 GB of virtual memory I guess.
Ingenius! (Score:1)
I have my doubts (Score:2)
Remember, remote content is not under your control. It will change (often) and is very very likely to not have a nice structure, and is even more likely to contain mismatched tags and other errors.
OK, it's in its infancy, but IMHO if/when XHTML is widely adopted, a special query language or tool will be largely irrelevant, because most of what is alleged in that brief article could be done in the magical, wonderful world of XML.
How much better could this be in terms of PR (Score:1)
A company comes out with a product that is at least fairly cool. Even if it does nothing that it says it will, it will at least turn some heads and get some attention. So at its worst, this is AMAZING PR for Linux in general. Can't you just picture it now:
MS Droid: It's good, we don't have anything like it, let's buy the company.
Tech Dude: Uhh, sir, this runs on Linux. We *CAN'T* acquire them.
Beyond that, this reaffirms the idea that Linux is a valid operating system for the enterprise, and MySQL a valid database solution. Even though I know this, and you probably know this, what matters is that the VPs and VCs know this. My company is forced to use Oracle because our clients "don't trust PostgreSQL or MySQL."
So the way I see it, we win!
--Alex the Fishman, (very proud fishman)
Looks like a SPAMmer's dream (Score:1)
WebQL sounds good but... (Score:1)
Isn't the Web really a giant database without 'WebQL'??? If the web isn't already a database of sorts, then what is it???
I like it. (Score:1)
What's great is that this program could, if scaled down into a personal edition, make the Internet much more accessible and usable for novices. An acquaintance once asked me to show him how to do basically the same thing: he wanted a macro to automatically import his stock portfolio information from a web page into a spreadsheet. He was kind of spoiled by a simple TRS-80 BASIC program he had used in the heyday of computing, which automatically dialed up CompuServe (or some other online service, I don't recall which) and downloaded and parsed stock information from one of the forums. I had to tell him that there's no simple way to parse HTML like that, but it would have been much better if I could have pointed him to a personal version of WebQL for exactly this. Just imagine if Web sites offered ready-made WebQL scripts for their portals, for users to download and use with their favorite spreadsheet or database.
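For what it's worth, a few lines of Perl get most of the way there even now. A minimal sketch, assuming LWP::Simple is installed; the URL and the "Last:" markup are made up for illustration:

    #!/usr/bin/perl -w
    # Sketch: fetch a quote page and pull out one price (hypothetical URL and markup).
    use strict;
    use LWP::Simple;

    my $html = get("http://quotes.example.com/q?sym=RHAT")
        or die "couldn't fetch page\n";
    # Assumes the price appears as something like "Last: 12.38" in the HTML.
    if ($html =~ /Last:\s*([\d.]+)/) {
        print "RHAT,$1\n";    # a CSV line, ready for a spreadsheet import
    }

The catch, of course, is that the regex breaks the day the site redesigns - which is exactly why he wanted a supported tool instead.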
Not to mention what I could do with such a fun toy. :) If they make a personal edition, and it works as advertised, I'll buy a copy for sure.
Databases, web, syndication (Score:2)
The breakthrough is noticing that almost everyone (everyone significant, anyway) has a web frontend to their database. If you can just go in via that web frontend, you don't have to go directly to the database, and you can bypass all the above issues!
The first company I know of that does this is an Israeli company, Orsus.com [orsus.com]. Since then, OnePage.com [onepage.com] has also done it.
What they (at least Orsus) did was build a language (based on XML) that instructs a web-spidering engine able to parse HTML and JavaScript. A GUI IDE (no less) lets the lay person write the XML-based code.
I wonder if... (Score:1)
An excerpt - I'm too tired to find the link...
THE COPYRIGHT CONUNDRUM
Another problem is with copyrights and other protections of intellectual property. As we have learned from the recent Napster battles, this can be a real sticky wicket on the Net, where users can so easily and freely trade files regardless of any such protection. Because the current system is intentionally oblivious to what's in those Internet packets being transferred, there's no easy way to protect copyrighted data.

With all that in mind, Kahn decided it was necessary to develop a new framework for dealing with all the information on the Internet, one that would act as a layer above the existing infrastructure but deal with the "what" as much as the "where." So through the 90s, while most of the world was just discovering the Internet, Kahn was working on how to reinvent it.

His new system is called the "Handle System." Instead of identifying the place a file is going to or coming from, it assigns an identifier called a "handle" to the information itself, called a "digital object." A digital object is anything that can be stored on a computer: a web page, a music file, a video file, a book or chapter of a book, your dental x-rays - you name it. Similar to the way a host name is resolved to an IP address, the handle will be resolved into information the computer needs to know about the object. Only now that the information is about the object, the object's location is just one of the important bits. The handle record will also tell the computer things like what kind of file the object is, how often it will be updated, and how the object is allowed to be used - whether there are any copyright or privacy protections. The record can also hold any industry-specific information about the object, like a book's International Standard Book Number (ISBN) code.

There are two other crucial things about these records: First, each handle can have multiple records associated with it, allowing multiple copies of the same information to be stored on different servers or for different systems. Second, the handle record is updated by the owner of the information - something in stark contrast to the host-name data, which is updated by central repository companies like Network Solutions. This will make things like changing the location of a file much more seamless, rather than waiting days for a new IP address to propagate through the DNS.
Wouldn't merging the querying features with the above "Handle System" seem a wise thing to do? Maybe that's what it already does...
Hey Taco (Score:2)
If you have different developers working on the front end and the database, this will really make them hate each other. It also makes the query optimizer work harder than it needs to (the amount of CPU wasted this way is totally insignificant, but it's bad form anyway).
Also, if you're going to run select * from internet without a where clause, be prepared for an extremely long-running query.
--Shoeboy
Re:System Requirements (Score:1)
Re:Ingenius! (Score:3)
Good god! An "Al Gore invented the Internet" joke combined with a "stupid patent idea" joke! The originality of the average slashbot never ceases to amaze me! You should send some of your jokes to Illiad so he can put them in User Friendly!
--Shoeboy
Finally, the tool I've needed for so many years! (Score:1)
Yeah, okay. (Score:4)
________________________________________
Re:System Requirements (Score:1)
Re:How much better could this be in terms of PR (Score:1)
MS Droid: It's good, we don't have anything like it, let's buy the company.
Tech Dude: Uhh, sir, this runs on Linux. We *CAN'T* acquire them.
Unfortunately companies whose products run on Linux can be bought in exactly the same way as other companies. Just ask Cobalt.
Those products can be ported to other operating systems too, even Windows. (Shocking, but not every port is a game port to Linux)
Re:FreeQL ? (Score:1)
Re:but how do you pronounce it? (Score:2)
SQL - squeal
Re:How much better could this be in terms of PR (Score:1)
Re:FreeQL ? (Score:2)
Sound a bit like NQL? (Score:1)
I dunno, just seems to do the same stuff...
just a pretty interface (Score:4)
I downloaded the WebQL Business Edition manual. Here's an abbreviated version of the first example query:

select
text("(\(206\)\s+\d{3}-\d{4})","","","T")
from
http://foo/bar.html
where
approach=sequence("1","10","1","XX")

The select clause accepts a variety of functions, of which text() seems to be the most useful. You can see that the first argument is a regex designed to match phone numbers. The from clause is a URL. The where clause primarily takes the "approach" descriptor, which can crawl or guess new URLs. So basically, it doesn't do anything a Perl script can't. It just presents a simpler interface.
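To illustrate, here is roughly the same thing in Perl. I'm guessing that sequence("1","10","1","XX") means "substitute 1 through 10 into the URL," so treat this as a sketch:

    #!/usr/bin/perl -w
    # Sketch: the same phone-number extraction as the WebQL example above.
    use strict;
    use LWP::Simple;

    for my $n (1 .. 10) {                  # guessed meaning of sequence("1","10","1","XX")
        my $html = get("http://foo/bar$n.html") or next;   # guessed URL scheme
        # Same regex as the text() call: (206) phone numbers.
        while ($html =~ /(\(206\)\s+\d{3}-\d{4})/g) {
            print "$1\n";
        }
    }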
Revolutionary? I think not. (Score:1)
What this reminds me of is those Wolf programs from a couple of years back. They made FTPWolf, WebWolf, MP3Wolf, WarezWolf and some others. They were just little client spiders that scoured web pages for keywords and followed links. They started with search engines, followed the results to other pages, then followed those pages, and on and on. Nothing that hasn't been done before.
The part that would make it useful, and which they claim to do, is comparative searches: e.g., show me all the latest P4 prices by vendor, or the differences between these 3 drills. They mention on the products page that anyone familiar with Perl and HTML can use it in no time. I would think anyone familiar with Perl and the LWP or Net::FTP modules could create this system in no time. They say there are some wizards, but when have wizards ever been able to do what a user really wants?
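To put my money where my mouth is, here's the skeleton of such a spider in Perl with LWP. It's only a sketch: no retries, no politeness delays, no robots.txt handling, all of which a real one would need.

    #!/usr/bin/perl -w
    # Sketch: breadth-first spider that greps pages for a keyword.
    use strict;
    use LWP::UserAgent;

    my $ua    = LWP::UserAgent->new(timeout => 10);
    my @queue = ('http://www.example.com/');   # seed; a real run would start from search results
    my %seen;

    while (my $url = shift @queue) {
        next if $seen{$url}++;
        my $res = $ua->get($url);
        next unless $res->is_success;
        my $html = $res->content;
        print "$url\n" if $html =~ /Pentium\s*4/i;    # the "P4 prices" case, crudely
        # Naive link extraction; see the parser discussion elsewhere in this thread.
        push @queue, $html =~ /href="(http:\/\/[^"]+)"/gi;
        last if keys(%seen) > 100;                    # don't actually crawl the whole web
    }

The comparative-search part - normalizing "P4 1.4GHz" against "Pentium4/1400" across vendors - is the hard 90%, and no wizard is going to do that for you.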
Re:Sound a bit like NQL? (Score:1)
Re:How much better could this be in terms of PR (Score:1)
I don't have much experience with PostgreSQL, but MySQL, while an excellent database solution, simply is not fully comparable to Oracle. I don't doubt that many companies use an expensive proprietary solution where an open source one would suffice, but there are applications that simply require a commercial system such as Oracle. One of the difficulties is that MySQL doesn't scale as well. For smaller solutions it may be ideal, but for a massive database that must accommodate hundreds of simultaneous users, MySQL's locking limitations become both apparent and debilitating.
Personally, I look forward to the time when open source software makes commercial database applications obsolete, but unfortunately, that time hasn't come yet.
simple interface? (Score:2)
text("(\(206\)\s+\d{3}-\d{4})","","","T")
from
http://foo/bar.html
where
approach=sequence("1","10","1","XX")
I wouldn't exactly call that a simple interface!
QUICK! Someone mod this up to 5:Funny! (Score:1)
Re:I like it. (Score:1)
You mean exactly like RSS? It's XML-based, and numerous web sites (including Slashdot and Freshmeat) use it. That's how Slashdot's Slashboxes work. You could set it up like so: a user creates an account at foo.com and creates a portfolio (say, when buying stocks, or to get news updates on those companies). The website gives you the option of automatically regenerated RSS summaries of your portfolio and the latest prices (as well as RSS of market indices, currencies/commodities, etc.). You can use any scripting language with an XML parser (i.e., basically any of them) to process the XML and present it to the user. And all this without needing to buy a $499 software product, or use 6GB of local disk space.
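For instance, a minimal sketch with Perl's XML::RSS module (the feed URL and item format are hypothetical):

    #!/usr/bin/perl -w
    # Sketch: pull a portfolio RSS feed and dump it as CSV for a spreadsheet.
    use strict;
    use LWP::Simple;
    use XML::RSS;

    my $xml = get("http://foo.com/rss/portfolio?user=bob")
        or die "couldn't fetch feed\n";
    my $rss = XML::RSS->new;
    $rss->parse($xml);
    for my $item (@{ $rss->{items} }) {
        print "$item->{title},$item->{link}\n";   # e.g. "RHAT 12.38,http://foo.com/q/RHAT"
    }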
From what I've heard, WebQL doesn't sound that innovative (if it even works at all). But since there was no technical info in the article, it's hard to tell.
Freely-Available Web Query Languages (Score:5)
For my thesis [mills.edu], I created a Web query system called ParaSite. The best introduction is the paper Squeal: A Structured Query Language for the Web [www9.org], which I presented at the World-Wide Web Conference [www9.org]. Anybody is welcome to use my code, algorithms, or ideas.
See also WebSQL [toronto.edu] and W3QL [technion.ac.il], which also come from academia.
Here (Score:2)
meow [zdnet.com]
Differences between this and Google, for instance, abound. Central among them is the fact that it's not limited to URLs.
"Version 1.0 of WebQL uses a wizard to simplify writing queries, but only users with SQL experience will be able to create useful queries. (Ordinary-language queries will be supported in future versions.) The wizard lets you select whether to return text, URLS, table rows or columns, or any combination thereof. You can then specify to search for text, regular expressions, or table cells, and you can add refinements such as case sensitivity and the number of matches returned per page."
I will buy this when it supports ordinary language queries.
Through its access to directories, will this thing allow you to bypass registrations on all sites? Pay sites?
How about an image search? (Since people don't name their files informatively all the time...)
Do you think this is (Score:1)
on smarter search engines (Score:1)
I don't see any indication that Caesius intends to start such a search engine. WebQL is just a web crawler.
If someone did, the primary defense would be fair use. Some search engines already display an abstract in the search results. On the other hand, I think eBay won a case against (or bullied into submission) a site that crawled their auctions. U.S. courts don't seem to like deep linking, let alone data extraction. Something about the God-given right to banner ad impressions. Next thing you know, a U.S. Marshal will break down my door because I'm using the Internet Junkbuster proxy [junkbusters.com]. I did post anonymously, right?
One man's ceiling ... (Score:1)
Like they say about Perl: it makes the easy things easy, and the hard things possible. The average SQL user could probably learn WebQL syntax, although regexes can be complicated. (I didn't realize how complicated until I read Mastering Regular Expressions [oreilly.com].) On the other hand, writing a web crawler in Perl may be beyond his reach.
That said, it wouldn't take much work to cook up a little language like this to wrap a Perl web crawler. I certainly wouldn't pay $500 for this proprietary package.
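Something like this, say - a toy "little language" that handles select-a-regex-from-a-URL and nothing else, just to show how thin the wrapper is:

    #!/usr/bin/perl -w
    # Sketch: a ten-line "WebQL lite" -- select <regex> from <url>.
    use strict;
    use LWP::Simple;

    my $query = join(' ', @ARGV);   # e.g.: select (\d{3}-\d{4}) from http://foo/bar.html
    my ($pattern, $url) = $query =~ /^select\s+(.+?)\s+from\s+(\S+)$/i
        or die "usage: $0 select <regex> from <url>\n";

    my $html = get($url) or die "couldn't fetch $url\n";
    while ($html =~ /($pattern)/g) {
        print "$1\n";
    }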
yet another gay web language (Score:1)
I don't know about this (Score:3)
especially with the web running at well over a billion pages by now. Just think of the time it would take to query a billion pages all around the planet, never mind doing it over a small business line, say DSL (forget a modem!).
But then, I don't get the big bucks for this either....
Re:How much better could this be in terms of PR (Score:2)
Re:phurst! (Score:1)
Re:Do you think this is (Score:1)
+1: Interesente (Score:1)
Thanks for the links. This looks more compelling than the WebQL product.
The first example in the WebQL manual demonstrates a regex to extract phone numbers. One of your first examples appears to implement the Google technique. I find the latter more interesting.
Been there, done that. (Score:1)
Thunderstone [thunderstone.com] Texis... web spider, regex scraper and SQL-compliant RDBMS. Oh, and a _tad_ more maturity and market experience. Of course, the price tag is a bit steep...
WebQL seems kinda like a friendlier name for an already-established market.
PDHoss
======================================
Re:FreeQL ? (Score:1)
Re:Freely-Available Web Query Languages (Score:1)
I can see it now.... (Score:1)
FROM web_images AS sex, web_text AS text
WHERE sex.primary_key = text.primary_key
AND text.description LIKE UPPER('%NATALIE%')
AND text.description LIKE UPPER('%PORTMAN%')
AND text.description NOT LIKE UPPER('%GRITS%')
Woo-hoo! Our sweet mother of Akamai accelerated download, don't fail me now!
Re:Looks like a SPAMmer's dream (Score:1)
The phone number example works because all the phone numbers in the test data were formatted either identically or very similarly. Obfuscated e-mail addresses, though, are designed to be unpredictable. Regular expressions work wonderfully for finding text that matches a pattern, but there has to be a pattern. If you look at all the different means of de-spamming an address displayed on Slashdot, it's fairly apparent that there is no real pattern.
Look at these examples, and try to come up with a pattern that they all match and an algorithm that will turn them all into valid e-mail addresses.
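To make the point concrete, here's a rough normalizer that handles two common disguises and is trivially defeated by a third (the addresses are made up):

    #!/usr/bin/perl -w
    # Sketch: de-obfuscate a couple of common address disguises.
    use strict;

    for my $addr ('bob NO at SPAM example dot com',
                  'bobREMOVETHIS@example.com',
                  'bob@example, but lose the com') {   # this one defeats the script
        my $clean = $addr;
        $clean =~ s/\s+at\s+/@/i;                      # "at" -> @
        $clean =~ s/\s+dot\s+/./gi;                    # "dot" -> .
        $clean =~ s/(NO|SPAM|REMOVETHIS)//g;           # strip the shouted decoys
        $clean =~ s/\s+//g;
        print "$addr  =>  $clean\n";
    }

Every new rule you add breaks some legitimate address that happens to contain the decoy string, and the obfuscators only have to be creative once.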
Re:FreeQL ? (Score:1)
It's worth mentioning that BSD is a descendant of Unix, and Linux is a clone of it. The UNIX source was available, but by no means could its license fit the Open Source definition. MANY people had illegal copies of the copyrighted code, however, which was likely one of the primary inspirations for Free/Open Source software later on.
(end comment) */ }
Re:WebQL sounds good but... (Score:1)
No.
A proper database would have better indexing and control of data integrity. The web is mostly crap. If the web is a database, its DBA should be shot.
goof? (Score:5)
OK, 135454265363565609860398636678346496
rows affected.
"oh fuck"
FluX
After 16 years, MTV has finally completed its deevolution into the shiny things network
Re:but how do you pronounce it? (Score:1)
FluX
After 16 years, MTV has finally completed its deevolution into the shiny things network
Re:FreeQL ? (Score:1)
Hmmm. (Score:1)
I am curious to know exactly how it sorts through the data, though: does it refer to some kind of externally held central database server via the 'net which is continually updated? I fail to see how else such a system could be usefully maintained to an acceptable standard of accuracy.
Elgon
Re:FreeQL ? (Score:1)
(end comment) */ }
Re:Yeah, okay. (Score:1)
Plus, parsing HTML to extract one little field of data is tricky, and highly dependent on the layout of the page.
Besides the layout of the page, there is always the possibility of (big?) mistakes in the HTML code. I work for a big German newspaper company and know for a fact that even sites with big money behind them have a lot of mistakes in their HTML pages (which are somehow corrected by Netscape/IE).
So you are damn right. Until we have a standard for all websites, which will take a lot of time, forget it!!!
--
The Semantic Web (Score:3)
The Semantic Web Page [semanticweb.org] is a good starting point.
TBL's personal notes [w3.org] is another one. Probably the best one, actually.
"The Semantic Web" is a term coined by Tim Berners-Lee (we all know who that is, don't we?) to describe a www-like global knowledge base which, when combined with some simple logic, forms a really interesting KR system. His thesis is that early hypertext systems died of too much structure limiting scalability, and that current KR systems (like CYC) have largely failed for similar reasons. The Semantic Web is an attempt to do KR in a web-like way.
This really could be the next major leap in the evolution of the web. Do yourself a favour and check it out. And it's not based on hacks for screen-scraping HTML, it's based on real KR infrastructure.
Parsing Information from the web page (Score:2)
Right now I'm writing a Java program to extract links from Google search results (easy, don't shoot! Academic use only). What I'm using is OROMatcher [savarese.org], one of the best regular expression packages for Java. I'd say it's still mission impossible to get 100% recall and be error-free even for this simple task.
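To show what I mean, here is the regex approach sketched in Perl (the real Java version with OROMatcher is no prettier):

    #!/usr/bin/perl -w
    # Sketch: regex-based link extraction -- the fragile "wrapper" approach.
    use strict;
    use LWP::Simple;

    my $html = get('http://www.google.com/search?q=webql') or die "fetch failed\n";
    # Misses single-quoted and unquoted hrefs, relative URLs, javascript: links...
    while ($html =~ /<a\s+[^>]*href="([^"]+)"/gi) {
        print "$1\n";
    }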
The formal name for such a program (labelling and extracting content) is a "wrapper". Probably the only way to improve the accuracy of a wrapper is to apply machine learning techniques. A well-trained wrapper program with a good learning algorithm could be smart enough to adapt to HTML coding formats with small variances. A good example is in this paper [washington.edu].
Re:Hey Taco (Score:1)
Re:FreeQL ? (Score:1)
eudas
Re:Hey Taco (Score:1)
Re:How much better could this be in terms of PR (Score:1)
heh
Re:Freely-Available Web Query Languages (Score:1)
Re:What is it? (Score:1)
Re:Watch out (Score:1)
See the unbroken link here [zdnet.com] for proof that this is on-topic and funny.
Spam? (Score:2)
The WebQL site info reads:
Sounds like nothing but a spam e-mail address collector to me.
Excellent, but... (Score:2)
However, a proprietary piece of software sold for $450 is not the best way to surface an excellent idea. What we need is a protocol: a common query language for searching the web that can be easily supported by today's search engines. Something like this would enable programmers to easily interface their programs with web search engines (which I guess is a good thing).
Also, if their manual is correct, no inserts, updates or deletes are allowed. A carefully drafted protocol like the one mentioned above should support all of these, e.g. for adding documents to search engines, removing deleted web sites, coping with new URLs and so on.
Imagine:
delete *
from Yahoo
where errcode = 404
update Yahoo
set url = redirected_url
where redirecton = True
Re:Parsing Information from the web page (Score:2)
Pass the pages through HTML Tidy first:
http://www.w3.org/People/Raggett/tidy/
Unless what you are searching for is broken HTML, your life will be improved by this step...
BTW, using a regular expression matcher to pull information out of HTML is not the smartest idea. You should use a parser to do the job. I can see why you did what you've done - e.g. the HTML doesn't parse, and you don't want to guess all the tricks that MS/NS use to fix luser code - but you're still better off passing the HTML through a tidying step and then using a proper parser. It's not like you can't get HTML parser code for free these days. Since you use Java, look at javax.swing.text.html.parser.
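For example, in Perl (HTML::TokeParser here, but the Java parser classes work the same way), link extraction through a real parser is just:

    #!/usr/bin/perl -w
    # Sketch: the same link extraction, but through an actual HTML parser.
    use strict;
    use LWP::Simple;
    use HTML::TokeParser;

    my $html = get('http://www.google.com/search?q=webql') or die "fetch failed\n";
    my $p = HTML::TokeParser->new(\$html);
    while (my $tag = $p->get_tag('a')) {
        my $href = $tag->[1]{href};    # attributes survive quoting and case variations
        print "$href\n" if defined $href;
    }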
-Baz
Re:Looks like a SPAMmer's dream (Score:2)
--
RDF (Score:1)
if you haven't looked into RDF and the importance of metadata on the web, there's no time like the present [w3.org].
it wouldn't hurt to read weaving the web [w3.org], by tim berners-lee [w3.org], the inventor of the world-wide web, either. he has chapter 1 online.
Whoa, take it easy peeps (Score:2)
What I'm pondering, though, is whether someone will soon make an open source implementation, and if so, whether that would be fair. I mean, if I started a company with a neat idea, and 3 months later someone cranked out an open source version of my product, I'd be heartbroken. Ah well...
What we need. (Score:1)
%returnhash = sendhash('domain.com','port',%hash);
or something like that.
I've written an interface to CyberCash and to the Tucows OpenSRS system, and they implement this in entirely different ways, both of which require installing their own Perl libraries and learning their own syntax. They could easily have been implemented in one standard way, though. There just doesn't seem to be one for everybody to use.
Basically this seems to be what MS is trying to come up with in .NET. It doesn't seem particularly difficult, though. Really, that's all we need: to be able to pass a hash to a domain:port and get a hash back (there'd probably be a standard status field... 200, 404, etc., but the rest would be whatever fields are defined by the service's API). Services like MapQuest could return a jpg when you pass an address in your hash. Services like Slashdot could return an article (or all articles, or whatever). Services like E*Trade could return stock prices. We can get all this information already, but with this standard interface there'd be no parsing HTML or crazy hacks like WebQL!
Does anyone know if there's anything like this already? And if so, why nobody uses it? And if not, ideas on how to get it out there?
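To make it concrete, here's a sketch of sendhash() in Perl. The wire format (key=value lines over an HTTP POST) is entirely made up - which is exactly the problem, since today everyone makes up their own:

    #!/usr/bin/perl -w
    # Sketch: pass a hash to host:port over HTTP, get a hash back.
    use strict;
    use LWP::UserAgent;

    sub sendhash {
        my ($host, $port, %hash) = @_;
        my $ua  = LWP::UserAgent->new;
        my $res = $ua->post("http://$host:$port/", \%hash);   # hash goes out as form data
        return (status => $res->code) unless $res->is_success;
        # Imaginary convention: the response body is key=value, one pair per line.
        my %ret = (status => 200);
        for my $line (split /\n/, $res->content) {
            my ($k, $v) = split /=/, $line, 2;
            $ret{$k} = $v if defined $v;
        }
        return %ret;
    }

    my %reply = sendhash('quotes.example.com', 80, symbol => 'RHAT');
    print "$reply{price}\n" if $reply{status} == 200 and defined $reply{price};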
Re:Yeah, okay. (Score:2)
Actually, I don't think you need standard DTDs, as long as each site stuck with its own DTD long enough to make it worthwhile. At that point, you can build a custom XSL for each site to transform their content to conform to your DTD, and you can parse that.
--
Re:Looks like a SPAMmer's dream (Score:1)
webMethods did this in 1997 (Score:1)
There's a chapter or two written by Charles Allen about WIDL in the XML Handbook (Goldfarb, et al).
But it's a technology that is dated now -- webMethods has moved on to B2B [b2b.com], and anyone who is jumping up and down about screen scraping in 2000 is just a little bit behind the times.
--brian
Re:webMethods did this in 1997 (Score:1)
Re:Hey Taco (Score:1)
Stage plant (Score:1)
Funny how you were silent when the guy who wrote an open source Web SQL thing posted his work earlier.
Re:goof? (Score:1)
Even if you could delete one row per picosecond, this would take more than 10^14 years to accomplish. (1.35 x 10^35 rows at 10^12 rows per second is about 1.35 x 10^23 seconds, or roughly 4 x 10^15 years.)
Therefore your OS would still have to be up so you could see the result.
Two Misunderstandings (Score:1)
Re:Stage plant (Score:1)
People will come to their own conclusions, and it appears many people posting here have ignorantly predetermined their stance on WebQL. So many consider themselves far above using WebQL, their vision perhaps narrowed by the handful of widely accepted applications that can each do only a portion of what WebQL is capable of. It is amazing how people don't want to change.
That's fine with me; don't change, and you will fall behind.
IMHO, WebQL is THE end-all data extraction/web crawling/data mining tool for the masses.
You don't have to use WebQL, just like Windows users don't have to use Linux.
The Relation Arithmetic Alternative (Score:2)
Here is the intro:
The future of the Internet is in what I call "rational programming," derived from a revival of Bertrand Russell [gac.edu]'s Relation Arithmetic [geocities.com]. Rational programming is a classically applicable branch of relation arithmetic's subtheory of quantum software (as opposed to the hardware-oriented technology of quantum computing [rdrop.com]). By classically applicable I mean it applies to conventional computing systems -- not just quantum information systems. Rational programming will subsume what Tim Berners-Lee [ruku.com] calls the semantic web [w3.org]. The basic problem Tim (and just about everyone back through Bertrand Russell) fails to perceive is that logic is irrational. John McCarthy's [stanford.edu] signature line says it all about this kind of approach: "He who refuses to do arithmetic is doomed to talk nonsense."
Re:Freely-Available Web Query Languages (Score:1)
I have a question: how are you supposed to use this? It's as difficult as NQL.
You don't have to be a programmer to grab a script from the script repository [webql.com] and plug it into WebQL.
How is it that WebSQL is so good and yet so unused? WebQL has been used by several companies to gather more information than they could handle. If WebSQL or any of the other virtual "Web query systems" are so good and so free, why aren't they accepted as a solution? Maybe they are better because they are free. It must be the price tag... So it has nothing to do with which is the better technology, it has to do with which one is cheaper. We're on the same wavelength now.
Niagara - Internet Query System (Score:1)
Re:FreeQL ? (Score:1)
Re:simple interface? (Score:2)
Re:Stage plant (Score:1)
You can't be serious. Your blanket statements about WebQL mean either that you worked on the project or that you're really, really sheltered from regular expressions.
WebQL is nothing more than a regexp wrapper. How do I know this? I've been working to implement the WebQL language in PHP this week. Here [207.244.81.196]. It took me three days to replicate about half of the language. I would have assumed that this "amazing" product was a bit tougher to implement.
Saying that WebQL is the "end-all data mining tool" is seriously short-sighted on your part. The WebQL language is limited, hacked together, and totally reliant on the data being in HTML format. If you are serious about not "falling behind," you'd embrace a more abstract tool, one that deals with metadata. But of course, I forgot; you're a troll.
______________