Comment Re:Black holes (Score 1) 313
One thing you could have done to avoid this is to compute a checksum of the HTML you download. Keep a table or database of the checksums of every page you've fetched from a given site so far, and junk any page whose checksum you've already seen.
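A minimal sketch of that in Python, where an in-memory set stands in for the checksum table (a real crawler would probably persist it per site):

    import hashlib

    seen_checksums = set()  # stand-in for a per-site table/database of checksums

    def is_duplicate(html: str) -> bool:
        """Return True if this exact page body has already been downloaded."""
        digest = hashlib.sha256(html.encode("utf-8", errors="replace")).hexdigest()
        if digest in seen_checksums:
            return True
        seen_checksums.add(digest)
        return False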
Another thing you should do is record each URL you download and make sure you never fetch the same URL more than once.
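Same idea for URLs, again just a sketch with an in-memory set:

    seen_urls = set()  # every URL fetched so far

    def should_fetch(url: str) -> bool:
        """Fetch each URL at most once."""
        if url in seen_urls:
            return False
        seen_urls.add(url)
        return True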
One way sites screw this approach up is by appending a unique session id to each URL. You may need to strip or track the session id, or else you can get into an infinite loop downloading the same page over and over with a different session id each time. The checksum trick should catch this too.
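One way to handle it is to strip the session parameters before the URL goes into the seen set. A rough sketch, where the parameter names are just guesses at common session ids and would need adjusting per site:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Guessed list of query parameters that carry session state; adjust per site.
    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

    def canonicalize(url: str) -> str:
        """Drop session-id query parameters so one page maps to one URL."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k.lower() not in SESSION_PARAMS]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))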
I'm also writing a spider, but my emphasis is on indexing dynamic pages (product pages at e-commerce sites).