Comment Re:Black holes (Score 1) 313
One thing you could have done to avoid this is to compute a checksum of the HTML you download. Keep a table or database of the checksums of every page you've fetched from a given site so far, and junk any page whose checksum you've already seen.
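A minimal sketch of that in Python, where an in-memory set stands in for the checksum table (a real crawler would probably persist it per site):

    import hashlib

    seen_checksums = set()  # stand-in for a per-site table/database of checksums

    def is_duplicate(html: str) -> bool:
        """Return True if this exact page body has already been downloaded."""
        digest = hashlib.sha256(html.encode("utf-8", errors="replace")).hexdigest()
        if digest in seen_checksums:
            return True
        seen_checksums.add(digest)
        return False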
Another thing you should do is record each URL you download and make sure you never fetch the same URL more than once.
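Same idea for URLs, again just a sketch with an in-memory set:

    seen_urls = set()  # every URL fetched so far

    def should_fetch(url: str) -> bool:
        """Fetch each URL at most once."""
        if url in seen_urls:
            return False
        seen_urls.add(url)
        return True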
One way sites screw this approach up is by appending a unique session id to each URL. You may need to strip or track the session id, or else you can get into an infinite loop downloading the same page over and over with a different session id each time. The checksum trick should catch this too.
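One way to handle it is to strip the session parameters before the URL goes into the seen set. A rough sketch, where the parameter names are just guesses at common session ids and would need adjusting per site:

    from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

    # Guessed list of query parameters that carry session state; adjust per site.
    SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

    def canonicalize(url: str) -> str:
        """Drop session-id query parameters so one page maps to one URL."""
        parts = urlsplit(url)
        query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
                 if k.lower() not in SESSION_PARAMS]
        return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))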
I'm also writing a spider, but my emphasis is on indexing dynamic pages (product pages at e-commerce sites).