Comment Re:Weak error handling (Score 1) 394

That looks like quite a bit of code.

The definition of "quite a bit" is a matter of opinion, but I'm talking about 147 lines of perl (including comments and blanks) and 88 lines of shell scripts serving miscellaneous ancillary functions.

Point 5 means that we are no longer using wget but our own version of wget.

Sure, but so what? It's free software.

It also does not fix the multi-connection behavior when host spanning is used. It also does not handle sites that we do not want to crawl that may be connected to sites that we do want to crawl.

The problem can be solved.

Comment Re:Weak error handling (Score 1) 394

Does that mean the operator has to manually monitor the crons and restart the ones that failed?

Here's the thing. Either you're writing code to monitor wget processes, or you're writing code to monitor your custom-coded wget replacement (or the equivalent logic within an application not divided into processes). Or you're doing it manually, which may be reasonable.

How do you schedule orbitz.com to go off and then soggy.com to go off later?

Write code that launches wget on your schedule... Why do you think this is hard to do with wget?
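For what it's worth, a minimal sketch of that glue (the script name, paths, depths, and times below are invented for illustration, not from any real setup):

    #!/bin/sh
    # crawl-site.sh -- hypothetical wrapper: crawl one site into its own
    # directory, keeping a per-site log so failures can be spotted later.
    site="$1"
    mkdir -p "crawl/$site" logs
    wget --recursive --level=2 --wait=2 --random-wait \
         --directory-prefix="crawl/$site" \
         -a "logs/$site.log" "http://$site/"

Plain old cron then handles the "go off later" part, e.g. "0 1 * * * crawl-site.sh orbitz.com" and "0 4 * * * crawl-site.sh soggy.com".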

What if you are handling hundreds of different web sites? Hundreds of crons? How do you retry later on sites that are very slow at the moment? How would you know that wget timed out due to a slow download?

Having done this, I'll tell you how I did it. I don't claim it's exactly pretty, but it works, it's easy to do, and it won't cause problems if you're careful:

I enabled wget's logging facilities and scanned the logs for failures. I kept a queue of wget processes to run, and kept a fixed number of wget processes running at a given time. (I changed the number of processes as necessary by hand, although this might have been handled heuristically to maximize resource usage.)
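Roughly like this (a from-memory sketch, not the original scripts; hosts.txt, the log layout, and the error patterns are stand-ins, and the parallelism assumes GNU xargs):

    #!/bin/sh
    # Keep a fixed number of wget crawls running, pulling the next host
    # off the queue (hosts.txt, one hostname per line) as each finishes.
    MAX_JOBS=8
    mkdir -p logs crawl

    xargs -P "$MAX_JOBS" -I{} sh -c '
        host="$1"
        wget --recursive --level=3 --wait=1 --random-wait \
             --directory-prefix="crawl/$host" \
             -a "logs/$host.log" "http://$host/" \
            || echo "$host" >> failed-hosts.txt
    ' _ {} < hosts.txt

    # Scan the wget logs for timeouts and HTTP errors, and queue those
    # hosts to be retried later.
    grep -l -E "Connection timed out|ERROR (4|5)[0-9][0-9]" logs/*.log \
        | sed 's|^logs/||; s|\.log$||' >> retry-hosts.txt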

I'm totally confident this approach could be scaled up to the size of the whole internet, because the task is so easily divided into small sections and you're going to hit bandwidth limits long before number-of-processes limits. First assume that you're using a separate process for each host (not a wget process, but the glue process that runs wget). Are there too many hosts for that many processes? No. Are any hosts too big to be handled by a wget-coordinating script? You may think so, but I know you're wrong because I've seen it...

This is a perfect example of the 80/20 rule. The "solution" may cover 80% of the problem but that final 20% will require so much babysitting as to make it unusable. Wget is not an enterprise level web crawler.

You're right that there is a lot of "babysitting" required; you're wrong that the solution to certain of these problems must be "unusable" -- I know because I've seen them solved. My intuition says the others would be similarly solved.

One thing you might have to do is edit the wget source.

Comment Re:which language is best? (Score 2, Informative) 394

Wget for crawling tens of millions of web pages using a 10 line script? He doesn't understand crawling at scale.

Wget is made for crawling at scale.

There's a lot more to it than just following links. For example, lots of servers will block you if you start ripping them in full, so you need to have a system in place to crawl sites over many days/weeks a few pages at a time.

wget --random-wait
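To be a bit more concrete, a politeness-oriented invocation might look like this (the numbers and the example.com domain are only illustrative; tune them per site):

    # Pause between requests, randomize the pauses so the pattern is less
    # obvious, cap the bandwidth taken, and stop after a modest quota so
    # the rest of the site can be picked up on a later run.
    wget --recursive --level=3 \
         --wait=10 --random-wait \
         --limit-rate=50k \
         --quota=5m \
         --domains=example.com \
         http://example.com/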

You also want to distribute the load over several IP addresses

The way I do this with wget is to use wget to generate a list of URLs, then launch separate wget processes with different source IPs specified via --bind-address. It would, however, be trivial to add a --randomize-bind-address option to the wget source.
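Something along these lines, as a sketch (the addresses and urls.txt are hypothetical, and the IPs have to already be configured on the machine):

    #!/bin/sh
    # Split a previously generated URL list across several source IPs,
    # one wget process per address, each bound with --bind-address.
    IPS="192.0.2.10 192.0.2.11 192.0.2.12"
    TOTAL=3   # number of addresses in IPS

    i=0
    for ip in $IPS; do
        # Every TOTALth URL goes to the wget bound to this address.
        awk -v n="$i" -v t="$TOTAL" 'NR % t == n' urls.txt \
            | wget --bind-address="$ip" --wait=2 --random-wait \
                   -i - -a "crawl-$ip.log" &
        i=$((i + 1))
    done
    wait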

and you need logic to handle things like auto generated/tar pits/temporarily down sites, etc.

What makes you think you can't handle these things with wget?

And of course you want to coordinate all that while simultaneously extracting the list of URLs that you'll hand over to the crawlers next.

Again, why do you think wget is inadequate to this? It's not.

Any custom-coded wget alternative will end up reimplementing a great deal of wget. Most limitations of wget can be avoided by launching multiple wget processes and putting a bit of intelligence into the glue that does so. If that isn't enough, it probably makes sense to make minor alterations to the wget source instead of coding something new.

My point here is just that wget is way more awesome than you give it credit for.

Comment Re:"Community service" == free labor for the state (Score 1) 485

The inmates (ahem, volunteers) are motivated by the contract they sign which allows them to be put in jail for failure to satisfy their supervisors (or, if not put in jail, then at the very least put to work for another day).

It's not dissimilar to the incentive system at work behind minimum-wage employment. It works.

(What are you basing your 'doubt' on, anyway? You seem unfamiliar with the system. Are you?)

Comment "Community service" == free labor for the state (Score 1) 485

"Community service" means doing for free what the state would otherwise have to pay minimum wage to have done. The economic incentive is still there.

Your 'reading to kids' scenario is a myth (an exceptional sort of thing that might result from negotiated plea bargains involving high-priced lawyers). For the masses, "community service" is just forced labor.

Comment Re:Bringing Claude Shannon to higher education (Score 1) 165

Most of your response does not address my original posts at all. I'll address the one point that does.

Are the best communities the product of local universities or the global village?

How does one discern the goodness of a community?

Arbitrarily. Here's one metric: where can I go to get a physics question answered? Who will answer my physics question fastest and in the most detail? I don't think I will find the fastest answer at a university [online or not].

But by all means blog about it after class.

That's some smug attitude you've got there, but here on Slashdot, we write programs after class...

Although I have to say, there's a lot more value in any blog that people actually read than in a college paper written for an audience of one grader -- who will learn nothing from it.

Comment Re:Bringing Claude Shannon to higher education (Score 1) 165

The internet is at every university already. Campus denizens are overrepresented in many/most/all online forums. It isn't a question of one or the other, but rather of maximizing the benefit from both styles of communication.

OK, but I'm not talking about "styles of communication," I'm talking about the communicating communities themselves. Are the best communities the product of local universities or the global village? It is going to depend on specifics, but usually the local community -- no matter what sort -- is not going to be able to compete.

It's just so much easier to form connections at light-speed than at whatever the average speed of a human body is.

Comment Re:My memorable college experience was getting lai (Score 1) 165

To me, the classic moment of college was standing up in a classroom having to defend a position that people disagree with. And then arguing about it later in the cafeteria or dorm. If you've never spent all night arguing over the existence of God, then you never had an education.

I was doing this sort of thing when I was fifteen -- on the internet, with adults [including, by happenstance, a math professor]. There are entire internet forums devoted to arguing about god. Really, are you thinking about what you're saying? Do you realize where you are? If you want all-night arguments, the internet is going to beat any university...

And yes we did have a few drinks or a joint. And yes it's nice to have some girls join you in your intellectual explorations.

The only reason I ever went to university was to meet girls.

Comment Re:Consider Star Trek... (Score 1) 165

It's hard to find as great a concentration of intelligent people with an interest in a certain specialisation on the Internet as at a university. It's even harder to find a place with a high concentration of intelligent people with an interest in a certain specialisation and a lot of intelligent people working in a completely different field on the Internet.

I really don't think this is the case. Especially if you include "intelligent." For example, try to find a localized group that can compete with Undernet's #math for opportunities to talk about advanced math. I doubt one exists in the world; I certainly wouldn't expect to find one at an arbitrary university. Certainly, if I had a math question, it would make more sense to go there than to a university. Especially at 3am.

There's a reason why so many math majors & grad students spend so much time on IRC talking about math, rather than spending that time talking about math with their local peers.

The internet connects everyone in the whole world, so for any selection criteria, with such a vastly larger pool, it's almost always going to win.

Comment Re:Consider Star Trek... (Score 1) 165

Both in undergrad and grad school, I learned way more from random discussions, be they with other students or professors, than I ever did during the official class time. So much of an education is had by being around others who are also interested in the same things and eager to talk about it.

Because it's so hard to find people to talk to on the internet??

Comment Re:One sentence discredits the whole article (Score 0) 165

Actually, UofP is VERY good for certain types of degrees, Computer Science being one of them. While I don't have a degree from UofP, I have worked with IT people who do, and they were smart, motivated, well-educated people.

Correlation is not causation. Perhaps the types that get a UofP "education" have been hacking since they were 12.

Comment Re:Erm.... Labs? (Score 1) 165

My girlfriend just got finished telling me she's doing vet school on about $6000 a year for tuition and living expenses.

She may have a large scholarship.

At a lot of expensive USA private schools (most of them, if the sample with which I am familiar is representative), 10%+ receive enough scholarship money to pay about what a community college costs. (Still, $8000 is on the low end of that if you actually include living expenses; $8000 is about enough to rent a single bedroom in NYC.) But there is always the other 90%.

Comment Re:One sentence discredits the whole article (Score 3, Insightful) 165

The University Of Phoenix education is a complete and utter joke. What they teach is worthless at best and counterproductive at worst (and yes, I have seen some of the content of their master's programs: assignments that include algebra I was doing in 7th grade and homework questions like, "What is a MAN?")

That doesn't matter, because what universities sell is not education but credentials.

After all, the internet as a whole provides a much richer educational environment than any university possibly could, "internet university" or not. (Indeed, classes in ordinary universities are also a joke, if you're accustomed to learning things without being forced.)

But just learning things won't get you a job. I have heard perfectly competent hackers talk about going back to get another degree (in computer science) even though they know they wouldn't learn anything there, because it would help them get higher-paying jobs.

So yeah, there's a market for credentials, and the less time you have to waste pretending to be learning what in fact you already know, the better.
