
Comment Re:And who gets to define "liberal?" (Score 1) 841

No, Democrats do not see wealth redistribution as a means of eliminating class divisions. Jesus. What Democrat says anything like that? Name one class-warfare Democrat in US office.

I think you'll find hardly anyone who doesn't agree that poor people should live better than they do

On the contrary, you don't have to look far to find many who say the poor live too well -- that they should live within their means, stop borrowing money on credit cards and mortgages, stop taking handouts, etc., because it's unfair for them to receive what they have not earned.

Comment Re:Oh, just great (Score 1) 841

Nazism is both conservative and liberal, ideologically -- it promises radical change, a new society, a new man, etc. -- but it conceives of this new world as a return to the purity of the old world, tradition, etc.

I don't think there is anything conservative about eugenics, though. If conservatives supported it, it wasn't out of conservatism.

Comment Re:Oh, just great (Score 1) 841

I agree with you about the persecuted groups, and about the neutral "two sides of the coin" stupidity...

Still, there is something real to philosophical conservatism that is not simply "the wrong side of history." The pro-Soviet left was also on the wrong side of history. They, as you said, made "new mistakes" -- opening the possibility (since realized) for a completely new level of disaster. If the conservatives at the time were not right, at least their form of wrongness would have been safer.

Conservatism is the voice that says "you cannot design a new society on paper; what exists today has reasons for its existence and is integrated into the social body; to destroy it will disrupt the organic whole, and the way of life of everyone, in ways we cannot predict."

There is value in that voice, so long as it tempers progress, rather than impeding it.

Comment Re:Whew... So there is hope for a cure? (Score 1) 841

Using ad-homs on your opponents won't make you correct.

You have badly misunderstood me. I didn't use any "ad-hom", nor am I opposing anything.

In fact, what I say is closer to a defense of conservative ideas than to an attack. The whole point is that, if conservatives are on average more stupid, it's not because stupidity makes people believe in conservative ideas -- on the contrary, the stupids don't even understand conservative ideas; they just stupidly call themselves conservatives (instead of stupidly calling themselves liberals).

However, you cannot honestly deny that the GOP seeks out the stupid demographic (misinformed single-issue voters) in exactly the same way that both parties seek other demographics. The Democrats haven't given away the stupid vote because they don't want it; they've given it away because the GOP already has it.

Comment Re:Define "Liberalism" (Score 1) 841

Yes, I'm sorry you ran into a wall and broke your hip, but you've had a job for ~30 years. You have money and should pay the bill yourself out of your personal wages/savings

Only in your fantasy world are the people receiving shitty government services those with savings they could be using instead.

Comment Re:And an absence predisposes you to conservativis (Score 1) 841

Please ignore the anonymous version of this post.

I was a liberal until I began to understand it was my money at stake, and my money is what I use to provide for my family... and distribute to charities as I see fit.

Only collective social action can insure against unemployment, so that everyone (not just x%) can continue feeding their family, regardless of what happens on Wall Street or in China...

Of course I realize this doesn't work. If you cannot threaten to starve a man's children, you cannot force him to husk corn. And a nation that cannot force anyone to husk corn cannot compete against China. I fully realize this.

Comment Re:And who gets to define "liberal?" (Score 2, Insightful) 841

To some people, a "liberal" is someone who believes the government should take care of people who have been left behind in some way in the economic process: the unemployed, the homeless, those who are at a disadvantage. Under that point of view, Cuba should be considered one of the most "liberal" regimes in the world.

Sorry, but no, communism is NOT being more of a Democrat than the Democrats. Communist politics simply do not fit on this spectrum.

There's a qualitative difference between saying that the underclass should have a better standard of living than they do now, and saying that the existence of an underclass should be abolished.

Comment Re:Whew... So there is hope for a cure? (Score 1) 841

In the USA, the GOP consistently courts the stupid demographic, while the Democrats have surrendered it. It's not that conservatism is stupid, but that the GOP actually compromises with the stupids, giving them the things they stupidly want (e.g., purely symbolic exclusion of gays, myriad forms of flag-waving), in exchange for power used for unrelated ends (e.g., corporate tax policies).

Of course, the Democrats do the same with, say, the black demographic.

Comment Re:Weak error handling (Score 1) 394

"147 line of code" which does not cover most of what we are talking about.

It does some of what someone (maybe you) said couldn't be done with this approach, thus proving its possibility...

To use the ingredient analogy: if wget is equivalent to a tomato and you change the wget code, it is no longer a tomato but a genetically modified tomato that can only be used in that one recipe.

Again, so what? (And who says it can only be used in that recipe? I use the feature I added to wget all the time.)

Does your system handle hundreds of sites without hand editing a config file or script? Does your system monitor runs to see if they complete and figure out what to do if they do not? Does your system tell the difference between a no data timeout and a slow data timeout? Have you solved the problem of coordinating multiple wgets with host spanning?

I already answered the last question (don't use host spanning). With regard to the others, it doesn't matter. More requirements mean more coding, but the general approach of starting with wget instead of coding from scratch is going to get shit done as quickly as possible without reinventing the wheel. You can come up with features X, Y, and Z that aren't already simple switches (although notice that others in this thread are listing features that are already simple switches) -- but that in itself is a poor argument for coding A, B, C... from scratch, when there's a lot already done for you.
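
To make "already simple switches" concrete, here's a sketch (the flag values and URL are placeholders, not a recommendation): your no-data-versus-slow-data question mostly maps onto wget's own timeout switches, with anything fancier living in the glue.

    # Fail fast on hosts that never answer; give slow-but-alive hosts a
    # generous idle window. --read-timeout fires when *no* data arrives for
    # that many seconds; an "overall too slow" policy would be glue, e.g.
    # wrapping the whole run in timeout(1).
    wget --connect-timeout=10 --read-timeout=120 --tries=3 --waitretry=30 \
         -r -nv -a crawl.log http://example.com/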

Now, frankly, the issues you are listing seem pretty damn trivial to me. I just don't see what the big problem is. Still, I don't want to address them point by point in this thread. (I also recognize that, in principle, harder-to-implement features could be thought up.)

I think the real point underlying the article (IIRC...) is that perfectionism (and implementing everything yourself is a form of this) sure can waste a lot of time. If you're trying to make the most of your effort -- instead of trying to make the best piece of software possible -- you need an attitude that searches for a lazy "good enough" solution. If your business model depends on having a better web scraper than anyone else, then you might write one -- but if you're not selling a proprietary web scraper, then it probably doesn't, and you're wasting your time, losing sight of the big picture.

BTW, the code I'm talking about retries failures infinitely, but only when specifically instructed to do so. Certainly good enough for my purposes at the time I wrote it. It would be trivial in that code to continually devote, say, x% of processes to retrying errors, if you wanted more automation. Just need to decide on x.
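
(For what it's worth, the infinite-retry part is just a wget switch; the "x% of processes" pool would be glue. A minimal sketch, where retry.queue and $url are names invented here:)

    # Retry forever, but only for URLs explicitly placed on the retry queue:
    while read -r url; do
        wget --tries=0 --retry-connrefused -nv -a retry.log "$url"
    done < retry.queue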

The original poster posited that everything can be done using generic Unix functions with a little glue. That is patently false considering that there are many features that are part of system requirements that are not covered by standard Unix calls.

(NB: when you say OP, you mean the article; when I say OP, I mean the OP on Slashdot, who disagreed.) My point is much more specific: that wget can do a lot more than OP said. I agree that 'xargs' won't suffice to drive wget for this purpose -- unless it does. Depending on your purpose, maybe you should just run it and use the results you get, accepting limitations -- at least you would get results, and without debugging any code.

Anyway, like I say above, although the limitations you list here are real, they still seem surmountable to me, and not with all that much effort. I certainly don't find the possibility of doing so "patently false", although the definition of "a little" glue is of course arbitrary. I originally got into the thread because I saw people saying things were not possible which I had already seen done.

The biggest failing of the Taco Bell analogy is that no matter how you combine the eight ingredients, you still come up with crappy pseudo-Mexican food; you do not create French food, Italian food, Chinese food, etc. The same thing applies to Unix utilities; they do almost everything you want but rarely everything.

It's not a failing, it's the point: you can make a lot of money (i.e., succeed in your goal as the proprietor of a corporate restaurant chain) by being content to produce crappy pseudo-Mexican food, instead of attempting some expensive gourmet menu (with so much more opportunity to fail).

(Is your goal to make the best food or sell the most food?)

Comment Re:Weak error handling (Score 1) 394

That looks like quite a bit of code.

The definition of "quite a bit" is a matter of opinion, but I'm talking about 147 lines of perl (including comments and blanks) and 88 lines of shell scripts serving miscellaneous ancillary functions.

Point 5 means that we are no longer using wget but our own version of wget.

Sure, but so what? It's free software.

It also does not fix the multi-connection problem when host spanning is used. It also does not handle sites that we do not want to crawl that may be connected to sites that we do want to crawl.

The problem can be solved.
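
(For the unwanted-connected-sites part specifically, wget already has coarse domain filters; a sketch with placeholder domain names -- the whitelist/blacklist only matters once -H lets wget leave the starting host:)

    # Crawl the two sites we want and follow links between them (-H spans
    # hosts), but never wander into a linked-everywhere domain we don't want:
    wget -r -H --domains=siteA.example,siteB.example \
         --exclude-domains=ads.example \
         http://siteA.example/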

Comment Re:Weak error handling (Score 1) 394

Does that mean the operator has to manually monitor the crons and restart the ones that failed?

Here's the thing. Either you're writing code to monitor wget processes, or you're writing code to monitor your custom coded wget-replacement (or equivalent logic within an application not divided into processes). Or you're doing it manually, which may be reasonable.

How do you schedule orbitz.com to go off and then soggy.com to go off later?

Write code that launches wget on your schedule... Why do you think this is hard to do with wget?
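
(Concretely, the laziest version of that code is cron itself; the hostnames are the ones from your question, and the "crawler" user and log paths are placeholders:)

    # /etc/crontab -- stagger the crawls:
    0 2 * * *  crawler  wget -r -nv -a /var/log/crawl/orbitz.log http://orbitz.com/
    0 4 * * *  crawler  wget -r -nv -a /var/log/crawl/soggy.log  http://soggy.com/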

What if you are handling hundreds of different web sites? Hundreds of crons? How do you retry later on sites that are very slow at the moment? How would you know that wget timed out due to a slow download?

Having done this, I'll tell you how I did it. I don't claim it's exactly pretty, but it works, it's easy to do, and it won't cause problems if you're careful:

I enabled wget's logging facilities and scanned the logs for failures. I kept a queue of wget processes to run, and kept a fixed number of wget processes running at a given time. (I changed the number of processes as necessary by hand, although this might have been handled heuristically to maximize resource usage.)
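
(A stripped-down sketch of that shape -- in bash here, though my original was perl; queue.txt, the logs/ directory, and the ERROR grep are names and heuristics invented for the example:)

    MAX=8   # fixed number of concurrent wgets, adjusted by hand
    mkdir -p logs
    while read -r url; do
        # block while the pool is full
        while [ "$(jobs -rp | wc -l)" -ge "$MAX" ]; do sleep 1; done
        log="logs/$(echo "$url" | md5sum | cut -d' ' -f1).log"
        wget -r -nv -o "$log" "$url" &
    done < queue.txt
    wait
    # scan the logs for failures so those runs can be requeued
    grep -l 'ERROR' logs/*.log > failed-runs.txt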

I'm totally confident this approach could be scaled up to the size of the whole internet, because the task is so easily divided into small sections and you're going to hit bandwidth limits long before number-of-processes limits. First assume that you're using a separate process for each host (not a wget process, but the glue process that runs wget). Are there too many hosts for that many processes? No. Are any hosts too big to be handled by a wget-coordinating script? You may think so, but I know you're wrong because I've seen it...

This is a perfect example of the 80/20 rule. The "solution" may cover 80% of the problem but that final 20% will require so much babysitting as to make it unusable. Wget is not an enterprise level web crawler.

You're right that there is a lot of "babysitting" required; you're wrong that the solution to certain of these problems must be "unusable" -- I know because I've seen them solved. My intuition says the others would be similarly solved.

One thing you might have to do is edit the wget source.

Comment Re:which language is best? (Score 2, Informative) 394

Wget for crawling tens of millions of web pages using a 10 line script? He doesn't understand crawling at scale.

Wget is made for crawling at scale.

There's a lot more to it than just following links. For example, lots of servers will block you if you start ripping them in full, so you need to have a system in place to crawl sites over many days/weeks a few pages at a time.

wget --random-wait
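
(To flesh that out: spreading a rip over days is mostly a matter of combining the politeness switches with a quota and timestamping. A sketch; the values and URL are placeholders:)

    # Fetch slowly, cap each run at 20 MB, and re-run daily; -N skips pages
    # unchanged since the last run, so each day's quota goes mostly to new
    # material.
    wget -r -N --wait=10 --random-wait --limit-rate=50k -Q 20m \
         -nv -a site.log http://example.com/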

You also want to distribute the load over several IP addresses

The way I do this with wget is to use wget to generate a list of URLs, then launch separate wget processes with varying source IPs specified via --bind-address. It would, however, be trivial to add a --randomize-bind-address option to the wget source.
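
(A sketch of that two-phase setup; the addresses are placeholders from a test range, and the grep that harvests URLs from the spider log is a rough heuristic:)

    # Phase 1: traverse links without downloading bodies.
    wget -r --spider -o spider.log http://example.com/
    grep -o 'http://[^ ]*' spider.log | sort -u > urls.txt

    # Phase 2: fetch, rotating across our local addresses.
    ips=(192.0.2.10 192.0.2.11 192.0.2.12)
    i=0
    while read -r url; do
        wget --bind-address="${ips[i % ${#ips[@]}]}" -nv -a fetch.log "$url"
        i=$((i+1))
    done < urls.txt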

and you need logic to handle things like auto-generated pages/tar pits/temporarily down sites, etc.

What makes you think you can't handle these things with wget?

And of course you want to coordinate all that while simultaneously extracting the list of URLs that you'll hand over to the crawlers next.

Again, why do you think wget is inadequate to this? It's not.

Any custom-coded wget alternative will be implementing a great deal of wget. Most limitations of wget can be avoided by launching multiple wget processes, putting a bit of intelligence into the glue that does so. If that isn't enough, it probably makes sense to make minor alterations to wget source instead of coding something new.
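
(The most minimal version of that glue is GNU xargs doing the process-herding; urls.txt is a placeholder name:)

    # Eight concurrent wgets over a URL list; xargs exits nonzero (123) if
    # any single fetch failed, which the glue can use to trigger a rescan.
    xargs -P 8 -n 1 wget -nv -t 3 < urls.txt || echo "some fetches failed" >&2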

My point here is just that wget is way more awesome than you give it credit for.
