Bow Tie Theory: Researchers Map The Web 133
Paula Wirth, Web Tinker writes "Scientists from IBM Research, AltaVista and Compaq collaborated to conduct the most intensive research study of the Web. The result is the development of the "Bow Tie" Theory. One of the initial discoveries of this ongoing study shatters the number one myth about the Web ... in truth, the Web is less connected than previously thought.
You can read more about it."
how does it work? what is next? (Score:1)
Now that they have shown that the web is not as connected as many people have speculated, do they have any solutions?! What are the possible solutions?
Have YOU really thought it through? (Score:2)
OK, first off, I'm really getting sick of this. Nerds don't have to be interested in Linux or Open Source to be a Nerd. That's how you define yourself, and therefore you think that all nerds should be like you, right? Just because this doesn't interest you doesn't mean that you have to troll the article, or that it's not important to other people who consider themselves nerds...
I have been using the Internet almost since it started, and can even remember the pre-web technologies like "gopher", "wais" and "veronica". If I need to find a web page, I can always use one of the major search engines like googal and altavista. It doesn't matter to me if Joe Average's page is not linked, since it is probably something he hacked together one evening and put up at geocities, and has not updated for over a year.
Wow, how often do you use "googal"? I mean, if you can't even spell google right, why should we believe that you have any knowledge of the internet? And the typical "Joe Average" doesn't have that page up there for you; it's for family and friends. Very few personal pages have a target audience larger than the people they know or people who want information on them.
I find all these "personal" pages on the web are a major irritant, as they seldom contain useful information, and they clog up the search engines with non-relevant crap, by polluting the search space.
If I want to know what Joe Sixpack in Assmunch Arizona called his dog, or to see pictures of his pickup truck, I would ask him. But I don't.
Have you heard of logical searches? If you know how to search the web properly, you should be able to find just about anything you want within the first 5 hits. Know which search engines to use for what you want and how to use the logical operators to filter out that "non-relevant" crap.
It is about time that us "geeks" re-claimed our Internet from the dumbed down masses. We should return to the days of ARPA, when only people with a legitimate requirement could get net access. The "democratization" (i.e. moronification) of the web has gone too far and is responsible for the majority of problems us "original internet users" are seeing. The flood of newbies must not only be stopped, it needs to be REVERSED. These non-tech-savvy people need cable TV, and not something as sophisticated and potentially dangerous as the Internet.
Perhaps a new more exclusive "elite" (in the good sense of the word) Internet should be set up, running only IPv6. Then we could capture some of the community spirit of the pre-AOL "good old days". And maybe these spammers, skript kiddies and trolls would back off.
Ooh, just what the web needs, more "elite" people like you. Dammit, the web is about information, equality and business. It's not just for you "31337 H@X0RZ" anymore. Grow up! Most of the technology that you're using today was developed because of the popularity of technology. You try to "reclaim" technology for your little group and you'll shrink the market for it so much that companies won't bother with it.
kwsNI
Re:Sources and sinks (Score:1)
I would say it would be a generally bad idea for them to link outside, as the ultimate goal is to keep visitors inside as long as possible. Linking outside is an absolute no-no in corporate web page design.
__________________________________________
Dark Matter pages (Score:1)
I'm sure that within these "Dark Matter" pages you would find a large number of "unsavoury activities" being carried out. Perhaps it would be valuable for an ISP to consider scanning web server assets (HTML, etc) to determine how much of it is unlinked - and to make a ToS/AUP that bars sites with too high a Dark Matter percentage. Especially if the server logs show traffic hitting the dark pages. Prime indication that something "non-connected" is going on - which if it isn't outright illegal is probably at least non-PC, webwise.
Thanks for the obvious (Score:1)
I think the author of this article miswrote its title; it should have been something like... Dumbasses who believe everything they read and hear about the Internet and cannot think logically for themselves should all be shot and refused Internet access. Researchers at IBM, Compaq and AltaVista have done a study to show what anyone with any intelligence would already know.
But perhaps that was a bit too long.
Re:Have they really thought it through. (Score:2)
The good ones are expecting their consumers to pay them back, the bad ones are trying to IPO.
What does this mean? Do the Ciscos of the world expect to stay in business by having end consumers repay their VC debt a penny at a time? I wouldn't think so. And only bad companies IPO? That's a rather shallow view, isn't it?
A distributed effort to create a map of the web? (Score:1)
Computer A connects to 123.45.12.34
Computer A sends an HTTP request to 123.45.12.34
If the request is answered, add the IP to the list of web server addresses; if it is ignored, increment the IP by one and repeat
Trouble is, we need a way to scan through these addresses in a way that doesn't take too long, preferably less time than the half-life of a web server. What good is a list of web servers if half the entries are invalid? It might make a great starting point for a web spider, but it's no Encyclopaedia Internautica.
4 billion addresses, 1 hundred computers dedicated to searching, 5 seconds spent on each IP address. ETA: 6.8 years. A joke.
4 billion addresses, 10 thousand computers in a distributed network, 10 seconds spent on each IP address. ETA: 49.7 days. Doable.
We need a Seti@Home approach to this. Naturally, IPv6 presents a new problem.
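For what it's worth, the ETA arithmetic above checks out: 2^32 addresses spread over 100 machines at 5 seconds apiece is about 6.8 years, and 10,000 machines at 10 seconds apiece is about 49.7 days. Below is a minimal sketch of what one volunteer scanning client could look like, assuming plain HTTP on port 80 only (so it misses the virtual-host and odd-port cases raised elsewhere in this thread); the block assignment and addresses are made up for illustration.

```python
import socket
import struct

def ip_to_int(ip):
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def int_to_ip(n):
    return socket.inet_ntoa(struct.pack("!I", n))

def probe(ip, port=80, timeout=5):
    """Return True if something answers an HTTP HEAD request on this IP."""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as s:
            s.sendall(b"HEAD / HTTP/1.0\r\n\r\n")
            return s.recv(12).startswith(b"HTTP/")
    except OSError:
        return False

def scan_block(start_ip, count):
    """Walk a contiguous block of addresses and collect the ones that respond."""
    base = ip_to_int(start_ip)
    return [int_to_ip(base + i) for i in range(count) if probe(int_to_ip(base + i))]

if __name__ == "__main__":
    # In a Seti@Home-style setup, each volunteer client would fetch a block
    # assignment from a central coordinator instead of hard-coding one here.
    print(scan_block("203.0.113.0", 256))
```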
Re:Sources and sinks (Score:2)
There's actually a very good reason for this: include a link on your page, and there's a non-zero chance that the viewer will follow it. If the web page in question is essentially an ad (which many pages are, these days), having someone follow a link off-site is like watching them change the channel when your commercial comes on the TV. Why provide them with the out?
Re:I doubt you're a developer (Score:1)
Heh. What happens when they have a row...? Does he get insulted in front of the whole community ??
Good evening, and here is the news. Today I caught my husband leering at the neighbour's teenage daughter...
Of COURSE there are four groups (Score:4)
To: whomever knows graph theory? (Score:1)
Re:The /real truth/ about web's topology... (Score:2)
Devil Ducky
Looks like an amoeba (Score:1)
We all knew that the Internet was in its early development stage, but I thought it was already closer to some multicelled trilobite [magma.ca] than a single-celled organism!
Re:Alternate Pasta-Based Web Theories (Score:1)
What about secure sites and generated pages? (Score:1)
How to clean up the Internet (Score:2)
Yes, I agree entirely with you and have been considering the issues you raise for some time. The most urgent needs facing the internet today are a) to get rid of current users who aren't capable of using it, and b) preventing further users from accessing it in the future. I have some ideas on how to proceed with these two points.
IMHO, the solution is to stop letting everyone access the web. There are two ways this should be implemented. Firstly, anyone under the age of 18 (or 16 or 21) should not be allowed on the web at all. Until they are adults, they cannot be trusted to handle the large amounts of dangerous information which the web can provide to them, and during this vulnerable stage in their life they can be swayed by rhetoric and promises. Doing this will immediately stop the market for censorware and filtering, since if only responsible adults can use the net, then they can handle what is seen there, and it will get rid of pirates, script-kiddies and trolls, who are almost exclusively under 16.
Secondly, access to the web should be dependent on some kind of examination process, whereby people who want to use the web have to take a test to determine their suitability. In this way we can weed out the undesirables from the net and make sure that the content on it is of uniformly high quality. Rather than having sites dedicated to racist hate, terrorist manifestos and anti-Christian diatribes, we can have decent sites which educate and enlighten readers, like we had before open access.
Now, I know these comments will offend some /.ers, but try to look beyond your liberal hand-waving for a minute and think about these proposals. The net is becoming a cesspool, and this is the only way to clean it up.
Good to see a study on this... (Score:2)
I love Google, but lately when I search I get more results consisting of dead links and posts to message boards than any useful info. I've been on the mailing list for the Search Engine Watch [searchenginewatch.com] newsletter for a couple years now, and while there's a lot being done to weed through all the fluff, IMHO the fluff is growing at too high a rate for the technology to keep up with presently.
Anybody currently active in the industry got an insight into how search engines are combatting all this expired flotsam?
The Divine Creatrix in a Mortal Shell that stays Crunchy in Milk
Re:Sources and sinks (Score:2)
They haven't realized that people want information from more than one source. They also haven't realized that providing links to those alternate sources will improve their credibility.
Re: Personal/Corporate (Score:1)
Having said that I guess some pages would be hard to categorise.
Re:The web is broken. (Score:2)
Just so you know, I'm not here to criticize your opinions. I'm here to criticize your sig. The first, most obvious problem, is that you are missing a period at the end of your sentence. Please fix this. Secondly, you should not have hyphenated that sentence. That's just wrong in so many ways. In mid-sentence -- like this -- you use a dash. In plain ASCII text, a dash is two -- count them -- two hyphens. There are other characters available, but those fall outside the 7-bit range and therefore, they cannot be trusted. Not that any of this matters because you should have used parentheses (the little round things on either side of this little comment here) or an ellipsis...
Here are some samples of what your sig should look like:
All opinions are my own (until criticized).
or
All opinions are my own
You must please understand sigs are very important. Unlike comments, you can change your sig and fix it and make it look pretty. Anybody that criticizes spelling in a post is an elitist and a hypocrite, but sigs can be changed. You can make a difference!
That said, it's time you got yourself a new sig. Thank you.
Re:How useful is this? (Score:1)
And these, the knot in the bowtie, are the meat of the Internet: real content with links elsewhere to cover what's not on that site. One example would be Slashdot.
Some tend to link only to themselves.
Your average corporate site...
Some want to be noticed so they provide lots of links, but aren't truly interesting, so nobody links to them.
And finally, we have the disconnected pages, linked only to their originating sites: linkless homepages and other contentless cruft.
None of this is particularly surprising; we've all seen examples of each type, although the ratio might have been a bit of a surprise -- it seems to be about a 50-50 split between commercial and non-commercial. What would be even more interesting is a traffic analysis: how much of the Web's traffic is in that compact 30% core? I'd wager around 90%.
Cheers,
-j.
Re:Just Wondering (Score:1)
I didn't say that was bad.
Doesn't have the same ring to it. (Score:2)
kwsNI
Is this a dup also? (Score:1)
Obvious glaring error (Score:2)
How did they do this? They used Altavista.
So their entire theory of "bow-tie connectedness" conveniently forgets that Altavista exists. Fortunately for us web users, Altavista (insert your favorite search engine) does exist, and its existence seems to invalidate their hypothesis.
So it's an interesting idea, but if it ignores the existence of search engines it doesn't really hold much meaning.
Alternate Pasta-Based Web Theories (Score:3)
-josh
Web Theories (Score:1)
I wonder if pr0n sites map out to a 'dirty picture and paper towel' image?
Re:Is it just me ? (Score:1)
IBM has hit a gold mine. Think of all the things they could sell!
Re:Have they really thought it through. (Score:2)
I find all these "personal" pages on the web are a major irritant, as they seldom contain useful information, and they clog up the search engines with non-relevant crap, by polluting the search space.
Really I think you should be blaming the search engines for that, not the web itself. It's the search engines who index it, after all.
The most convenient way to fix this problem would simply be for all your favourite sites to use meta information properly. This is exactly what it was designed for. Unfortunately there are too many lazy designers around that don't bother to implement it properly, so it's no wonder that search engines have trouble indexing and promoting it appropriately. Most geocities users who don't update their homepage for a year won't know or care about how to use meta info effectively, and it would quite easily demote their pages by default.
I don't want to sound too boring, but one of the best things about the net in its (mostly) unregulated state today is its openness and how it lets information be distributed so easily. Sometimes this information is unreliable, but that same mechanism can't prevent open debate about the same information, either.
Personal homepages are simply documents that somebody has placed on a server indirectly attached via a network to your own. If you don't like them, disconnect your computer from that network. If you want a censored system, then by all means design it, patent it, and only sell the rights to the people you want to use it.
Re:The web is broken. (Score:2)
the bleeding edge will always be messy as new technologies race to be ahead and others fall down
it's ALWAYS been like that in virtually every field of study in computing and no doubt humanity
ride the wave or swim back to shore, your choice
This is no surprise (Score:5)
Look at the money going into streaming media [windowsmedia.com]. A large segment of the business world still sees the internet as just another medium for TV [den.net] or radio [realnetworks.com] broadcasting. By its very nature, broadcasting is not interconnected; it's passive and linear.
Tim Berners-Lee [w3.org] wrote in his book, Weaving the Web [w3.org] that the main obstacle to the web being a true information web of shared knowledge is that content is controlled by too few. He was upset that browsers were developed which could not edit web pages like his original browser/editor [w3.org].
The silver lining to this, IMHO, is the "weblog" phenomenon, including sites like Slashdot, where ordinary users can contribute their ideas, especially in html format so that they can contribute links. I really believe that some day soon the conventional media sites will be forced to give this kind of capability to their readers, or else risk losing all those eyeballs to Slash-like sites.
"What I cannot create, I do not understand."
Re:The web is broken. (Score:2)
Netscape and Microsoft have market shares enough that their "features" are used, but none are big enough to set a de facto standard.
Wouldn't it be nice if *one* browser had a flawless implementation of the W3C standard?
why didn't you link to the abstract (Score:1)
lower those common denominators
coming soon :
!!London Bus found on the Moon!!
!!new Yeti! pictures!!!!
**win win win**
How useful is this? (Score:4)
This makes complete sense. If every page had links to every other page, you would never be able to find anything. Each page would have too many links. The way the web is developing, you start looking for info within the IN group (usually a search engine or someone's index page). This leads to the SCC, which eventually points you to a leaf node in the OUT group which has the truly interesting information.
I find this structure to be efficient and elegant.
Re:Have YOU really thought it through? (Score:1)
It wasn't always that way, now was it? (Score:2)
Of course the web was atrocious, but if you found a dumb page (take my old one [skylab.org] for example) there was always something linked to that WAS moderately interesting.
Wiki [c2.com] pages are awfully reminiscent of the "old web". (Of course that one is centered around eXtreme Programming and kinda boring, IMO, but it's the principle!)
Oh well. The corporatization of the web has brought lots of cool things, too; they're just harder to find now.
Re:Dark Matter pages (Score:1)
> bars sites with too high a Dark Matter percentage
Um. Censorship, especially that not based on an analysis of the actual content is like, real bad.
Re:Why more connected? (Score:2)
And I'm not arguing the opposite. We shouldn't just link every single word to Everything2.com just because we can, or because, God forbid, our site would not be linked tightly enough to the core if we didn't. The content has to be weighed. What frustrates me even more than a page with absolutely no applicable links (when it would be useful) is a site which has big blue glaring links all over the place and I can't find
Danger Will Robinson! (Score:1)
This is coming from an IBM spokesperson. Is anyone else upset? Charging for linking?
Re:Danger Will Robinson! (Score:1)
Sheesh, Troll? (Score:1)
Can we get some kind of IQ test going before allowing people to meta-moderate? Something along the lines of:
Q. You have five apples and you eat one, do you have
1.More
2.Less
3.My butt itches
Rich
Re:US right to vote is worthless anyway [OT] (Score:2)
Let's be honest now and drop the sarcasm. In America we are free... to a point. We live under a mass of laws that have been enacted over time to appease one group or another. Some of this is good... some of it is bad or just downright unenforceable, in and of itself. As for standing a better chance here than anywhere else, I think you would have a pretty good chance in Canada, Great Britain, or several other countries. America does not have a lock on success in this world. We just happen to be the most arrogant about it (unfortunately).
Impact of secure sites and generated pages? (Score:4)
Linked by (Score:1)
Re:Have they really thought it through. (Score:2)
OK, so you're trolling, but I almost agree with you anyway. Back in the late '80s, I was wishing more people were connected to the net. It was a great place to be. Now they're all here, I occasionally find myself wishing they weren't. The problem is that there's no quality control. If only people with half a brain were allowed Internet access, we wouldn't have the AOL syndrome. But real life isn't like that. For better or worse (overall, I think it's for better, despite the problems it causes), the unwashed masses do contribute to the essence of the net. For every 1000 AOL lusers, the general population gives us a Rob Malda or an Iliad. Not an ideal ratio, but better than nothing.
Re:The web is broken. (Score:2)
Is there a proper way of using the web? I don't think so. The web is many things to many people. That's what makes it so alive and so interesting.
there are too many poorly done corporate sites
Sure, so what? See Sturgeon's Law (90% of everything...). A badly designed corporate site tends to be its own punishment.
We need more of these research projects to help us figure out what needs to be changed.
Seems like you want to impose good taste and proper programming practices on the web. Thank you very much, I'll pass. I don't want the web to be Martha-Stewardized.
Kaa
shape (Score:1)
Re:This is no surprise (Score:1)
BTW, if you hadn't pissed away your karma, you might have been able to moderate me down.
"What I cannot create, I do not understand."
Re:Read more at K5 (Score:1)
Some ways of finding unlinked-to sites (Score:1)
1) Sites that have their own domain.
2) Sites that share their domain with others.
Sites with their own domains can be found simply by searching domains. I would think that NSI and other registries would be willing to part with their zone data for research purposes by a reputable organization.
Sites that share a domain are harder. These could probably be estimated by finding the ratio between pages reachable from the domain's home page and those not reachable for known sites and extrapolating.
Another useful source, if they can get it, is Alexa's data. Alexa tracks pages as the user visits them. As a result, any page that any user (using the Alexa plugin) visits, Alexa can catalog. I have caught Alexa crawling pages of mine that were deliberately set up to not have any links to them.
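A rough sketch of the extrapolation idea in point 2, assuming you have a handful of sites where both the reachable and unreachable page counts happen to be known (all numbers below are made up for illustration):

```python
def estimate_unlinked_pages(samples, total_known_pages):
    """
    samples: list of (reachable_pages, unreachable_pages) counts measured on
             a few sites where we know both numbers.
    total_known_pages: pages already reachable web-wide (e.g. via a crawl).
    Returns a crude estimate of how many unreachable ("dark") pages exist,
    assuming the sampled reachable:unreachable ratio holds everywhere.
    """
    reachable = sum(r for r, _ in samples)
    unreachable = sum(u for _, u in samples)
    ratio = unreachable / reachable
    return int(total_known_pages * ratio)

# Hypothetical numbers purely for illustration:
print(estimate_unlinked_pages([(900, 150), (4000, 700)], 200_000_000))
```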
Re:This is no surprise (Score:1)
Amen. I'm tired of hearing about how e-business is the next big step in the web (but I doubt anybody here buys that sort of media hype anyway).
The web is where it is today because it allows people to publish to the world for peanuts, period. The next big step in the web is making it easy for novice home users to publish content without using costly third party hosting. These are the things required for this to happen:
I know this will probably lead to the web being further polluted with poorly designed pages and midi music, but it's worth it since this would probably double online content overnight.. and we all know content is what matters, right?
Nerd != Linux (Score:1)
References (Score:2)
Adamic and Huberman (2) 99. L. Adamic and B. Huberman. Scaling behavior on the World Wide Web, [xerox.com] Technical comment on Barabasi and Albert 99.
Aiello, Chung, and Lu 00. W. Aiello, F. Chung and L. Lu. A random graph model for massive graphs, [xerox.com] ACM Symposium on Theory of Computing (STOC), 2000.
Albert, Jeong, and Barabasi 99. R. Albert, H. Jeong, and A.-L. Barabasi. Diameter of the World Wide Web, Nature 401:130-131, Sep 1999.
Barabasi and Albert 99. A. Barabasi and R. Albert. Emergence of scaling in random networks, Science, 286(509), 1999.
Barford et al. 99. P. Barford, A. Bestavros, A. Bradley, and M. E. Crovella. Changes in Web client access patterns: Characteristics and caching implications, in World Wide Web, Special Issue on Characterization and Performance Evaluation, 2:15-28, 1999.
Bharat et al. 98. K. Bharat, A. Broder, M. Henzinger, P. Kumar, and S. Venkatasubramanian. The connectivity server: fast access to linkage information on the web, [digital.com] Proc. 7th WWW, 1998.
Bharat and Henzinger 98. K. Bharat, and M. Henzinger. Improved algorithms for topic distillation in hyperlinked environments, [digital.com] Proc. 21st SIGIR, 1998.
Brin and Page 98. S. Brin, and L. Page. The anatomy of a large scale hypertextual web search engine, [stanford.edu] Proc. 7th WWW, 1998.
Botafogo and Shneiderman 91. R. A. Botafogo and B. Shneiderman. Identifying aggregates in hypertext structures, Proc. 3rd ACM Conference on Hypertext, 1991.
Carriere and Kazman 97. J. Carriere, and R. Kazman. WebQuery: Searching and visualizing the Web through connectivity [uwaterloo.ca] , Proc. 6th WWW, 1997.
Chakrabarti et al. (1) 98. S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource compilation by analyzing hyperlink structure and associated text, [decweb.ethz.ch] Proc. 7th WWW, 1998.
Chakrabarti et al. (2) 98. S. Chakrabarti, B. Dom, D. Gibson, S. Ravi Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Experiments in topic distillation, [ibm.com] Proc. ACM SIGIR workshop on Hypertext Information Retrieval on the Web, 1998.
Chakrabarti, Gibson, and McCurley 99. S. Chakrabarti, D. Gibson, and K. McCurley. Surfing the Web backwards, Proc. 8th WWW, 1999.
Cho and Garcia-Molina 00. J. Cho and H. Garcia-Molina. Synchronizing a database to improve freshness, [stanford.edu] to appear in ACM International Conference on Management of Data (SIGMOD), May 2000.
Faloutsos, Faloutsos, and Faloutsos 99. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power law relationships of the internet topology, ACM SIGCOMM, 1999.
Glassman 94. S. Glassman. A caching relay for the world wide web [digital.com] , Proc. 1st WWW, 1994.
Harary 75. F. Harary. Graph Theory, Addison-Wesley, 1975.
Huberman et al. 98. B. Huberman, P. Pirolli, J. Pitkow, and R. Lukose. Strong regularities in World Wide Web surfing, Science, 280:95-97, 1998.
Kleinberg 98. J. Kleinberg. Authoritative sources in a hyperlinked environment, [cornell.edu] Proc. 9th ACM-SIAM SODA, 1998.
Kumar et al. (1) 99. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for cyber communities, Proc. 8th WWW, Apr 1999.
Kumar et al. (2) 99. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Extracting large scale knowledge bases from the Web, Proc. VLDB, Jul 1999.
Lukose and Huberman 98. R. M. Lukose and B. Huberman. Surfing as a real option, Proc. 1st International Conference on Information and Computation Economies, 1998.
Martindale and Konopka 96. C. Martindale and A. K. Konopka. Oligonucleotide frequencies in DNA follow a Yule distribution, Computers & Chemistry, 20(1):35-38, 1996.
Mendelzon, Mihaila, and Milo 97. A. Mendelzon, G. Mihaila, and T. Milo. Querying the World Wide Web [toronto.edu], Journal of Digital Libraries 1(1), pp. 68-88, 1997.
Mendelzon and Wood 95. A. Mendelzon and P. Wood. Finding regular simple paths in graph databases [toronto.edu], SIAM J. Comp. 24(6):1235-1258, 1995.
Pareto 1897. V. Pareto. Cours d'economie politique, Rouge, Lausanne et Paris, 1897.
Pirolli, Pitkow, and Rao 96. P. Pirolli, J. Pitkow, and R. Rao. Silk from a sow's ear: Extracting usable structures from the Web [acm.org] , Proc. ACM SIGCHI, 1996.
Pitkow and Pirolli 97. J. Pitkow and P. Pirolli. Life, death, and lawfulness on the electronic frontier, Proc. ACM SIGCHI, 1997.
Simon 55. H. A. Simon. On a class of skew distribution functions, Biometrika, 42:425-440, 1955.
White and McCain 89. H.D. White and K.W. McCain, Bibliometrics, in: Ann. Rev. Info. Sci. and Technology, Elsevier, 1989, pp. 119-186.
Yule 44. G.U. Yule. Statistical Study of Literary Vocabulary, Cambridge University Press, 1944.
Zipf 49. G.K. Zipf. Human Behavior and the Principle of Least Effort, Addison-Wesley, 1949.
___
The Astral Plane Model (Score:1)
The internet is like the Astral Plane. We access it from the "real" world all the time. Time and space obey different rules there. You can leap from place to place, no matter the distance, in the same amount of time, and from any location. You can even be in more than one place at once. Matter can be changed and manipulated at will. Cyberspace and alternate dimensions have a lot in common.
Re:Alternate Pasta-Based Web Theories (Score:1)
That's a spicy meat-a-ball!
Re:The web is broken. (Score:5)
Then don't! That's one thing many "web authors" still don't get... The WWW is a text-oriented medium. It's a page of text that has links to other pages of text. Everything else is just cruft.
HTML doesn't define how a web site should look down to the pixel, and this is one of its strong points. It's up to the user to decide how to view a site. If the user doesn't want images, your site should look just fine without them.
The minute you start checking to make sure your site looks the same on all browsers, you should re-think your entire site. Why do you want it to look the same on all browsers (it won't by the way...)? This usually indicates that you are focusing too much on presentation and not enough on content.
Re:The /real truth/ about web's topology... (Score:1)
Re:22% of sites were sites that we couldn't find. (Score:1)
Is it me, or does anybody else have this mental image of a machine trying every single possible URL (and I'm not just talking domain names, or even just index.html files) and filing them, then going around to see how many could be found in a search engine, and then seeing how many linked to a search engine?
My mind boggles how IBM et al managed to find all the 'unlinked' sites to get their figures.
Unless, of course, they guessed...
Richy C. [beebware.com]
--
Uses AltaVista's raw data... (Score:2)
In general, the AltaVista crawl is based on a large set of starting points accumulated over time from various sources, including voluntary submissions. The crawl proceeds in roughly a BFS manner, but is subject to various rules designed to avoid overloading web servers, avoid robot traps (artificial infinite paths), avoid and/or detect spam (page flooding), deal with connection time outs, etc. Each build of the AltaVista index is based on the crawl data after further filtering and processing designed to remove duplicates and near duplicates, eliminate spam pages, etc. Then the index evolves continuously as various processes delete dead links, add new pages, update pages, etc. The secondary filtering and the later deletions and additions are not reflected in the connectivity server. But overall, CS2's database can be viewed as a superset of all pages stored in the index at one point in time. Note that due to the multiple starting points, it is possible for the resulting graph to have many connected components.
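A toy sketch of a breadth-first crawl over an already-extracted link graph; the seed list and spam filter here are stand-ins for AltaVista's accumulated starting points and filtering rules, not the real thing:

```python
from collections import deque

def bfs_crawl(link_graph, seeds, max_pages=1_000_000, is_spam=lambda url: False):
    """
    Breadth-first traversal over a link graph given as {url: [outlinks]}.
    Seeds play the role of the accumulated starting points; the is_spam
    hook stands in for the crawler's various filtering rules.
    """
    seen = set(seeds)
    queue = deque(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if is_spam(url):
            continue
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

# Tiny made-up graph: one seed, and one page nothing links to, which a pure
# link-following crawl can never discover no matter how long it runs.
graph = {"a": ["b", "c"], "b": ["a"], "c": [], "dark": ["a"]}
print(bfs_crawl(graph, seeds=["a"]))   # ['a', 'b', 'c'] -- "dark" is never found
```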
Why more connected? (Score:3)
YHBT YHL HAND :-) (Score:1)
Re:The web is broken. (Score:1)
Wouldn't it be nice if *one* browser had a flawless implementation of the W3C standard?
I'd settle for NS going far far away. I'm so sick of seeing work that passes through the W3C validator only to be mangled by Nutscrape.
Wasn't NS supposed to be saved/fixed by open source? When is that going to happen?
Re:how does it work? what is next? (Score:1)
Many sites are virtual hosts, distinguished only by the Host: header your browser sends; compound that with the fact that a webserver can be placed on ANY port in the 1-65535 range, and using that theory to find every server is impossible.
Remember: an effectively infinite number of possible hostnames, on 65535 possible ports, on billions of IP addresses.
Thought I'd throw out that theory
Re:22% of sites were sites that we couldn't find. (Score:1)
Although almost by definition, if a site isn't linked, it isn't part of the world wide web, the same as if web pages are on private networks.
An interesting thought is that what if the world-wide web somehow ended up in two halves with no links between the two? Would there be two world wide webs or would the bigger one prevail or would we still choose to think of it all as the one world wide web?
Rich
We're all part of a big BOW TIE (Score:1)
The results of this study have the most poetic conclusion - every part of the internet is taking part in a big BOW TIE.
Sure, some parts of the bow tie aren't as well connected as others, but hey, that's their place. It takes millions and millions of sites to put together this universal-intergalactic bow tie, and not everyone can be at the center of it.
This reminds us that we should each make an effort to make our portion of the bow tie a bit nicer. Refraining from talking about petrified-natalie portman-grits really does pay off in the long run, making the heavenly Bow Tie a nicer place to surf.
Fellow Linux users, do not insult Windows and Mac users, as they are part of the great Bow Tie as well.
The web ain't broke (Score:1)
They want it to have the latest wizz-bang 'features' of a half-dozen different browsers.
They want it to contain ALL 'pertinent' information on the front page, but be clear, concise and readable at a glance.
They want it to PRINT cleanly on an 8.5x11 sheet - or worse yet, on an A4 sheet.
They want it to be secure, and robust and stable, but only if they can have it done TODAY!
Re:The web is broken. (Score:1)
The web allows for many different media, and yet too often people only use graphics (and the occasional text).
I think that browsers such as the newest IE for the Mac (as much as I despise MS) and iCab (a very compliant browser, even showing HTML mistakes) are the future of browsers: compliant.
Other people on this thread are correct: Clients are the problem. Try showing them what their web page looks like in Lynx, or in iCab. "Do you want people coming to your web page to see this?"
Then we come to my pet-peeve: JavaScript. I cannot stand pages that depend on JS-support. It is fine to use, but as soon as I get to a page that is impossible to navigate without JS, I leave.
My final point: people need to realize that the web allows for a relatively seamless media environment. This does not mean, however, that you should pick one form of media and rely on it exclusively on your page.
Someone desperately needs to get laid (Score:1)
This is research? A scientific study of links? Gimme a break. Somebody, somewhere please find these people a date.
Going To Heck (Score:1)
Re:More information (Score:1)
I'd be surprised to see that happen... the way the tunnel would grow would be because more and more links are added. If 30% of sites on the web are in the core, then assuming that the links are to random sites, after typically 4 links have been added it will be connected to the core and therefore fall into the IN group.
Of course, sites aren't added randomly, but since the core by definition contains often-linked-to sites, chances are that more than 30% of newly added links in the tunnel will point to the core.
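Under the parent's assumption that each new link independently lands in the core with probability 0.3, the "typically 4 links" figure follows directly:

```latex
P(\text{no core link after } k \text{ links}) = 0.7^{k}, \qquad
0.7^{4} \approx 0.24, \qquad
\mathbb{E}[\text{links until first core hit}] = \tfrac{1}{0.3} \approx 3.3 .
```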
[TMB]
Slashdot's internal topology... (Score:1)
A silly question (Score:1)
If you want it to look the same on all browsers... (Score:2)
Why don't you do THIS for your customers? It gives them exactly what they want: they can get pixel-level control of how their site looks. They can even digitize paper brochures and do it that way. And by ignoring all of the cruft of crappy HTML across 4 different browsers on 5 different platforms, you can probably do the site cheaper and easier this way too.
HTML and pixel-accurate renderings are MUTUALLY EXCLUSIVE. HTML isn't, wasn't and should NOT be designed for that. If you want something better, either design it yourself, or piggyback it on something which works and can be done today. JPEG or GIF or PNG.
If your customers want to look like idiots on the web, then I'm sure they'll like this. Not only that, you should ADVERTISE this advantage. ``The only web design company whose sites look EXACTLY how you want, on every browser on every OS.''
Exactly (Score:2)
I still remember the HP48 websites circa 4 years ago... 95% of them were crap, just links to another site that was full of links to other crap sites. Had ONE of those sites had a categorized set of links, I would have bookmarked it in an instant. ``links to development software sites'' ``links to fan sites'' ``links to shareware sites'' ``links to math sites''
Need a ladder to get off that high horse? (Score:2)
It is about time that us "geeks" re-claimed our Internet from the dumbed down masses. We should return to the days of ARPA, when only people with a legitimate requirement could get net access. The "democratization" (i.e. moronification) of the web has gone too far and is responsible for the majority of problems us "original internet users" are seeing. The flood of newbies must not only be stopped, it needs to be REVERSED. These non-tech-savvy people need cable TV, and not something as sophisticated and potentially dangerous as the Internet.
While my wife and I often joke about the sentiment of this statement (at least once a day, one of us will point at a website or an email and say, "Yet Another person who really shouldn't be on the Internet"), we also know that actually believing it is horribly shortsighted thinking. Yes, there's a lot of no-content fluff out there on the web. But people have to start somewhere. I wouldn't expect a person's first web page to be more meaningful than "here's my house, my family, and my pets" any more than I'd expect a 6-year-old's first two-wheeler to be a Harley.
Granted, some folks never get past the "training wheels" stage. (Okay, make that "a lot of folks" these days.) But the Internet has long passed the days when it was a tool for a select group of people. If the S/N ratio is dropping precipitously, well, then, improve your noise filters. Make it a habit to include things like "-url:aol.com" in every search if you need to. You're one of the "tech-savvy" crowd (directed at the original AC who posted); use your tech knowledge! If all you can do is bitch about the fluff on the web without using readily-available tools to cut through it, maybe you're not as tech-savvy as you'd like us to believe.
"I shouldn't have to" isn't a valid response, either. In any information search, irrelevant data will turn up, and you're going to have to sort through it anyhow.
Aero
I doubt you're a developer (Score:2)
Clearly, you're not a developer. For those of us who do this for a living, it's about presentation and content. And we're not necessarily designing our *own* websites, we're designing for clueless clients who refuse to be convinced of certain practices/standards no matter HOW MUCH we pound them into their puny skulls.
Web Developers/Designers have the most *clueless* clientele of most any industry, and we have to develop for them, not us. Believe it or not, graphics occasionally look good on a website, and people WANT THEM. And considering how differently IE and NS handle tables and alignment, we DO have difficulties making a site look the same in both browsers.
For those of us who design, we know that this is a perpetual, neverending headache.
Quidquid latine dictum sit, altum viditur.
oh please! it's a "clip-on" (Score:2)
But getting the analogy wrong just reflects the simplicity of what they've done. Their categorizations based on number of links in and number of links out could have been made a priori. They did measure the size of each group which was somewhat interesting.
A much more interesting study would be of the paths that people actually follow. Who cares what links the author put up if nobody clicks on them? But the paths that people take would tell us a lot. Do people start at their bookmarks? Do they start at portals? Can they be categorized? And the real question: what paths do the people who buy stuff take?
Read more at K5 (Score:2)
Re:The web is broken. (Score:2)
<<**SHUDDER**>>
Re:More information---Not really (Score:2)
If you consider Ramsey theory, you'll know that any two-coloring of the edges of a large enough complete graph will give either a group of vertices that are all connected to each other (a clique) or a group where none of the vertices is connected to any other (an anti-clique).
For example, any two-coloring of a complete 6-vertex graph gives a clique or anti-clique of at least three vertices. In a social context, this means that in any group of 6 people there will be at least 3 people who either all know each other or are all strangers to one another. A theorem by Erdős tells us that the web probably does not have a clique or anti-clique of size much greater than 1+log n (here log = log base 2), where n is the number of web sites. In the other direction, any graph on n vertices is guaranteed to contain a clique or anti-clique of size at least roughly (log n)/2.
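For reference, the standard Ramsey-theory bounds behind those statements (textbook facts, not results from the bow-tie study):

```latex
R(3,3) = 6, \qquad 2^{k/2} < R(k,k) \le 4^{k} \quad (k \ge 3)
```

The upper bound guarantees a clique or anti-clique of size about (1/2) log2 n in any graph on n vertices, while Erdős's probabilistic lower bound shows that nothing much beyond 2 log2 n can ever be guaranteed.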
Easy to disprove (Score:2)
Rather obvious. (Score:2)
Case in point - recently I needed to find some information about a specific company. Now this company's name is virtually the same as a popular Unix variant (no, not Linux). Searching for the company name, once all the Unix links were weeded out (this was a chemical company) led to some federal documents about the company, but not much other information at all.
As it turns out, the company in question HAS a web site (and has had one for some time) - it just wasn't linked from anywhere I could access on the common search engines.
Still, it's nice to have some data on this...
The /real truth/ about web's topology... (Score:3)
__________________________________________
More information (Score:4)
Our analysis reveals an interesting picture (Figure 9) of the web's macroscopic structure. Most (over 90%) of the approximately 203 million nodes in our crawl form a single connected component if hyperlinks are treated as undirected edges. This connected web breaks naturally into four pieces. The first piece is a central core, all of whose pages can reach one another along directed hyperlinks -- this "giant strongly connected component" (SCC) is at the heart of the web.
In graph theory, a strongly connected component is a maximal set of mutually reachable vertices in a directed graph, i.e. a group in which every vertex can be reached from every other vertex.
What's interesting is that the four groups mentioned in this article are all approximately the same size, with the SCC group being only slightly larger than the others, which are IN (pages that can reach the core but can't be reached from it), OUT (pages reachable from the core that don't link back), and the tendrils hanging off IN and OUT.
So what they're saying is that really only about a quarter of the internet is the core that is strongly connected to the rest of it. Which is interesting in itself, because I'd have thought it was a lot higher.
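A quick way to see what a strongly connected component is in practice, using networkx purely for illustration (the study used its own connectivity server, not this):

```python
import networkx as nx

# A toy directed "web": a 3-page core, one IN page, one OUT page,
# and one completely disconnected page.
G = nx.DiGraph([
    ("core1", "core2"), ("core2", "core3"), ("core3", "core1"),  # mutually reachable
    ("in", "core1"),        # links into the core, nothing links back
    ("core2", "out"),       # reachable from the core, links nowhere
])
G.add_node("island")        # not linked at all

sccs = sorted(nx.strongly_connected_components(G), key=len, reverse=True)
print(sccs)  # largest first: the 3-page core, then three singleton components
```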
Re:Have they really thought it through. (Score:2)
Has it ever occurred to you that slashdot was once Rob's personal page?
A standard compliant web browser. (Score:2)
-Jason
Re:More information (Score:2)
Most interesting, I think, were the tunnels, connecting IN to OUT but bypassing the core. It would seem that such tunnels indicate weaknesses in the makeup of the core, which is to say paths of connected interest that for some reason are not included in the core. These, I think, would be worth watching to see if they grow or diminish. If a tunnel grew to a size similar to the core, it would make an interesting model where IN and OUT have more than one major connecting network.
Most of the media coverage of this was saying that every company wanted to be in the core, but I think that's a very crude take on it. I didn't especially see anything in this study that indicated that interconnectivity was closely linked to traffic, much less relevant traffic.
Re:I doubt you're a developer (Score:2)
Truly, having to cater to clueless clients is what makes all jobs difficult
I still reject the (for some people) dogmatic view that a site needs to look the same in all browsers.
Re:The web is broken. (Score:3)
What you've just described is gopher with links.
I've said this on slashdot before, and I'll say it again: The web is NOT Gopher. The web is a multi-media platform, including graphics, animation, video, sound, and any other funky stuff people want to throw up on it. The whole "The web should be text. Graphic elements are clutter." mentality makes me sick. I agree 100% that a site should NOT be DEPENDENT on graphics or other 'specialty media' to get content across. That's what good consideration for text-based users and ALT tags are for. But a web without graphics is merely gopher tunneled over http.
Why do you want it to look the same on all browsers (it won't by the way...)?
It's pretty simple: clients don't understand the web. They want all that pretty crap. They REQUIRE it to look the same wherever they see it. They expect things as low level as kerning and leading to be the same, universally.
Like I said in my first post, we (as in everyone) need to recognize that the web is a new medium. Traditional media conventions don't apply.
Re:Why more connected? (Score:2)
The web is supposed to be linked together. That's why you put something on the web instead of publishing it in a 'zine or a book [loompanics.com] or any other form of printed material. Just because you don't want or need one-click access to relevant information doesn't mean that it shouldn't be there. Would you still visit slashdot if it didn't link to the articles it talked about? Would suck [suck.com] be any good without links?
I'm not arguing for linking to random information just because you can, but informative linking is why hypertext [xanadu.net] has the hyper.
--
Re:The web is broken. (Score:2)
I have to disagree with you there. Undoubtedly, the web started out as, and was designed to be, a text-oriented medium of information propagation. However, it is also true that it has outgrown its original design. How else do you explain "IMG" tags? Why would they be required in a text-only medium?
Yes, there are limitations originating from its design goals that generate a sense of awkwardness when implementing graphics-oriented pages. However, there are principles of web page design which can be followed to minimize the awkwardness. Graphics are now very much a part of the web: deal with it the best you can. Closing your eyes and hoping it will go away is not a good solution.
I have no solution for the original problem posed regarding programming for multiple browsers. This is indeed a bitch. But the one about multiple resolutions is much more easily fixed: program your webpages to a fixed resolution. I contract at IBM, and IBM's standard is that the webpage must be displayable at a 640x480 resolution without having to scroll. There are exceptions to this rule of course, but those sites need to get approval for exceptions from higher up.
Re:Have they really thought it through. (Score:2)
It is not everyone's internet. The internet was funded by business for business and is supported and enhanced by business and for business. You are an invited guest here; mind your manners.
The dumbing down is done by the masses, but it is neither wanted nor promoted. The internet gets its legs from the billions in capital that business (mostly US) provides for its own benefit, not yours. Pr0n, Joe Sixpack's dog pics and AOL crap are just unwanted byproducts.
Re:The web is broken. (Score:2)
How do people find your pages? How do indexes and search engines work? It's all based on the textual content. Google doesn't do OCR on your GIFs of scanned brochures, or voice recognition on your MP3s of your radio spots. Even if images or sounds are the focus of your site, you'd best have plenty of text that indexes and describes that content.
What loads fastest, giving surfers the information they're looking for in the least time? Text.
What can be displayed in the user's choice of colors and fonts, so that it's legible in any situation? Text.
What can be rendered on a PDA, or read by a text-to-speech converter for the blind? Text.
What should web designers do when clients don't understand these issues? Apply the clue stick. Gently and with respect, but firmly, make it clear that you know more about the internet and the WWW than they do, that's why they are paying you, and if they want a rhinestone-encrusted, illegible and unusable site that takes three days to load over a 28.8k PPP link, then they can hire a 12-year-old who's just finished reading HTML for Dummies instead of a professional - and then spend the next few months wondering why they bother having a web site, since it's done fsck all for their business.
22% of sites were sites that we couldn't find. (Score:2)
Sources and sinks (Score:3)
2) Advertisers and news sites link into corporate pages
3) Personal home pages are highly likely to link into popular sites, but not be linked-into themselves
Applying these ideas, and others like them, leads to the "bowtie".
Re:The web is broken. (Score:3)
> The web is broke. We're not using it properly
I agree with your second statement. The web isn't broke... people just aren't using it properly. There are so many corporate sites that look like brochures. It's sickening. My previous job was to set up a web page for a small business, and all they wanted me to do was scan each page of their brochure into GIFs, put them up on the web, and put "forward" and "backward" buttons on the bottom to navigate between pages. I said, WTF!?!? The concept of actually including text information and links to other resources was totally absurd to my boss.
These kinds of people think of the web only as a marketing tool, and thus can't take advantage of the power it has to offer.
Look at news sites. How many times do you come across articles that are taken word-for-word directly from the printed page? (Almost to the point that it says, "continued on page 3C".)
The worst part is the page-turning. You know, the "next page" links at the bottom of articles. That right there is a sign that your site is broken. You're using a static and linear approach in a dynamic and nonlinear medium.
Break the story up. Link, God damn it! If a company gets mentioned, link to it, not to one of those pathetic stock-quote drivel pages that news sites make. If some person made a speech, don't just quote the one or two sentences, link to the speech.
I'm convinced that the web is going to suck until our children ascend to power. Look at television. In the early days of the late 40s and 50s everything was very rigid. You basically had radio programs being done in front of a camera. Only after a generation was raised on television did you actually get programs that started to take advantage of the medium. Compare how news was done in 1950 to how it's done today. Look at educational television. Before, you had the monotone droning voice of an old man, and now you have Sesame Street. The same thing is going to happen to the web.
Re:Have they really thought it through. (Score:2)
The web is broken. (Score:5)
The web is broke. We're not using it properly, there are too many poorly done corporate sites, contributing to insecurity, poor usability and incompatibility.
Many clients we work with are dead set against sending anyone away from their site. I don't think they realize that links are what the web is made of. This contributes to the unreachable part of the bowtie. These corporate folk are afraid that by linking away from the site, they will lose a viewer, and that the user won't find their way back. They don't realize that the web is a pull technology, and if the user was looking for certain information, the user will come back if the site is the best source of such info. The back button is one of the browser's most used features.
We need more of these research projects to help us figure out what needs to be changed. The W3C is a start, but it's expensive to join [w3.org] and it's rare that you find a website that conforms to the standards. In fact, I've run into web developers who have never HEARD of the w3c.
The web is a new, completely different medium. It's not a CDROM, it's not a brochure, it's not TV. We can't keep treating it like these other media.
Re:Is it just me ? (Score:2)
Well, according to the abstract [ibm.com],
So it has immediate practical use if you're writing spiders, and so on. I'm not sure whether "insight into... the sociological phenomena which characterize [the Web's] evolution" counts as something which does you any good, but you never know where the resulting studies might lead.(Anyhow, who says research has to do anyone any good?)
Re:22% of sites were sites that we couldn't find. (Score:2)
I've often wondered how many sites aren't linked to by any other site and have never been scanned by a search engine. Chances are we'll never know.