Googlebot and Document.Write
Journal written by gbulmash (688770) and posted by kdawson on Mon Mar 12, 2007 12:06 AM
from the ajax-the-foaming-indexer dept.
With JavaScript/AJAX being used to place dynamic content in pages, I was wondering how Google indexed web page content that was placed in a page using the JavaScript "document.write" method. I created a page with six unique words in it. Two were in the plain HTML; two were in a script within the page document; and two were in a script that was externally sourced from a different server. The page appeared in the Google index late last night and I just wrote up the results.
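A test page of the kind described above could be generated along these lines. This is only a sketch: the function names `buildTestPage` and `buildExternalScript`, the page title, and the script URL in the usage note are hypothetical, and the six words are placeholders rather than the words the author actually used.

```javascript
// Hypothetical helper: builds a test page with words in three places.
// `words` holds the four words that live in the page itself;
// `externalScriptUrl` points at a script hosted on a different server.
function buildTestPage(words, externalScriptUrl) {
  const [html1, html2, inline1, inline2] = words;
  return [
    "<!DOCTYPE html>",
    "<html><head><title>document.write indexing test</title></head>",
    "<body>",
    // 1. Two words in plain HTML, readable without running any script.
    `<p>${html1} ${html2}</p>`,
    // 2. Two words that only appear when the inline script runs.
    `<script>document.write("<p>${inline1} ${inline2}</p>");</script>`,
    // 3. Two words that only appear when the external script is fetched and run.
    `<script src="${externalScriptUrl}"></script>`,
    "</body></html>",
  ].join("\n");
}

// Hypothetical helper: body of the externally hosted script.
function buildExternalScript(words) {
  const [ext1, ext2] = words;
  return `document.write("<p>${ext1} ${ext2}</p>");`;
}
```

For example, `buildTestPage(["wordone", "wordtwo", "wordthree", "wordfour"], "http://other-server.example/words.js")` gives the page, and `buildExternalScript(["wordfive", "wordsix"])` gives the file to host on the second server. Searching for each pair of words later shows which of the three sources the crawler read.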

Nonsense words? (Score:5, Funny)
zonkdogfology is a real word: Serious question now - is the author of the article worried that the ensuing Slashdot discussion will mention all his other nonsense words? I've no doubt Slashdotters will find & mention the other words here, polluting Google's index....
Re:Nonsense words? (Score:4, Funny)
It's a perfectly cromulent word, and its use embiggens all of us.
The Results: (Score:5, Informative)
Re:The Results: (Score:5, Informative)
How does document.write mess up your DOM tree? (Score:2)
Re:How does document.write mess up your DOM tree? (Score:5, Informative)
If you're using document.write, you're writing directly into the document stream, which only works in text/html, not an XHTML MIME type, because there's no way to guarantee the document will continue to be valid.
In this day and age, document.write should never be used, in favor of the more verbose but more future-proof document.createElement and document.createTextNode notation.
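A minimal sketch of the DOM-method approach mentioned above. The function name `appendParagraph` and its parameters are hypothetical; `createElement`, `createTextNode`, and `appendChild` are the standard DOM calls.

```javascript
// Adds a paragraph to `parent` using DOM methods instead of document.write.
// `doc` is the Document to create nodes from (normally the global `document`).
function appendParagraph(doc, parent, text) {
  const p = doc.createElement("p");
  // The text goes in as a text node, so it is never parsed as markup.
  p.appendChild(doc.createTextNode(text));
  parent.appendChild(p);
  return p;
}
```

In a browser this would be called as `appendParagraph(document, document.body, "Hello")`. Because the nodes are built through the tree rather than written into the stream as text, the same call works whether the page is served as text/html or as XHTML.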
Re:How does document.write mess up your DOM tree? (Score:5, Insightful)
Re:How does document.write mess up your DOM tree? (Score:4, Funny)
One of the most clever uses of document.write I've seen was something like: document.write("<!--") YOU NEED JAVASCRIPT FOR THIS PAGE document.write("-->")
Re: (Score:3, Insightful)
Based on all the segfaults, blue screens of death, X-Window crashes, Firefox crashes, code insertion bugs et cete
Re: (Score:3, Interesting)
If you code to the standard, at least you can blame browsers fo
True (Score:2)
google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)
I think the actual experiment here is:
I look forward to the follow-up piece which details the financial results.
Re:google.com/?q=slashdotting+in+google+dollars (Score:5, Insightful)
Re: (Score:2)
Shall we all migrate to Technocrat, anyone? It has decent stories.
Re: (Score:3, Insightful)
It used to be that the web as a whole avoided this crap. Now, it's so easy to make stupid amounts of money from stupid content that a huge percentage of what get
Google Pigeon technolog (Score:3, Funny)
If they weren't, then they're trying (Score:4, Interesting)
Google needs to consider script if they want high-quality results. Besides the obvious fact that they'll miss content supplied by dynamic page elements, they could also sacrifice page quality. Page-rank and the like will get them very far, but an easy way to spam the search engines would be to have pages on a whole host of topics that immediately get rewritten as ads for Viagra as soon as they're downloaded by a Javascript-aware browser. It's interesting to know the extent to which they correct for this.
Of course, there are much more subtle ways of changing content once it's been put out there. One might imagine a script that waits 10 seconds and then removes all relevant content and displays Viagra instead. Who knew web search would be restricted by the halting problem? I wonder how far Google goes...
Re: (Score:2)
Re:If they weren't, then they're trying (Score:5, Insightful)
And if pages are designed using AJAX and dynamic rendering just for the sake of using AJAX and dynamic rendering.. well, they deserve what they get
How did this make the front page? (Score:2, Insightful)
Re: (Score:2)
Re: (Score:2)
Re: (Score:2)
Re: (Score:3, Informative)
Re: (Score:3, Informative)
Did you know that 99% of all statistics are made up?
I can source some Javascript statistics: W3Schools reports [w3schools.com] that,
Re: (Score:3, Interesting)
Google request external JavaScript file? (Score:4, Insightful)
Re: (Score:3, Informative)
Doesn't work; Good (kind of) (Score:5, Insightful)
The model for websites is supposed to work something like this:
In other words, your web page should work for any browser that supports HTML. It should work regardless of whether CSS and/or Javascript is enabled.
So why would Google's crawler look at the Javascript? Javascript is supposed to enhance content, not add it.
Now, that's not saying many people don't (incorrectly) use Javascript to add content to their pages. But maybe when they find out search engines aren't indexing them, they'll change their practices.
The only problem I can see is with scam sites, where they might put content in the HTML, then remove/add to it with Javascript so the crawler sees something different than the end-user does. I think they already do this with CSS, either by hiding sections or by making the text the same color as the background. Does anyone know how Google deals with CSS that does this?
Re: (Score:3, Informative)
Re: (Score:2)
Re: (Score:3, Insightful)
Define "work". A web page without formatting is going to be useless to anyone who isn't a part-time web dev
Re: (Score:3, Insightful)
Re: (Score:2)
Re: (Score:3, Insightful)
Re: (Score:3, Insightful)
The model for websites is supposed to work something like this:
If only. Turn off JavaScript and try these sites:
Re: (Score:2)
Re:Doesn't work; Good (kind of) (Score:4, Informative)
Re: (Score:3, Insightful)
I would make normal links, then use JS on top (Score:4, Insightful)
It's a nice improvement. Less bandwidth used, and a quicker interface.
Unfortunately, it's not often done right. The way I would do it is to first make the menu work like it normally would. Make each menu item a link to a new page. Then you apply Javascript to the menu item. Something like this: (FYI, this is how I do pop-up windows, too.)
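The approach described above can be sketched roughly as follows. This is an illustration, not the poster's own code: `enhanceMenu` and `loadInto` are hypothetical names, and the sketch assumes each menu item is an ordinary link whose `href` already points at a working page.

```javascript
// Layers script behaviour on top of links that already work without it.
// `links` is any iterable of anchor elements; `loadInto` is a hypothetical
// callback that fetches the linked page and shows it in place.
function enhanceMenu(links, loadInto) {
  for (const link of links) {
    link.addEventListener("click", (event) => {
      // With scripting on, cancel the normal navigation...
      event.preventDefault();
      // ...and load the same URL without a full page reload.
      loadInto(link.href);
    });
  }
}
```

In a page this might be called as `enhanceMenu(document.querySelectorAll("#menu a"), showSection)`, where `#menu` and `showSection` are again assumptions. If the script never runs, nothing is attached and the links behave as plain links, which is what crawlers, older browsers, and screen readers would see.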
Putting it behind a login screen doesn't solve all the problems. You're right that it won't be searchable anyway, but people with older browsers or screen readers won't be able to access it.
I think Gmail actually offers two versions: one for older browsers that uses no (or little?) Javascript, and the other, which almost everyone else (including me) uses and loves. But I'm not sure how easy it would be to maintain two versions of the same code like that. I also don't think it's nice for the end user to have to choose "I want the simple version", though I guess it may encourage them to update to a newer browser.
(Of course this is all "ideally speaking", I realize there are deadlines to meet and I violate some of my own guidelines sometimes. I still think they're good practices, though.)
Re: (Score:2)
Accessibility? (Score:2, Informative)
Document.write() is not the way to go (Score:2)
From TFA: (Score:2)
(tagging beta) (Score:2)
Google doesn't, but it's possible (Score:3, Informative)
I'd thought Google would be doing that by now. I've been implementing something that has to read arbitrary web pages (see SiteTruth [sitetruth.com]) and extract data, and I've been considering how to deal with JavaScript effectively.
Conceptually, it's not that hard. You need a skeleton of a browser, one that can load pages, run Javascript like a browser, and build the document tree, but doesn't actually draw anything. You load the page, run the initial OnLoad JavaScript, then look at the document tree as it exists at that point. Firefox could probably be coerced into doing this job.
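A very reduced sketch of the idea, covering only the document.write case from the article. It builds no document tree and runs no OnLoad handlers; it just runs a script against a stand-in `document` and records what the script writes. The name `captureDocumentWrite` is hypothetical, and `new Function` is not a sandbox, so this assumes the script is trusted.

```javascript
// Runs `scriptSource` with a stub `document` and returns everything the
// script passed to document.write, in order, as one string.
// Assumption: the script only needs document.write; anything else will throw.
function captureDocumentWrite(scriptSource) {
  const written = [];
  const stubDocument = {
    // document.write accepts several arguments and concatenates them.
    write: (...parts) => {
      written.push(parts.join(""));
    },
  };
  // The script receives the stub as a parameter named `document`,
  // which shadows any global of that name.
  const run = new Function("document", scriptSource);
  run(stubDocument);
  return written.join("");
}
```

An indexer built this way would feed the returned string to the same text extraction it uses for plain HTML. Handling real pages would take an actual script engine and DOM, as the comment says.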
It's also possible to analyze Flash files. Text which appears in Flash output usually exists as clear text in the Flash file. Again, the most correct approach is to build a pseudo-renderer, one that goes through the motions of processing the file and executing the ActionScript, but just passes the text off for further processing, rather than rendering it.
Ghostscript [ghostscript.com] had to deal with this problem years ago, because PostScript is actually a programming language, not a page description language. It has variables, subroutines, and an execution engine. You have to run PostScript programs to find out what text comes out.
OCR is also an option. Because of the lack of serious font support in HTML, most business names are in images. I've been trying OCR on those, and it usually works if the background is uncluttered.
Sooner or later, everybody who does serious site-scraping is going to have to bite the bullet and implement the heavy machinery to do this. Try some other search engines. Somebody must have done this by now.
Again, I'm surprised that Google hasn't done this. They went to the trouble to build parsers for PDF and Microsoft Word files; you'd think they'd do "Web 2.0" documents.
Re: (Score:2)
Re: (Score:2)
Does Google run macros in Word documents? No? Then why are you even comparing this?
If you want to see (Score:4, Funny)
If you want to see through a search engine's eyes, open the page in Lynx [browser.org]. The funniest part about showing that method to another developer is when they think Lynx is broken because the page is empty. "It didn't load. How do I refresh the page? This browser sucks." Heh. Endless fun.
(method does not account for image crawlers)
Re: (Score:2)