sllort's Journal: Attention MySql Experts (a call for comments)
Have you ever noticed you can't view old Slashdot comments by users who have posted more than 24 comments? For instance, take a look at this user page. That account has posted 702 comments, of which only 24 are linked.
Slashcode has the ability to display old comments, but it is disabled for Slashdot itself. I talked to CmdrTaco about this in IRC a few days back. I'm not going to bother pasting an entire IRC log in here, but I will present the following summary of what I learned:
- Slashdot does not enable old comment viewing ('morecomments') because it does not scale to Slashdot's volume (i.e. it generates too much database load).
- Allowing each Slashdot user to view old comments only once per day would also generate too much load.
- Allowing each Slashdot user to view only one old comment per day would also generate too much load.
- A patch that generated a static HTML page of old comments for each user, regenerating those pages every night during off-peak hours for users who had posted new comments, would be rejected because it would generate too much load on Slashdot.
This is all well and good, and nobody wants to tax Slashdot's servers to death. CmdrTaco wasn't able to mention any specifics about why this operation would generate too much load on Slashdot's servers, because he's tired of explaining technical details about Slashdot.
But I'm curious. Having read the code, it appears to me that MySql just makes a SELECT call on a table full of comments, using the most recent comment id as the index, every time you load users.pl (granted, it caches this page, but the page still has to be regenerated regularly). I can't figure out why changing the index on this SELECT call would generate as much as 1000x the load on the MySql database, and neither can anyone else I've asked.
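For what it's worth, here is a minimal sketch of the lookup as I understand it, with SQLite standing in for MySQL and an invented schema (the real Slash table and column names may differ). Anchoring the same indexed, LIMITed query at an older cid does the same kind of work as fetching the newest page:

```python
import sqlite3

# Stand-in for the Slash comments table; schema and column names are guesses.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (cid INTEGER PRIMARY KEY, uid INTEGER, subject TEXT)")
db.execute("CREATE INDEX comments_uid ON comments (uid)")
db.executemany("INSERT INTO comments VALUES (?, ?, ?)",
               [(i, i % 10, "subject %d" % i) for i in range(1, 703)])

# The "latest 24 comments" query users.pl appears to run, here for uid 5:
newest = db.execute(
    "SELECT cid, subject FROM comments WHERE uid = 5 "
    "ORDER BY cid DESC LIMIT 24").fetchall()

# The same query anchored at an older cid; the index serves both equally:
older = db.execute(
    "SELECT cid, subject FROM comments WHERE uid = 5 AND cid < ? "
    "ORDER BY cid DESC LIMIT 24", (newest[-1][0],)).fetchall()

print(len(newest), len(older))  # both pages return 24 rows here
```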
So, my open call for comments is as follows:
- Why do you think retrieving old comments is so hardware intensive for MySql/Slash?
- Can you think of any way to view old Slashdot comments without modifying what Slashdot is running?
Please don't bother responding "use Google"; search engines don't archive posts at 0 or -1, or in journals, rendering them nearly useless.
I'm sure someone out there is a brilliant MySql programmer who knows why retrieving older records can nearly destroy a database's performance. Enlighten me!
Table Structure (Score:2)
No clue how slashdot's tables are structured, so this is just a possibility, but it could be that old comments are pushed to historical tables and only new comments are kept in the "live" table.
For instance, the ISP I work for keeps RADIUS information in SQL. We keep only the current month's data in a 'live' table (which is on a fast RAID set) and archive previous months' data. The previous months' data is still accessible; however, it is more "expensive" for us to access the archives because they live on a different RAID set (more capacity to hold the archives, but slower access speeds). Possibly Slashdot does something similar with comments.
Note that we're storing only about 1K per record, while the Slashdot comments are much larger in size, so they may be hitting storage constraints in keeping comments archived even sooner than we do with our records.
Re:Table Structure (Score:1)
The big question then is why are old comments [slashdot.org] still available, and does linking to them somehow deeply tax the database?
Just reading Slashcode isn't enough to answer this question, because there might be something funky about how MySql is actually storing the old records that isn't exposed in Slashcode. Thanks for the example.
Re:Table Structure (Score:1)
There are a couple of problems though.
It can't handle locking well.
Table-level locking on large tables is the bottleneck 99% of the time in my experience. Combine an update with a select over a couple of tables and a complex WHERE clause, and the database is locked until the select completes.
The optimizer is not always right, and it's best to check your queries with 'EXPLAIN' if they're taking an inordinately long time.
Indexes make a huge difference.
I've found joins are more expensive than I'd expect, especially over more than two or three tables.
There's a couple of ways to work around these problems.
Replication [mysql.com] servers, where updates get a lower priority, are great for large reports/selects that won't kill your main DB.
Merge [mysql.com] tables can be used to split huge tables into smaller ones. This is useful for time-sequence data, where 99% of the time you want to query only the last 3 months' worth of data, but the facility is still there to query the whole table if needed.
To avoid joins, it's a matter of caching smaller, rarely-changed tables in memory and taking the RAM hit on the webserver.
But really, if you're determined to write your website using Perl & MySQL, unless it's written smart from the start, it'll be pain throughout.
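As a rough illustration of the merge-table idea, here is a sketch using SQLite, where a UNION ALL view stands in for MySQL's MERGE storage engine (which unions identically-structured MyISAM tables); all table names are invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# One identically-structured table per quarter; the split is illustrative.
for q in ("2001q4", "2002q1"):
    db.execute("CREATE TABLE comments_%s (cid INTEGER PRIMARY KEY, body TEXT)" % q)
db.executemany("INSERT INTO comments_2001q4 VALUES (?, ?)",
               [(i, "old") for i in range(1, 100)])
db.executemany("INSERT INTO comments_2002q1 VALUES (?, ?)",
               [(i, "new") for i in range(100, 150)])

# Stand-in for a MERGE table: a view unioning the per-period tables.
db.execute("CREATE VIEW comments_all AS "
           "SELECT * FROM comments_2001q4 UNION ALL SELECT * FROM comments_2002q1")

# Hot queries hit only the small current table...
recent = db.execute("SELECT COUNT(*) FROM comments_2002q1").fetchone()[0]
# ...but the whole history is still reachable when needed.
total = db.execute("SELECT COUNT(*) FROM comments_all").fetchone()[0]
print(recent, total)  # 50 149
```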
Slashdot's search (Score:2)
As far as finding old posts, the only thing I can think of is if you use a sig (I have that disabled, cleans up the display somewhat) you could search comments for that text string.
Or, write a script that takes a starting sid # and requests each discussion (decrementing the sid # each time) at threshold=-1 and greps for the username. That would be a bit extreme, though. If you were a Perl wizard, you could probably write it to return a list of cids (as HTML links) for a given username. Even better, you could store all those old sids locally, and then you wouldn't have to request them again. I suppose you could even post them as static pages on a server somewhere and make Google index them. The "Slashdot -1 Threshold Wayback Machine", so to speak.
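The first half of that idea (walking sids backward and building the threshold=-1 URLs you would fetch and grep) can be sketched like this; the plain-numeric sid form is an assumption, since real Slashdot sids may be date-based strings, and the fetch-and-grep step is left out:

```python
# Sketch of the crawl plan from the comment above: walk backward from a
# starting sid and build the threshold=-1 discussion URLs you would then
# fetch and grep for a username. The numeric sid form is an assumption.
def discussion_urls(start_sid, count):
    base = "http://slashdot.org/article.pl?sid=%d&threshold=-1"
    return [base % sid for sid in range(start_sid, start_sid - count, -1)]

urls = discussion_urls(25000, 3)
for u in urls:
    print(u)
```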
Re:Slashdot's search (Score:1)
Or, write a script that takes a starting sid # and requests each discussion (decrementing the sid # each time) at threshold=-1 and greps for the username. That would be a bit extreme, though. If you were a Perl wizard, you could probably write it to return a list of cids (as HTML links) for a given username. Even better, you could store all those old sids locally, and then you wouldn't have to request them again. I suppose you could even post them as static pages on a server somewhere and make Google index them. The "Slashdot -1 Threshold Wayback Machine", so to speak.
The solution you've described is the only one I've come up with as well. Sadly, their robots.txt forbids this, and they classify any attempt to archive their old material as a 'scripted attack against Slashdot'. Personally I think it might be wise of them to let us archive it before VA folds, but who knows.
Maybe I should break the law? It's just that I've never used a script to access Slashdot, ever. Why break the rules now?
Along with static pages (Score:1)
Re:Along with static pages (Score:1)
I think you mean it would only update when someone posted a comment. It's a good idea, but remember, the idea of static pages linking to comments was shot down by CmdrTaco as putting too much load on the servers. As to whether your suggested implementation of this (only indexing forward) might improve matters, I can't say. I'll ask if he's ever free in IRC again.
-s.
Re:Along with static pages (Score:1)
Come on sllort... (Score:1)
Re:Come on sllort... (Score:1)
There's some other reason.
One hypothesis (Score:2)
I'm somewhat familiar with MySQL, but at work, I use Oracle 8i (8.1.7, to be precise), running on an SGI machine. A SELECT UNIQUE INDEX - a SELECT statement where an index key is specified in the WHERE clause - against an analyzed table retrieves a row in virtually no time -- if all the columns are less than 2K wide. (2048 bytes is the maximum size allowed by the VARCHAR2 type.)
If you're storing more than 2K of data in a field, it's gotta be either a CLOB (Character Large OBject) or a LONG RAW (Oracle 7's form of the CLOB/BLOB). Both of these types have hideous performance. (CLOBs can be stored externally, in separate files, and I think that even when they're stored in the same file as the table itself, they still end up somewhere far away from the rest of the row, which would jack up seek times tremendously.)
Anyway, given that Slashdot supports comments of 10,000 characters and beyond, it's fair to say that they use some very large storage object for the comment field.
If this field isn't stored physically in the same place as the rest of the record, then you're going to thrash your hard drive retrieving it.
Recently posted comments may very well have their entire logical record (the comment headers & body) all stored in very close physical proximity -- that'd allow for a very fast seek and retrieve. Older comments, however, might get fragmented more, which would cause dramatically longer access times.
That's my hypothesis.
Re:One hypothesis (Score:2)
If, as you hypothesize, it has to pull the entire comment from the table to get the CID... well that's just dumb. Maybe it's time I had a look [slashcode.com].
-s.
Re:One hypothesis (Score:2)
Your question, however, was about *viewing* old comments, which would (obviously) require a database pull of the entire comment body.
You're right, all you would need to make a link is the CID, and that would be fast; but if users actually clicked those links, then it'd have to pull the whole database record to retrieve the comment body -- probably very slow for reasons I outlined above.
Therefore, slashdot isn't doing the "cheap" link generation because letting users click those links would have a drastic effect on their server.
Re:One hypothesis (Score:2)
Maybe you're right that he doesn't want people viewing old comments, but then, Google lets you do that. And so does search.pl.
Re:One hypothesis (Score:2)
As for Google. Well. Wouldn't we all love to have a server farm the size of a small nation?
Taco is full of shit (Score:2)
As a matter of fact, MySQL is particularly well suited to database "paging" (scrolling through recordsets a given number of records at a time). It's the only database I know of that uses the LIMIT keyword. So, assuming the records are stored and an index is kept on the user's comments, it's no harder to look up comments 24 through 48 (or any other set of comments, for that matter) than it is to look up the first 24 comments.
The only possible way this is causing an extra "load" is that you need extra DB space to save all the comments. So, if the slashcode people archive all comments into plain HTML files and delete old comments out of the database, that could be the reason you can't look up old comments. But that has absolutely nothing to do with the server's CPU load or bandwidth, which Taco seemed to imply would be hammered. In that respect he's full of shit.
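To illustrate the LIMIT-based paging described above (SQLite stands in for MySQL here, and the schema is invented; MySQL also accepts the `LIMIT offset, count` spelling):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (cid INTEGER PRIMARY KEY, uid INTEGER)")
db.executemany("INSERT INTO comments VALUES (?, ?)", [(i, 1) for i in range(1, 101)])

def comment_page(uid, page, per_page=24):
    # MySQL writes this as LIMIT offset, count; SQLite spells it
    # LIMIT count OFFSET offset. Same query shape for every page.
    return db.execute(
        "SELECT cid FROM comments WHERE uid = ? ORDER BY cid DESC "
        "LIMIT ? OFFSET ?", (uid, per_page, page * per_page)).fetchall()

page1 = comment_page(1, 0)   # newest 24 comments
page2 = comment_page(1, 1)   # comments 25 through 48
print(len(page1), len(page2))  # 24 24
```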
Re:Taco is full of shit (Score:2)
Taco has stated that the 'morecomments' ability is available to editors. Only users are prevented from looking up old comments. So, the information has not been archived in any way. The table is there, but it's too load intensive for users to search it. According to Taco.
I take it you disagree?
SQL LIMIT keyword (Score:1)
I believe MSSQL supports that too. In the SQL Admin, you can right click on a table, choose "display" and "first 100 records" or something like that.
More comments.. [nvws] (Score:1)
Along with the following other options:
Cool, isn't it?
Just browsing old link (Score:2)