sllort's Journal: Attention MySql Experts (a call for comments)

Have you ever noticed you can't view old Slashdot comments by users who have posted more than 24 comments? For instance, take a look at this user page. That account has posted 702 comments, of which only 24 are linked.
Slashcode has the ability to display old comments, but it is disabled for Slashdot itself. I talked to CmdrTaco about this in IRC a few days back. I'm not going to bother pasting an entire IRC log in here, but I will present the following summary of what I learned:

  • Slashdot does not enable old comment viewing ('morecomments') because it does not scale to Slashdot's volume (i.e. it generates too much database load).
  • Allowing each Slashdot user to view old comments only once per day would also generate too much load.
  • Allowing each Slashdot user to view only one old comment per day would also generate too much load.
  • A patch that would generate a static HTML page of old comments for each user, regenerating these pages every night during off-peak hours for users who had posted new comments, would be rejected because it too would generate too much load on Slashdot.

This is all well and good, and nobody wants to tax Slashdot's servers to death. CmdrTaco wasn't able to mention any specifics about why this operation would generate too much load on Slashdot's servers because he's too tired of explaining technical details about Slashdot.

But I'm curious. Having read the code, it appears to me that MySql just makes a SELECT call on a table full of comments, using the most recent comment id as the index, every time you load users.pl (granted, it caches this page, but the page still has to be regenerated regularly). I can't figure out why changing the index on this SELECT would generate as much as 1000x the load on the MySql database, and neither can anyone else I've asked.
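To make the question concrete, here's roughly the lookup I have in mind, in Python with SQLite standing in for MySql. The schema and column names here are my guesses, not Slashcode's actual tables:

```python
import sqlite3

# Toy comments table; schema and column names are guesses, not Slash's.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (cid INTEGER PRIMARY KEY, uid INTEGER, subject TEXT)")
db.execute("CREATE INDEX comments_uid ON comments (uid, cid)")
db.executemany("INSERT INTO comments VALUES (?, ?, ?)",
               [(cid, cid % 5, "post %d" % cid) for cid in range(1, 201)])

# users.pl's view: the newest 24 comments for a user, walking the
# index backwards from the most recent cid.
newest = db.execute(
    "SELECT cid, subject FROM comments WHERE uid = ? "
    "ORDER BY cid DESC LIMIT 24", (3,)).fetchall()

# Viewing *older* comments is the same indexed query with a different
# starting point -- hard to see where a 1000x cost difference comes from.
older = db.execute(
    "SELECT cid, subject FROM comments WHERE uid = ? AND cid < ? "
    "ORDER BY cid DESC LIMIT 24", (3, newest[-1][0])).fetchall()

print(len(newest), len(older))
```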
So, my open call for comments is as follows:

  1. Why do you think retrieving old comments is so hardware intensive for MySql/Slash?
  2. Can you think of any way to view old Slashdot comments without modifying what Slashdot is running?

Please don't bother responding "use google"; search engines don't archive posts at 0, at -1, or in journals, which renders them nearly useless for this.

I'm sure someone out there is a brilliant MySql programmer who knows why retrieving older records can nearly destroy a database's performance. Enlighten me!

This discussion has been archived. No new comments can be posted.

  • Depends on table structure perhaps?

    No clue how slashdot's tables are structured, so this is just a possibility, but it could be that old comments are pushed to historical tables and only new comments are kept in the "live" table.

    For instance, the ISP I work for keeps RADIUS information in SQL. We keep only the current month's data in a 'live' table ( which is on a fast RAID set ) and archive previous months' data. The previous months' data is still accessible; however, it is more "expensive" for us to access the archives because they live on a different RAID set ( more capacity to hold the archives, but slower access speeds ). Possibly Slashdot does something similar with comments.

    Note that we're storing only about 1K per record, while the Slashdot comments are much larger in size, so they may be hitting storage constraints in keeping comments archived even sooner than we do with our records.

    • The available data would support this analysis. Newer comments must for some reason be far 'cheaper' to retrieve than old comments.

      The big question then is why are old comments [slashdot.org] still available, and does linking to them somehow deeply tax the database?
      Just reading Slashcode isn't enough to answer this question, because there might be something funky about how MySql is actually storing the old records that isn't exposed in Slashcode. Thanks for the example.
    • I feel bad about complaining about a very good, free, open source product such as Mysql.

      There are a couple of problems though.

      It can't handle locking.
      Table level locking on large tables is the bottleneck 99% of the time in my experience. Combine an update with a select over a couple of tables with a complex WHERE clause, and the database is locked until the select completes.

      The optimizer is not always right, and it's best to check your queries with 'EXPLAIN' if they're taking an inordinately long time.

      Indexes make a huge difference.

      I've found joins are more expensive than I'd expect, especially over more than two or three tables.

      There are a couple of ways to work around these problems.

      Replication [mysql.com] servers, where updates get a lower priority, are great for large reports/selects that won't kill your main DB.

      Merge [mysql.com] tables can be used to split huge tables up into smaller ones. This is useful for time sequence data, where 99% of the time you want to query the last 3 months' worth of data, but the facility is still there to query the whole table if needed.

      To avoid joins, it's a matter of caching smaller, rarely changed tables in memory and taking the RAM hit on the webserver.

      But really, if you're determined to write your website using perl & mysql, unless it's written smart from the start, it'll be pain throughout.
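      A quick way to see the "indexes make a huge difference" point is to ask the optimizer directly, as suggested above. A small demonstration in Python, using SQLite's EXPLAIN QUERY PLAN as a stand-in for MySql's EXPLAIN (the table here is made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (cid INTEGER PRIMARY KEY, uid INTEGER)")

# Without an index on uid, a per-user lookup has to scan the whole table.
before = db.execute(
    "EXPLAIN QUERY PLAN SELECT cid FROM comments WHERE uid = 42").fetchall()

db.execute("CREATE INDEX comments_uid ON comments (uid)")

# With the index in place, the planner switches to an index search.
after = db.execute(
    "EXPLAIN QUERY PLAN SELECT cid FROM comments WHERE uid = 42").fetchall()

print(before[0][-1])  # something like: SCAN comments
print(after[0][-1])   # something like: SEARCH comments USING ... INDEX comments_uid
```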

  • I think a lot of people have complained about the search features being less than advanced. Since everything lives in a database back-end, I don't see why it would be terribly hard to improve things a bit, such as being able to enter date ranges, search by userid, ability to include/exclude sections independently, etc.

    As far as finding old posts, the only thing I can think of is if you use a sig (I have that disabled, cleans up the display somewhat) you could search comments for that text string.

    Or, write a script that takes a starting sid # and requests each discussion (decrementing the sid # each time) at threshold=-1 and greps for the username. That would be a bit extreme though. If you were a perl wizard you could probably write it to return a list of cid's (as html links) for a given username. Even better, you could store all those old sid's locally and then you wouldn't have to request them again. I suppose you could even post them as static pages on a server somewhere and make google index them. The "Slashdot -1 Threshold Wayback Machine" so to speak.

    • Or, write a script that takes a starting sid # and requests each discussion (decrementing the sid # each time) at threshold=-1 and greps for the username. That would be a bit extreme though. If you were a perl wizard you could probably write it to return a list of cid's (as html links) for a given username. Even better, you could store all those old sid's locally and then you wouldn't have to request them again. I suppose you could even post them as static pages on a server somewhere and make google index them. The "Slashdot -1 Threshold Wayback Machine" so to speak.


      The solution you've described is the only one I've come up with as well. Sadly, their robots.txt forbids this, and they classify any attempt to archive their old material as a 'scripted attack against Slashdot'. Personally I think it might be wise of them to let us archive it before VA folds, but who knows.

      Maybe I should break the law? It's just that I've never used a script to access Slashdot, ever. Why break the rules now?
  • Of course this wouldn't be retroactive, but as Slashcode appends each new comment to the comments table, it could also append a link to that comment to a static HTML file. There would be no extra MySQL call, and the file would only update whenever someone posted (which *shouldn't* be so much that /. would crash and burn).
    • and it would only update whenever someone posted an article

      I think you mean it would only update when someone posted a comment. It's a good idea, but remember, the idea of static pages linking to comments was shot down by CmdrTaco as putting too much load on the servers. As to whether your suggested implementation of this (only indexing forward) might improve matters, I can't say. I'll ask if he's ever free in IRC again.
      -s.
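      For what it's worth, the append-a-link idea is cheap to sketch. A minimal, hypothetical version in Python -- the directory, filename scheme, and the hook itself are my inventions, not anything in Slashcode:

```python
import html
import os

ARCHIVE_DIR = "comment_archives"  # hypothetical location for per-user pages

def append_comment_link(nickname, cid, subject):
    """Called whenever a user posts a comment: append one link to that
    user's static page. No database query is involved -- the posting
    code already has the cid and subject in hand."""
    os.makedirs(ARCHIVE_DIR, exist_ok=True)
    path = os.path.join(ARCHIVE_DIR, "%s.html" % nickname)
    with open(path, "a") as f:
        f.write('<a href="//slashdot.org/comments.pl?cid=%d">%s</a><br>\n'
                % (cid, html.escape(subject)))

append_comment_link("sllort", 12345, "Attention MySql Experts")
```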
  • Sllort, of all people, I thought you would realize Malda just doesn't want to update Slashcode. Oh well...
  • I'm quite intrigued by this question.

    I'm somewhat familiar with MySQL, but at work, I use Oracle 8i (8.1.7, to be precise), running on an SGI machine. A SELECT UNIQUE INDEX - a SELECT statement where an index key is specified in the WHERE clause - against an analyzed table retrieves a row in virtually no time -- if all the columns are less than 2K wide. (2048 bytes is the maximum size allowed by the VARCHAR2 type.)

    If you're storing more than 2K of data in a field, it's gotta be either a CLOB (Character Large OBject) or a LONG RAW (Oracle 7's form of the CLOB/BLOB). Both of these types have hideous performance. (CLOBs can be stored externally -- in separate files, and I think that when they're stored in the same files as the table itself, it's still somewhere far away from the rest of the row, which would jack up seek times tremendously.)

    Anyway, given that Slashdot supports comments of 10,000 characters and beyond, it's fair to say that they use some very large storage object for the comment field.

    If this field isn't stored physically in the same place as the rest of the record, then you're going to thrash your harddrive retrieving it.

    Recently posted comments may very well have their entire logical record (the comment headers & body) all stored in very close physical proximity -- that'd allow for a very fast seek and retrieve. Older comments, however, might get fragmented more, which would cause dramatically longer access times.

    That's my hypothesis.
    • That's by far the best answer I've heard. If it's bullshit, you've fooled me. One thing I don't know is whether the lookup just pulls the CID (the Comment Id) which is just a tiny number, or whether it pulls the whole comment, which is huge. If it pulls the CID, your explanation doesn't hold up. The CID, by the way, is all you need to generate a link to the comment - which is what is required for the search page.
      If, as you hypothesize, it has to pull the entire comment from the table to get the CID... well that's just dumb. Maybe it's time I had a look [slashcode.com].
      -s.
      • Well, a comment list would only need to pull up the CID, title, and date. That would be a very fast operation.

        Your question, however, was about *viewing* old comments, which would (obviously) require a database pull of the entire comment body.

        You're right, all you would need to make a link is the CID, and that would be fast; but if users actually clicked those links, then it'd have to pull the whole database record to retrieve the comment body -- probably very slow for reasons I outlined above.

        Therefore, slashdot isn't doing the "cheap" link generation because letting users click those links would have a drastic effect on their server.
        • I need to clarify then. What Rob is talking about is generating an index of links to comments older than the first 24. This is the operation that is too expensive to perform, even once per user per day. I'll try to edit the story to be more clear.

          Maybe you're right that he doesn't want people viewing old comments, but then, Google lets you do that. And so does search.pl.
  • It's no more database intensive to look up older comments (i.e. comments 24-48, 48-72, etc.) than it is to, say, browse through your friends list and look at their last 24 comments.

    As a matter of fact, mysql is particularly well suited to database "paging" (scrolling through recordsets a given number of records at a time). They are the only database I know of that uses the LIMIT keyword. So, assuming the records are stored and an index is kept on the user comments, it's no harder to look up comments 24 through 48 (or any other set of comments, for that matter) than it is to look up the first 24 comments.

    The only possible way this is causing an extra "load" is that you need extra DB space to save all the comments. So, if the slashcode people archive all comments into plain HTML files and delete old comments out of the database, that could be the reason you can't look up old comments. But that has absolutely nothing to do with the server's CPU load or bandwidth, which Taco seemed to imply would be hammered. In that respect he's full of shit.
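    The paging claim is easy to try. A sketch in Python, with SQLite standing in for mysql and a made-up comments table (MySql's own syntax would be LIMIT offset, count):

```python
import sqlite3

# Made-up comments table: one user with 702 comments, like the example account.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE comments (cid INTEGER PRIMARY KEY, uid INTEGER)")
db.execute("CREATE INDEX comments_uid ON comments (uid, cid)")
db.executemany("INSERT INTO comments VALUES (?, ?)",
               [(cid, 7) for cid in range(1, 703)])

def page(n, size=24):
    # Page n of a user's comments, newest first. The query is identical
    # for every page; only the offset changes.
    return db.execute(
        "SELECT cid FROM comments WHERE uid = 7 "
        "ORDER BY cid DESC LIMIT ? OFFSET ?", (size, n * size)).fetchall()

first = page(0)   # comments 1-24
second = page(1)  # comments 25-48: same indexed query, different offset
print(first[0][0], second[0][0])
```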

    • The only possible way this is causing an extra "load" is that you need extra DB space to save all the comments.

      Taco has stated that the 'morecomments' ability is available to editors. Only users are prevented from looking up old comments. So, the information has not been archived in any way. The table is there, but it's too load intensive for users to search it. According to Taco.
      I take it you disagree?
    • As a matter of fact mysql is particularly well suited to database "paging"(scrolling through recordsets a given number of records at a time) They are the only database I know of that uses the LIMIT keyword

      I believe MSSQL supports that too. In the SQL Admin, you can right click on a table, choose "display" and "first 100 records" or something like that.
  • That option will soon be available to subscribers only. :)

    Along with the following other options:

    • Search comments text; Search using boolean queries
    • Change scores of comments in your own journal; Choose journal archivation time; unarchive journal; erase journal comments; include pictures in journal entries.
    • See your numeric karma.
    • Use HTML entities in your comments.
    • Find who moderated you.
    • Use the brand new shiny Slashdot Personals [slashdot.org] system.
    • Hide the CowboyNeal poll option.
    • Gain access to the Kathleen Fent pictures area

    Cool, isn't it?

  • It may be that they want to limit the amount of content online. Even if generating the index of comments could be done inexpensively, people clicking around each other's old comments would add extra traffic, all of it hitting the database. I would also agree with other posters that older data is likely stored in a different system...
