Google's Academic TB Swap Project

eldavojohn writes "Google is transferring data the old fashioned way — by mailing hard drive arrays around to collect information and then sending copies to other institutions. All in the name of science & education. From the article, 'The program is currently informal and not open to the general public. Google either approaches bodies that it knows has large data sets or is contacted by scientists themselves. One of the largest data sets copied and distributed was data from the Hubble telescope — 120 terabytes of data. One terabyte is equivalent to 1,000 gigabytes. Mr. DiBona said he hoped that Google could one day make the data available to the public.'"

  • by garcia ( 6573 ) on Wednesday March 07, 2007 @12:02PM (#18262634)
    One terabyte is equivalent to 1,000 gigabytes.

    Uhh, no it isn't. It's really 0.9765625 terabytes.
    • by Cristofori42 ( 1001206 ) on Wednesday March 07, 2007 @12:06PM (#18262694)
      umm a terabyte is really 1 terabyte. Though 1 terabyte = 1024 gigabytes not 1000... but whatever.
      • Re: (Score:2, Informative)

        by garcia ( 6573 )
        Thanks for pointing out that I should have been hitting Preview instead of getting First Post :)

        1000GB = 0.9765625 TB, not 1TB.
      • Nope (Score:3, Informative)

        by sheldon ( 2322 )
        How you measure a terabyte depends on whether you are buying disk, or monitoring disk usage on your server.

        The disk manufacturers define it as 1000 megabytes which is 1000 kilobytes which is 1000 bytes.

        The OS measures it as 1024 megabytes, which is 1024 kilobytes, which is 1024 bytes

        Why? Because when you're buying a drive, 750 Gigs sounds bigger than 698.5 gigs.
      • Not according to NIST (Score:4, Interesting)

        by Ernesto Alvarez ( 750678 ) on Wednesday March 07, 2007 @04:51PM (#18267030) Homepage Journal
        If you want to be strict, the SI defines the "tera" prefix as 10^12, so 1 terabyte = 1000 gigabytes.

        If you want to use the binary values, you might as well use the correct "tebi" prefix. NIST [] says you should, and it looks like the IEC, IEEE and BIPM agree.
    • Re: (Score:2, Informative)

      by wizzard2k ( 979669 )
      From wikipedia:
      A tebibyte (a contraction of tera binary byte) is a unit of information or computer storage, abbreviated TiB.

      1 tebibyte = 2^40 bytes = 1,099,511,627,776 bytes = 1,024 gibibytes

      The tebibyte is closely related to the terabyte, which can either be an (inaccurate) synonym for tebibyte, or refer to 10^12 bytes = 1,000,000,000,000 bytes, depending on context.
    • Re: (Score:2, Insightful)

      by AchiIIe ( 974900 )
      Nope, that's wrong

      see: []
      * 1 Terabyte = 1000 Gigabyte
      * 1 Tebibyte = 1024 Gibibyte
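The two conventions argued over in this subthread can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the `convert` helper and the unit tables are names made up for this example, and the one assumption is that SI prefixes are powers of 1000 while IEC binary prefixes are powers of 1024.

```python
# Decimal (SI) prefixes and binary (IEC) prefixes, both expressed in bytes.
SI_UNITS = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
IEC_UNITS = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
UNITS = {**SI_UNITS, **IEC_UNITS}


def convert(value, from_unit, to_unit):
    """Convert a quantity between any two units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


if __name__ == "__main__":
    print("1 TB in GB:", convert(1, "TB", "GB"))
    print("1 TiB in GiB:", convert(1, "TiB", "GiB"))
    print("1 TiB in bytes:", UNITS["TiB"])
    print("750 GB in GiB:", round(convert(750, "GB", "GiB"), 1))
```

Under these definitions a drive sold as 750 GB holds roughly 698.5 GiB, which is the figure quoted above for drive labels versus what the operating system reports.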
    • Re: (Score:2, Insightful)

      by wolff000 ( 447340 )
      WHO CARES?!? I have worked with mathematicians that did not squabble over these terms so why the hell are we?!? My mother who can hardly turn a computer on knows damn well that 1000 megabytes is roughly 1 gigabyte. Now let's get back to the topic. It seems Google would have some brilliant way to push a terabyte through the "tubes" instead of just mailing drives, how archaic.
    • I'm just happy they're not swapping tuberculosis [].
    • by guruevi ( 827432 )
      I'm old and interested enough to know what REALLY happened through the history:

      First, as taught in any school book and computer manual through history (see Apple, Amiga, Microsoft, Commodore): 1024 bytes = 1Kilobyte, 1024 Kilobyte = 1 Megabyte etc. because the computer could only calculate in exponents of 2 (1 and 0) and 20MB (20480 kilobyte) was about the largest size hard drive you could get.

      A Kilobyte is 1024 (2^10) bytes. A Megabyte is 1024 Kilobytes or 1,048,576 bytes (2^20) and a Gigabyte is 1024 Mega
      • by HTH NE1 ( 675604 )
        The annoying part for me today is that flash memory is in powers of two (64 MB, 128 MB, 256 MB, 512 MB, etc.), be it for cameras or in USB thumbdrives, yet the units are metric, not binary (stating 1 MB == 1 million bytes on the packaging).

        When I see a power of 2 next to the units, I expect the units to be in a power of 2 too.
  • Large datasets (Score:5, Informative)

    by BWJones ( 18351 ) * on Wednesday March 07, 2007 @12:03PM (#18262648) Homepage Journal
    This is absolutely the most cost effective way of transferring large amounts of data like this. If you do the calculations on terabyte size files, sneakernet (or FedEx net) is actually faster and less expensive. We also went to one of Jim Gray's seminars when he was here giving an Organick Memorial Lecture and he made an incredibly compelling demonstration using a variety of data types. We ended up talking with him for some time after about new projects we are engaging in that will also be generating terabytes of data and his suggestion was to pass applications rather than data, which was interesting.

    This is becoming more and more the norm in scientific research and Google's work is quite welcome.

    • by Sobrique ( 543255 ) on Wednesday March 07, 2007 @12:09PM (#18262734) Homepage
      Never underestimate the bandwidth of a lorryload of backup tapes traveling at 60 miles an hour.

      Latency may leave something to be desired though :)

      • Never underestimate the bandwidth of a lorryload of backup tapes traveling at 60 miles an hour.

        Close enough.. This is attributable to Andy Tanenbaum according to [] (and one of his books I read).

        Another ontopic remark.

        Google either approaches bodies that it knows has large data sets

        I know people who also approach bodies that they know have large 'data sets', but that doesn't get them a lot of 'bandwidth' ;)

    • by UnknowingFool ( 672806 ) on Wednesday March 07, 2007 @12:11PM (#18262774)
      FedEx delivered what appeared to be a ton of broken office chairs to Google headquarters this morning. When asked for the sender's ID, the severely beaten FedEx courier would only reply that the sender wished to remain anonymous.
      • Mod parent up (Score:3, Informative)

        by ari_j ( 90255 )

        Here's what happened when I FedExed my RMA to Newegg, packed very carefully. Note the bent motherboard - I didn't even know you could do that. The good news is that FedEx paid part of my claim ... they paid $100 plus the $8.33 that the FedEx store charged me to fax in the claim forms. The bad news is that they did not refund my original shipping or pay more than $100 on the over $280 of damage that they did. It also took about 4 hours of phone calls to even convince FedEx that I was not the seller, and

        • The bad news is that they did not refund my original shipping or pay more than $100 on the over $280 of damage that they did.

          Did you buy additional insurance over the $100 you get by default?
          • Re:Mod parent up (Score:5, Informative)

            by MajinBlayze ( 942250 ) on Wednesday March 07, 2007 @03:26PM (#18265958)
            As a former UPS employee, (I worked as a package handler, the guy that beats the shit out of your boxes as he loads them on the truck) I will never ship anything of value without paying extra for the insurance. when you do that, a couple of things happen:
            1. the item goes into a big bag (by itself, not mixed with other items) with red/white stripes, so employees know not to mess with it
            2. it gets hand-carted to the destination truck, and is the last thing to be loaded, and first unloaded
            3. only seasoned workers ever touch your package, and generally care about the state that it's in
            4. finally, they are good about paying up if the item arrives damaged.
            did I forget to include ???? and Profit!
        • the insurance remedy was to return it to the origination address and ask to see an original purchase receipt to award the insurance claim

          Sorry to nitpick, but this scam has been around for ages - you broke something, oh no! I'll send it to myself and pretend UPS did it. Hell, I even saw it in Seinfeld. Not that you were doing this, but what you tried is pretty suspicious to an outside observer.

          They need SOME proof of value or even that the box was actually full to fight this type of fraud, and the
          • by ari_j ( 90255 )
            Customers in general ought not to be held to know FedEx's corporate structure. I did indeed use the Newegg-provided label. As to my prior shipment broken by UPS, of course I realize that there is the potential for scams. I was shipping Christmas presents to myself because it was cheaper and, on average, safer than trying to check them on my return flight. See my other replies in this thread for more on the FedEx $100 insurance situation.
            • Customers in general ought not to be held to know FedEx's corporate structure.

              I don't know if, in this age, this is wise. With so many corporations buying up major parts of our lives like food, communications, salaries, and transportation, I would challenge you to take a look at the structure of the different entities that affect you daily. The unfortunate fact is that every decision you make needs to be researched to find the most appropriate course of action based on who is behind the marketing. Su
              • by ari_j ( 90255 )
                I would agree with you, except that I don't think that the average consumer should be held to that level of sophistication. This is mostly a cheaper cost avoider issue, for me. Who can more efficiently discover the relevant information? Clearly, the answer here is FedEx.
        • We ended up buying a bunch of these to ship the arrays around in. Cardboard == bad :-)
    • Re: (Score:3, Insightful)

      by dmayle ( 200765 )

      I remember an article I read on this, I think back in the year 2000. There was a research scientist who built a standardized platform (that is to say, a specific PC case with a certain number of hard drive bays, and certain network cards) so that he could exchange data with other universities. They would fill up the data on the networked PC, and they could ship it to any of the participating projects, knowing that they'd get back the same hardware in return.

      I remember at the time thinking it was just one of

      • Re: (Score:3, Insightful)

        by BWJones ( 18351 ) *
        Yeah, there have been a number of folks using variations on this theme for a while now. It's been interesting that network performance really has not followed the same performance curve as storage and CPU throughput. Add to that the growing amount of data being pushed through "consumer" pipes from people obtaining broadband and pushing sources such as YouTube and company and you have the makings for a bandwidth crunch. This of course is the reason for separate academic and government Internet paths, but
        • In fact, at some universities engaging in data intensive projects, it is not uncommon for them to occupy the entire bandwidth of the university in off hours to transfer data around the country to various collaborators.

          Even using the full bandwidth between Internet2 connected Unis, it would still take 2~3+ days to transfer 250Tb of data.

          10Gb/s is close to the max you can do with one frequency. That will all change once they start pumping multiple colors down their fiber. Their bandwidth will explode & Go
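The 2~3+ day estimate above can be reproduced with simple arithmetic. This is a minimal sketch, assuming the figure means 250 terabytes (decimal, not terabits) and a single 10 Gbit/s link kept fully saturated; `transfer_days` and its `efficiency` parameter are hypothetical names introduced only for this illustration.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, link_bits_per_second, efficiency=1.0):
    """Days needed to push size_bytes through a link.

    efficiency is the fraction of the nominal link rate actually
    achieved (1.0 means a perfectly saturated link, an assumption).
    """
    seconds = size_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    size = 250 * 10**12   # 250 TB, read as decimal terabytes
    link = 10 * 10**9     # 10 Gbit/s
    print(f"saturated link: {transfer_days(size, link):.2f} days")
    print(f"at 70% efficiency: {transfer_days(size, link, 0.7):.2f} days")
```

If the figure were instead read literally as 250 terabits, the same link would need only about 7 hours, so the multi-day estimate only works out for terabytes.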

        • by jcnnghm ( 538570 )
          Internet bandwidth hasn't kept up, but local bandwidth definitely has. My network throughput is more than capable of transmitting data faster than my hard drives are able to write it. And I wouldn't even agree about the net bandwidth. I have a 15mb connection where I used to have a 56k.
      • by chrisd ( 1457 ) *
        That might have been Queue's interview with Jim Gray? Check it out here: a=showpage&pid=43 []


    • Re: (Score:3, Informative)

      by Agent Orange ( 34692 )
      Yup. There was a paper a few years back entitled "terascale sneakernet", by jim gray and a couple of guys at MSFT research division on this. You can find it in the arxiv [].

      This concept has also been applied to such things as the Sloan Digital Sky Survey []. Astronomers do tend to generate a lot of data with large surveys such as this.
    • As the old joke goes, never underestimate the bandwidth of a station wagon full of magnetic tapes. Or a FedEx plane full of hard drives. Your choice.
    • We have been sending two DVDs, with about 6-8 GB data, around every month for updates. Now we are trying rsync, which in our view has been more convenient.
      • by Laur ( 673497 )

        We have been sending two DVDs, with about 6-8 GB data, around every month for updates. Now we are trying rsync, which in our view has been more convenient.

        The article and the GP are about sending large amounts of data, as in terabytes. In this discussion, 8 GB is tiny, and is easily downloaded much faster than even express mail. Besides, rsync won't really help if all your data is unique (such as astronomical data). Rsync really helps when very little of your data set changes between updates, such as ba

    • Wait, I'm confused, what happened to the tubes?
    • by Duncan3 ( 10537 )
      Shhhh... It's Google we're talking about, THEY came up with this groundbreaking shift in how data is handled.

      Praise the Google, don't point out they are just doing the same thing as everyone else.

      Google is watching.

  • But are they using station wagons?
  • Never underestimate the bandwidth of a station wagon... []

    Still very much applies today.

    Ryan Fenton
    • The page you linked to had a smart idea. Rather than just have the raw disks, create some sort of architecture inside to allow for rapid transmission of the data from the vehicle upon arrival. I could see specialized vehicles that have been hardened against an accident with an inverter to power the drives that have external fiber optic ports hooked up to massive, high speed RAID arrays to rapidly dump the contents to another system at the location and upload content for the next destination.

      Then a GPS syst

  • How long do you think it will be until some maroon somewhere plunks a hard drive into an unpadded envelope and drops it in the big blue mailbox on the corner?
  • so.. (Score:3, Interesting)

    by mastershake_phd ( 1050150 ) on Wednesday March 07, 2007 @12:08PM (#18262718) Homepage
    Who's going to own the data? I hope Google isn't going to say they do, like they want to with the old books they're scanning. Every time you download a Hubble picture, will it have a Google watermark?
    • Re: (Score:2, Flamebait)

      Who's going to own the data?

      As always the people of the world own the data. The copyright holders are, however, given a short term monopoly on making copies of it, with certain exceptions.

      I hope Google isn't going to say they do, like they want to with the old books they're scanning.

      Google has not, as far as I know, claimed "ownership" or even copyright on anything they've scanned. They have, however, created their own database of metadata about the works, which they use to enable people to more easily find specific items in the original data.

      Every time you download a Hubble picture, will it have a Google watermark?

      Umm, maybe. Why do I care if they add watermarks to it? If they are in the way

      • Every time you download a Hubble picture, will it have a Google watermark?

        Umm, maybe. Why do I care if they add watermarks to it?

        Because there's no water in space! Obviously then there shouldn't be any marks indicating as such on a picture the Hubble telescope took!
    • Re: (Score:3, Interesting)

      by cfulmer ( 3166 )
      The ownership of data is presumably a case-by-case thing that depends on what the data is and how it was acquired.

      For example, Google does not own the copyright on out-of-copyright books that it scans in (nobody does, by definition.) At best, it might own the copyright on the scan that it did, but that's really unlikely--copyright protects creative expression and a straight scan doesn't add any.

      However, they probably have some rights under unfair competition law because they have gone through a lot of work
      • Re: (Score:3, Informative)

        by oneiros27 ( 46144 )

        So, if Google takes the raw data and does that color assignment itself, well, the result is theirs.

        I'm not so sure that the result is theirs, necessarily. They'd need to properly attribute it. Many science archives have rules about how to properly attribute their work.

        Don't get me wrong -- many of the scientists want people to use their data (eg, see The Astronomer's Data Manifesto []), but they also want to know who's using it, because it's how they justify the value of their projects, and the costs incurr

        • by cfulmer ( 3166 )
          Attribution is different from copyright. For example, say you have a novel scientific idea which you write about in some scientific journal and that I read your article and publish my own article, using your idea without attribution.

          Now, what I've done would reasonably upset you, but there is no law (at least in the US) that requires me to attribute your ideas to you. In fact, under those facts, I completely own the copyright in my article and you have no legal remedy. Now, there may be repercussions--I
  • by boyfaceddog ( 788041 ) on Wednesday March 07, 2007 @12:09PM (#18262736) Journal
    The bandwidth of a moving van full of disks.

    Looks like Google is hoarding data. Seems they at least are equating information with power and money. And them that has the power and money makes the rules.
  • by Anonymous Coward on Wednesday March 07, 2007 @12:12PM (#18262784)
    Moe: Say, Barn, uh, remember when I said I'd have to send away to NASA to calculate your bar tab?
    Barney: Oh ho, oh yeah, you had a good laugh, Moe.
    Moe: The results came back today. (reading a printout) You owe me seventy billion dollars.
    Barney: Huh?
    Moe: No, wait, wait, wait, that's for the Voyager spacecraft. Your tab is fourteen billion dollars.
  • Hubble Data (Score:2, Funny)

    by Ikyaat ( 764422 )
    120 TB of data from the Hubble telescope? I wish I was paid to go through that. And this picture is of and this one is a star And a star another star OMG its a FRICKIN STAR
  • SUVs to transport those hard drives. That would be evil.
  • I don't know what the article title conjured up in your head, but when I saw:

    Google's Academic TB Swap Project
    ...the first thing I thought was "why are they swapping around samples of a dangerous infectious disease like tuberculosis?"
    • ...the first thing I thought was "why are they swapping around samples of a dangerous infectious disease like tuberculosis?"

      I'm glad I wasn't the only one!
  • Don't say I didn't warn you guys about this "don't be evil" thing. First they start swapping TB for "academic" purposes, then maybe some avian influenza in some apartments around Mountain View, and next thing you know, there'll be a smallpox outbreak and we will coincidentally receive advertisements on Gmail that we can buy the cure for a few thousand dollars from one of their AdSense "partners."
  • One terabyte is equivalent to 1,000 gigabytes.

    Hey, where do you think you are? This is Slashdot! Everyone knows that! What people here want to know is how many Libraries of Congress that makes...

    The only thing you're getting by saying that is a flamewar between 10 kinds of people: those who count only in MB (and disagree with you) and those who count in both MB and MiB (and agree with you)!

    For my take on the issue, see this previous post [] of mine.

    • Actually it's 1024 gigabytes using binary units (base 2), we use binary units because formatted capacity is measured in binary units. For example: 1 Exabyte = 1(1024) Petabytes = 1(1024)(1024) Terabytes = 1(1024)(1024)(1024) Gigabytes and so on... The formula to convert si units into binary units is si_unit * (125/128) which comes out to 0.9765625. For example: a 750GB hard drive is 750(125/128) = 732.421875 Gigabytes. Also don't forget reserved space... On FreeBSD it's 8% of the format capacity, so 732.421875
      • by alexhs ( 877055 )

        we use binary units because formatted capacity is measured in binary units.

        It seems you haven't read my previous post I was linking to. Please do :)
        Your affirmation is wrong. The correct affirmation would be "we use binary units because some OSes report formatted capacity in binary units".

        Proof I've read your post in its entirety is that I was going to write "MS Windows" (like I did in the aforementioned post) instead of "some OSes" :) . My server at home is a FreeBSD, I launched fdisk and it reports size in "Meg", neither MB nor MiB. So I can't say :) What command did you ente

  • "The moral of the story is: Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway."

    -Andrew Tanenbaum
  • Mr Dibona, who is a long-standing Linux evangelist, said: "I am comfortable with where Google is operating. People are often upset and feel we should be releasing more.

    "And I agree; I would love to release more. It's more a function of engineering time, than it is a function of desire."

    I call B.S. "Lack of engineering time" is why we haven't seen the source to the core search engines or gmail?

  • I've been thinking that the only home use for lots of HD storage space would be A/V. Now, I guess when 10 PB of HD are $100-1120, then we'll be able to get copies of these 120 TB of Hubble data or TBs of other datasets to fill up those future home PB HDs. One day we'll need home exabyte HDs to store and play around with public PB datasets.

    I can only hope that bandwidth can keep up. How long would it take to transfer a 120 TB bit torrent file over either cable or dsl?

    Well, maybe we'll have small TB USB flashd
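As a rough answer to the 120 TB question, the sketch below compares a download against shipping the drives. The numbers are assumptions, not measurements: a 15 Mbit/s consumer link (a figure another commenter in this thread quotes for their own connection) and a 24-hour courier transit time chosen purely for illustration. The function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def download_days(payload_bytes, link_bits_per_second):
    """Days to download payload_bytes over a link running flat out."""
    return payload_bytes * 8 / link_bits_per_second / SECONDS_PER_DAY


def sneakernet_bandwidth(payload_bytes, transit_hours):
    """Effective bits per second of a shipment, ignoring copy time."""
    return payload_bytes * 8 / (transit_hours * 3600)


if __name__ == "__main__":
    hubble = 120 * 10**12   # 120 TB, using 1 TB = 10^12 bytes
    home_link = 15 * 10**6  # assumed 15 Mbit/s cable/DSL link
    print(f"download: {download_days(hubble, home_link):.0f} days")
    gbps = sneakernet_bandwidth(hubble, 24) / 10**9
    print(f"overnight courier: {gbps:.1f} Gbit/s effective")
```

The shipment's effective bandwidth ignores the time needed to copy the data onto and off the drives, so it is an upper bound rather than an end-to-end figure.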
  • ...that a researcher sends them all the printouts of his/her data... on greenbar...

  • ...what does this new P2P technology mean for me? I guess the RIAA is really in for it now.
  • ...why not tapes? (Score:4, Interesting)

    by Penguinisto ( 415985 ) on Wednesday March 07, 2007 @01:18PM (#18263766) Journal
    I understand the whole "HDD w/ a common filesystem = more compatibility" thing, but wouldn't it be easier to simply send along some tapes of a type appropriate to the format/type that the scientific institution uses? LTO-3 can do 800GB compressed, SDLT can do up to 600... and neither is susceptible to data loss when it gets bounced too hard by FedEx/UPS/DHL/Whatever. (plus it would make for a lighter package, wouldn't require some poor IT schmuck to disassemble a server or wait forever for USB to transfer all of it, etc...)

    I'm not criticizing or anything; just curious is all.


    • 1.3Tb each or so. About $150,000. the drive is about $5500. $155,000 in total. A 750Gb hard disk costs about $1000. so it'd cost about $160k to do the same with hard disks.

    • by Laur ( 673497 )

      wouldn't it be easier to simply send along some tapes of a type appropriate to the format/type that the scientific institution uses?

      There are basically two reasons one would choose to use HDDs over tapes: compatibility and price.

      Compatibility: Sure, one scientific institution may have standardized on a specific type of tape, but what about all the rest? Pretty much everyone in the world can read a standard HDD formatted with a well known filesystem.

      Price: what is the cost of HDDs vs. tapes per gigabyt

    • Re: (Score:2, Interesting)

      by kulover ( 967626 )
      The reason for not using tapes is exactly because of the compression. The time it takes to compress that data and then send the data to the tape takes a lot of time. That same process would have to be repeated on the other end.

      Besides, using HDD for transfer means immediate access to the same data on the other end with speeds that are unmatched with tape backup systems. It might also be worthy to note that data sets that large usually are stored on large RAID systems like this one from LSI Logic, http []
    • Re: (Score:3, Interesting)

      by K8Fan ( 37875 )

      The "TeraScale SneakerNet" paper posted earlier [] anticipates and answers that. They ship a fully assembled computer with processor, RAM, OS and network interface. Plug it in to the wall, plug it in to the network and assuming you had previously agreed on a networking protocol, you're rolling as soon as it boots! No restoration, no decompressing, immediate access to the data.

      Does anyone have a Linux distro for this specific purpose? Preferably tiny enough to fit onto a USB key and optimized for bandwidth, p
