Comment In the Archival Trenches... (Score 5, Informative) 186
As a professional historian who has worked in the National Archives in College Park, MD and at four different presidential libraries, which incidentally are also managed by NARA, I need to interject that this is an immense costly but valuable project.
Remember "the warehouse" from the Indiana Jones movies? NARA is a little like that in terms of size but are better organized. Aisle upon aisle, shelf upon shelf, row upon row, room upon room, floor upon floor, building upon building of neatly indexed banker's boxes with labelled folders of documents. The labels may have been checked by the archivists at NARA, but they may also simply be the labels affixed to the records by the source federal agency. The individual documents in folders are almost never labelled. In the course of my work, I gathered 30k digital pictures of documents over the course of two months. The acquisition process sounds deceptively easy. Look in the index, find key words and request boxes from the archivist. Then you look through folders to locate individual documents. In point of fact, I probably visually scanned 3M pages to see if they were "interesting" and photo worthy for future research, usually taking only a few seconds per page to make a snap judgement. My decisions on which boxes of documents to request were far more time consuming. What is the right keyword for talking about computers in government in 1970? If you said "information automation" then you would be right. A few presidential (Ford especially) libraries have updated electronic files for indexing which is a huge advantage.
On my trips to the archives, it was interesting to see both professionals and amateurs using a range of technologies. I saw really old school researchers using 3x5 note cards and taking notes on legal pads. They sometimes supplemented their work by photocopying really important documents at $.75/copy. Some researchers avoided this cost by using flat bed scanners which they carried in with them. Still other researchers brought in high end digital cameras and tripods. I used a digital camera freehanded. All of these people still need to find a way to actually get to physical proximity with the records. Digitalization would open up a new era in research.
On the metadata issue, most of these records already have copious amounts of metadata recorded in well-established fields that are used by NARA.
On the OCR issue, some documents have hand-written notes on them which would not be machine readable and sometimes are not human readable. It is likely that the documents will have to be digitally scanned and flagged if handwriting is detected.
Making these records available to the general public would be a huge advantage to anyone interested in government and US history. Come to think of it, in terms of size and complexity, it would be a worthy challenge for Google. U.S. government documents run back to the founding of the country and the number of documents only increases over time.