“In our research we develop the software, tools and methods for digital musicology, combining computational analysis and cultural informatics to further the exploration and understanding of music collections. The Great 78 Project is an exciting new resource, since it not only archives a large corpus of important historical music which would otherwise be lost, but does so digitally, enabling us to apply the latest in computing techniques to the study of its high-quality audio files. Digital Musicology thrives on large-scale audio collections, and improving our computational methods is best achieved when the community has free and unfettered access to the data, enabling us to collaboratively test and refine our approaches.”
Dr. Kevin Page, Transforming Musicology project, University of Oxford e-Research Centre

Internet Archive Special Collections Room

Four-tonearm 78rpm turntable at George Blood, LP

We are preserving physical 78rpm discs and digitizing them to ensure the survival of their content. Accessioning collections means working with collectors and archives to understand and record the history of the collection and its physical requirements. We either pack the materials ourselves or carefully instruct donors how to pack them for transportation. The discs are received in a variety of conditions and from locations all over the world, and they are often re-sleeved and packaged carefully for shipping to the Internet Archive’s physical archive.

We protect sleeved 78s with clear plastic archival 10″ outer sleeves, and unsleeved 78s are placed in clean paper sleeves. We wrap every five 78s in bubble wrap, then place 4 packages of bubble-wrapped 78s into a white clamshell box padded with even more bubble wrap. We place 4 white clamshell boxes into each 13″ square box with extra padding around the boxes and secure it with tape. For transportation and long-term storage, the 13″ boxes are stored on pallets in 3 layers with 9 boxes per layer.
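The packing scheme above implies a fixed capacity per box and per pallet. The short sketch below multiplies out the quantities given in the text; it assumes every bundle, clamshell, box, and layer is completely full, so the results are upper bounds rather than actual counts. The constant and function names are illustrative only.

```python
# Packing quantities as stated in the text.
DISCS_PER_BUNDLE = 5        # 78s wrapped together in bubble wrap
BUNDLES_PER_CLAMSHELL = 4   # bubble-wrapped packages per white clamshell box
CLAMSHELLS_PER_BOX = 4      # clamshell boxes per 13-inch square box
BOXES_PER_LAYER = 9         # 13-inch boxes per pallet layer
LAYERS_PER_PALLET = 3       # layers per pallet


def discs_per_box() -> int:
    """Maximum number of discs in one fully packed 13-inch box."""
    return DISCS_PER_BUNDLE * BUNDLES_PER_CLAMSHELL * CLAMSHELLS_PER_BOX


def discs_per_pallet() -> int:
    """Maximum number of discs on one fully loaded pallet."""
    return discs_per_box() * BOXES_PER_LAYER * LAYERS_PER_PALLET


if __name__ == "__main__":
    print(discs_per_box())     # 80
    print(discs_per_pallet())  # 2160
```

Under these assumptions a full pallet holds at most 2,160 discs, or 4,320 sides.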

We have experienced very little breakage with this system. Read more about how we handle 78rpm discs.

Digitization is done by the George Blood, LP company on special turntables.


Packing 78s


Because 78s are so fragile, the best way to make the content available for research and discovery is to digitize them before packing them away for long term storage.

Many of the discs digitized for The Great 78 Project are done by George Blood, LP. We have worked together to make sure the physical discs are treated with care, the discs are photographed, metadata is keyed in from the disc, and the cleanest possible recordings are created with four different styli. After digitization, the files are uploaded to the Internet Archive and the discs are carefully packed and transported to the Internet Archive’s physical archive for long-term storage and preservation.

Over a 2 year period we digitized 10,000 78rpm sides for the Library of Congress, Thomas Edison National Historical Park, and New York University. Our digitization workflows capture a wide range of metadata, from PREMIS events to technical metadata that conforms to AES-57. Having collected a large quantity of data in a systematic, standards-based form, we were able to analyze the data. Our findings have been presented at the Association for Recorded Sound Collections in 2015, at the Joint Technical Symposium in Singapore in 2016, and are published by Indiana University Press. [insert 78project link]

Brewster Kahle saw my presentation at ARSC. We chatted in the hallway after my presentation. He asked about scaling the work to 100,000 discs. We quickly agreed enormous economies were possible when working at that scale. Thus was born the Great 78 Project.

The first requirement was to maintain the quality standards we had delivered to the demanding specifications of the institutions we were already serving: 400ppi TIFF images with 4 FADGI stars, and 96kHz/24bit digitization following best practices in audiovisual preservation as established by the International Association of Sound and Audiovisual Archives (IASA) in TC-03 and TC-04. (Full disclosure: I serve on the IASA Technical Committee.)

One solution to digitize 400,000 in 24 months would be to do it the same way we have been doing it, but just buy more equipment and hire more people. This is a bad solution. More than 240 years ago Adam Smith in The Wealth of Nations [insert link] wrote that by dividing production into many small steps each worker would be more efficient. The worker would gain greater depth of expertise in their step, which would yield higher quality and fewer errors. This idea doesn’t work with 1,000 discs because there isn’t enough time for each person to develop the requisite skills. At 100,000 or more it works very well.

The same thing applies to the hardware. A Keith Monks Record Cleaning Machine costs $5,000. If you digitize 10,000 sides, the hardware costs 50¢ per side ($5,000/10,000). If you digitize 100,000 sides it’s a nickel, and at 400,000 the cost falls to about a penny! That is a savings of more than 95%. If you give the technician who is cleaning the discs more Keith Monks machines, he can clean more discs at the same time. Depending on the condition and size of the discs, the sweet spot is 3 machines. That makes the technician 3x more efficient, reducing the labor cost by more than 60% (even as it increases the hardware cost to between 3¢ and 4¢ per side because there is 3x as much hardware).
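The arithmetic in this paragraph can be checked with a short sketch. It uses only the figures quoted in the text ($5,000 per machine, 10,000 to 400,000 sides, up to 3 machines) and assumes the machine cost is spread evenly over every side with no maintenance or consumables. The function names are illustrative.

```python
def hardware_cents_per_side(machine_cost_dollars: float, sides: int,
                            machines: int = 1) -> float:
    """Hardware cost in cents per side, amortized evenly over all sides."""
    return machine_cost_dollars * machines * 100 / sides


def labor_reduction(throughput_factor: float) -> float:
    """Fractional drop in labor cost per side when one technician
    processes `throughput_factor` times as many sides in the same time."""
    return 1 - 1 / throughput_factor


if __name__ == "__main__":
    for sides in (10_000, 100_000, 400_000):
        print(sides, hardware_cents_per_side(5_000, sides))
    # Three machines at the 400,000-side scale.
    print(hardware_cents_per_side(5_000, 400_000, machines=3))
    print(round(labor_reduction(3) * 100, 1))
```

With one machine the cost per side is 50¢, 5¢, and 1.25¢ at the three scales. With three machines at 400,000 sides it is 3.75¢, while the labor cost per side drops by about two thirds, roughly in line with the rounded figures quoted above.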

Time and motion studies are a classic way in which efficiencies are discovered and measured. How far does the disc travel down the hall from cleaning to digitization? How can we reduce the number of times the disc is inserted into and removed from the sleeve? When does it make sense to group tasks by disc versus doing a task in batches? Does it make more sense to complete and upload the file sets one at a time, or to deliver them to the Internet Archive in batches on a hard drive? If so, how often?

The discs arrive at our facility in VERY large batches. Shellac discs are heavy, making shipping them expensive. Palletizing the discs reduces the handling and means the discs won’t be on conveyor belts when shipped. This simplifies the packing. By filling a truck, the discs travel door to door at significantly reduced cost, in a shorter time, and with much less handling. Shipping just a few pallets means the discs will be loaded and unloaded multiple times as the shipping company maximizes how full the truck is at each leg of the journey.

The discs arrive sorted by genre, re-sleeved, and curated for condition and content by B George and his team. A high level de-duping is done at this stage, too. The boxes are labeled, numbered, and a QR code affixed, following the system the Internet Archive developed for book scanning. This means discs flow right into production.

A technician works in lots of “one box”. Although “one box” doesn’t contain a fixed number of 78s, it limits the handling of the boxes and gives a high level structure for how materials move between work stations. A stack of discs is removed from the box. A bar code is affixed for each side. The bar code is structured to carry information while avoiding collisions within the very large number of items in the Internet Archive. It conveys relationships between multiple sides, albums, the images, etc.

The bar code is scanned. Each technician handles one or more tasks. Based on a login, the workflow database knows who is working and the task to be performed. When the barcode is scanned, the database logs who performed which task and when. This helps with QC feedback, as well as process development, because we know how long each task takes.
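The logging described above can be pictured with a small in-memory sketch. It is not the project's actual workflow database: the class, method names, barcode strings, and the idea of using the gap between consecutive scans as a proxy for task duration are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ScanEvent:
    barcode: str
    technician: str
    task: str
    scanned_at: datetime


class WorkflowLog:
    """In-memory stand-in for a workflow database (hypothetical design)."""

    def __init__(self) -> None:
        self.events: List[ScanEvent] = []
        self._current_task: Dict[str, str] = {}  # technician -> task chosen at login

    def log_in(self, technician: str, task: str) -> None:
        """Record which task this technician performs from now on."""
        self._current_task[technician] = task

    def scan(self, technician: str, barcode: str,
             when: Optional[datetime] = None) -> ScanEvent:
        """Log who performed which task on which side, and when."""
        if technician not in self._current_task:
            raise ValueError(f"{technician} is not logged in")
        event = ScanEvent(barcode, technician,
                          self._current_task[technician],
                          when if when is not None else datetime.now())
        self.events.append(event)
        return event

    def history(self, barcode: str) -> List[ScanEvent]:
        """All logged events for one barcode, in scan order."""
        return [e for e in self.events if e.barcode == barcode]

    def scan_intervals(self, technician: str, task: str) -> List[timedelta]:
        """Gaps between consecutive scans, a rough proxy for time per item."""
        times = sorted(e.scanned_at for e in self.events
                       if e.technician == technician and e.task == task)
        return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    log = WorkflowLog()
    log.log_in("tech_a", "cleaning")           # hypothetical technician and task
    start = datetime(2017, 1, 1, 9, 0)
    log.scan("tech_a", "SIDE-0001", start)     # hypothetical barcode values
    log.scan("tech_a", "SIDE-0002", start + timedelta(minutes=4))
    print(log.scan_intervals("tech_a", "cleaning"))
```

Because the task comes from the login rather than from the scan itself, the technician only has to scan the barcode; the same event records serve both QC feedback (the history of one side) and process timing (the intervals per task).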

The first technician affixes the barcode, scans the barcode, then cleans the disc, using up to three Keith Monks machines as described above. The workstations and workflow are carefully designed for ergonomics and to keep the sleeve with the barcode with the disc. The technician who performs most of the cleaning is rather tall and the workstation is built around his physique. We’ve made parts of the workstation easily adjustable for when someone else does cleaning. Everyone working on the project is being cross-trained. This helps balance the workload and provides cover when someone is absent or when work in one area is stopped. Work may be stopped for maintenance, R&D, reconfiguration such as for ergonomics, etc. This step takes a very different amount of time than the digitization. It would be ideal to take a cleaned disc and put it directly onto a turntable. However, this doesn’t work well in practice for many reasons, such as timing, noise, and physical layout.

Bar coded and cleaned discs are put back in the boxes. The stacking order and organization have been defined to simplify handling at the digitization stage.

A stack of discs is removed from the box and placed into a bin. The bin allows easy access to the discs and is at eye level for the convenience of the digitization engineer. A disc is selected and the bar code scanned. This tells the workflow database who is digitizing which disc. The barcode is used to create the file name used for the image and the digitization. The disc is placed on the turntable. The high resolution camera is mounted directly above the spindle. In this way the disc is always centered in the image, always the exact same distance from the camera, and can easily be rotated so the text is straight. The image is captured and cropped to include the label and the lead-out area, so that the matrix number is displayed.

During the 78rpm era there were no standards for speed, stylus size, or record/playback equalization. Within the trade there is broad agreement that optimizing playback requires both knowledge of the documentation that’s available on these parameters for each label over time, and some amount of judgment. There are many reasons why judgment is necessary. One reason is that the disc may be worn from being played many times with the correct stylus size. Better results may come from using a different (“the wrong”) size stylus because it sits in a portion of the groove that is in better condition. But there’s no free lunch. Using a smaller size may mean a noisier transfer as it plays a less cleanly molded part of the disc. Using a larger size may increase tracing distortion, because the larger stylus does not fit all the way to the bottom of the smaller grooves of higher frequencies. Another example is a disc that was recorded slower than normal so that a longer work would fit within the limited recording time available on a 78rpm side. When played at the “normal” speed, the piece will play too fast. The same may be true of the record/play equalization. Maybe the original engineer or producer deliberately used the “wrong” settings so that when the disc was played with the “correct” settings it sounded punchier, or smoother, or warmer, or whatever effect they were trying to create. Now we’re into “correct” versus “original intent”. Professionals argue over whether the goal is “correct” or “sounds best”. We also argue over the meaning of correct and sounds best!

Some things you can know, some things you can probably get right but won’t know for sure, and others are simply a matter of taste.

In my JTS paper I show that there is substantial doubt about speed. Many factors interact, including performance practice (what was standard pitch at that time and location, did the musicians play a little flat so the high notes were easier to reach, etc.) and the limits of the recording and playback technologies of the time (recording turntables might have been weight driven, phase lock loops didn’t exist, etc.). As shown by the data we collected and analyzed, while one might be able to make a convincing case for a given side, overall it is not possible to know with certainty what the correct speed is. Therefore, we’ve elected to transfer all the discs at 78.26 rpm. This has been a common solution in the trade for decades. The listener or next user of the files can change the speed based on their priorities and judgment.
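For a user who does want to change the speed later, the relationship between turntable speed and pitch is simple arithmetic. The sketch below assumes the user has decided on a speed for a given side (the 80 rpm used in the example is purely illustrative, not a recommendation) and that the file was captured at 96 kHz as described elsewhere in this text. The function names are hypothetical.

```python
import math

TRANSFER_RPM = 78.26  # speed used for every transfer, per the text


def speed_ratio(assumed_rpm: float, transfer_rpm: float = TRANSFER_RPM) -> float:
    """Factor by which playback must be sped up (>1) or slowed down (<1)."""
    return assumed_rpm / transfer_rpm


def pitch_shift_cents(assumed_rpm: float,
                      transfer_rpm: float = TRANSFER_RPM) -> float:
    """Pitch change in cents (100 cents = 1 semitone) caused by the speed change."""
    return 1200.0 * math.log2(speed_ratio(assumed_rpm, transfer_rpm))


def reinterpreted_sample_rate(assumed_rpm: float,
                              file_sample_rate_hz: float = 96_000.0,
                              transfer_rpm: float = TRANSFER_RPM) -> float:
    """Sample rate at which to play the unchanged samples so that speed and
    pitch shift together, as they would on a turntable."""
    return file_sample_rate_hz * speed_ratio(assumed_rpm, transfer_rpm)


if __name__ == "__main__":
    rpm = 80.0  # illustrative assumption for one side
    print(speed_ratio(rpm))
    print(pitch_shift_cents(rpm))
    print(reinterpreted_sample_rate(rpm))
```

For an assumed 80 rpm the correction is a little over 2% faster, which raises the pitch by roughly 38 cents, somewhat more than a third of a semitone.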

A similar set of issues surrounds stylus size. There is substantial evidence to support stylus size selection. This information is the first stop. As each stylus size and shape will sound different, a judgment is made regarding what sounds best. Marcos Suieros has presented at IASA and ARSC on his work in this area. It is cited in my paper. Marcos carefully transferred half a dozen different discs in different genres, with different stylus sizes. He had a colleague rename the files to obscure their origin. The files were shared with professionals in the field, who each expressed their preference. Two of the participants, Marcos and I, took the test twice. Not only is there no agreement on which stylus sounded best, the distribution of choices is random. Even the two engineers who took the test twice didn’t choose the same stylus size both times!

Unlike speed and equalization, the physical interaction between the stylus and disc means you cannot change anything about the choice later. For this reason we have chosen to deliver 4 different stylus sizes. There are many positive outcomes from this decision. The casual listener can explore the differences and learn. Since the different sizes are captured at the same time with matching signal chains, it becomes possible to edit between the different stylus sizes, choosing the best for each moment in the performance. In this case best may be lower distortion on instrument X, different sound due to inconsistent pressing or wear, lower noise or better noise reduction possible with different sizes, and so on. This feature, the capture of multiple stylus sizes, is unique to the Internet Archive’s Great 78 Project!

For the selection of stylus, the digitization engineer checks the label against a table of known or generally agreed stylus sizes. This size is placed in one tonearm and auditioned. A few different sizes (how many is at the discretion of the engineer) are auditioned to find the one that “sounds best”. Three other sizes are in the three other tonearms.

For equalization, the engineer again checks the label against a table of known or generally agreed settings, listens with the preferred stylus size for any tweak they judge will make it sound better, and documents the playback equalization choice. While it is possible to un-EQ a sound file, we also deliver the files with no EQ. These generally sound terrible, which is not surprising because discs were never meant to be heard that way. However, it does allow easy re-EQ by the user in the future.

This means we deliver both groove walls of 4 different stylus sizes with and without EQ for a total of 16 channels of audio. The most comprehensive presentation of 78rpm discs ever!
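The count of 16 follows from multiplying the three choices: 4 stylus sizes, 2 EQ states, and 2 groove walls. A minimal sketch that enumerates the combinations is shown below; the labels are placeholders invented for illustration and do not reflect the project's actual file or channel naming.

```python
from itertools import product

# Placeholder labels (assumptions, not the project's real naming).
STYLI = ("stylus_1", "stylus_2", "stylus_3", "stylus_4")
EQ_STATES = ("flat", "equalized")
GROOVE_WALLS = ("wall_a", "wall_b")


def channel_layout():
    """Every (stylus, EQ state, groove wall) combination delivered per side."""
    return list(product(STYLI, EQ_STATES, GROOVE_WALLS))


if __name__ == "__main__":
    layout = channel_layout()
    print(len(layout))  # 16
    for stylus, eq, wall in layout:
        print(stylus, eq, wall)
```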

Enhancing Metadata for Description and Findability

Transforming these recordings into meaningful components of the World Wide Web has started by enhancing the metadata, linking from these recordings to other resources, and linking from other resources to these recordings.

Metadata has been augmented by librarians and volunteers in a project coordinated by the Internet Archive. People start by writing information into the reviews of the 78s, including where they got the information, for instance from sources such as DAHR or WorldCat. From there, administrators promote the information into the descriptions or fields, such as genre and release date, and then often promote the volunteers to do it themselves. The Internet Archive has written code to mine recording dates from other sites. Please help in this effort.

Linking from these recordings to other resources has already started using acoustic techniques, for instance to link to MusicBrainz’s AcoustID system. Volunteers are also hand matching discs to Billboard Magazine articles that came out when the discs were released, and to subject-specific resources such as Hillbilly music. There are also links to Wikipedia pages for performers and recordings, and links into digitized books on the web. We see this collection growing to be a resource that brings researchers to the breadth of information about the era and customs.

Linking other resources, such as Wikipedia and discographies, to these recordings is just starting, but we hope this proceeds as the collection grows more complete.