“In our research we develop the software, tools and methods for digital musicology, combining computational analysis and cultural informatics to further the exploration and understanding of music collections. The Great 78 Project is an exciting new resource, since it not only archives a large corpus of important historical music which would otherwise be lost, but does so digitally, enabling us to apply the latest in computing techniques to the study of its high quality audio files. Digital Musicology thrives on large-scale audio collections, and improving our computational methods is best achieved when the community has free and unfettered access to the data, enabling us to collaboratively test and refine our approaches.”
—Dr. Kevin Page, Transforming Musicology project, University of Oxford e-Research Centre
The Internet Archive is preserving physical 78rpm discs and digitizing them to ensure the survival of their content. Accessioning collections means working with collectors and archives to understand and record the history of the collection and its physical requirements. The Internet Archive will either pack the materials itself or carefully instruct donors how to pack them for transportation. The discs are received in a variety of conditions and from locations all over the world, and they are often re-sleeved and packaged carefully for shipping to the Internet Archive’s physical archive.
We protect sleeved 78s with clear plastic archival 10″ outer sleeves, and unsleeved 78s are placed in clean paper sleeves. We wrap every five 78s in bubble wrap, then place 4 packages of bubble-wrapped 78s into a white clamshell box padded with even more bubble wrap. We place 4 white clamshell boxes into each 13″ square box with extra padding around the boxes and secure it with tape. For transportation and long term storage, the 13″ boxes are stored on pallets in 3 layers with 9 boxes per layer.
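The packing quantities above multiply out to a capacity per box and per pallet. The following is a minimal sketch of that arithmetic, assuming every bundle, clamshell, box and pallet layer is completely full (in practice the counts vary). The constant and function names are illustrative only and do not come from the Internet Archive.

```python
# Packing quantities taken from the description above.
DISCS_PER_BUNDLE = 5        # 78s wrapped together in bubble wrap
BUNDLES_PER_CLAMSHELL = 4   # bubble-wrapped packages per white clamshell box
CLAMSHELLS_PER_BOX = 4      # clamshell boxes per 13-inch square box
BOXES_PER_LAYER = 9         # 13-inch boxes per pallet layer
LAYERS_PER_PALLET = 3       # layers per pallet


def discs_per_box() -> int:
    """Discs in one full 13-inch box (assumes no partial packages)."""
    return DISCS_PER_BUNDLE * BUNDLES_PER_CLAMSHELL * CLAMSHELLS_PER_BOX


def discs_per_pallet() -> int:
    """Discs on one full pallet (assumes every box is full)."""
    return discs_per_box() * BOXES_PER_LAYER * LAYERS_PER_PALLET


def pallets_needed(total_discs: int) -> int:
    """Whole pallets needed for a collection, rounding up."""
    return -(-total_discs // discs_per_pallet())  # ceiling division


if __name__ == "__main__":
    print(discs_per_box(), discs_per_pallet(), pallets_needed(100_000))
```

Under these assumptions a full box holds 80 discs and a full pallet holds 2,160, so a collection of 100,000 discs would occupy 47 pallets.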
The Internet Archive has experienced very little breakage with this system. Read more on our handling of 78rpm discs.
Digitization is done by the George Blood, LP company on special turntables.
Because 78s are so fragile, the best way to make the content available for research and discovery is to digitize them before packing them away for long term storage.
Many of the discs are digitized for The Great 78 Project by George Blood, LP. Their processes include handling the physical discs with care, photographing the discs, keying the metadata from the disc, and creating the cleanest possible recordings with four different styli. After digitization, the files are uploaded to archive.org and the discs are carefully packed and transported to the Internet Archive’s physical archive for long term storage and preservation.
Over a 2 year period George Blood’s company has digitized 10,000 78rpm sides for the Library of Congress, Thomas Edison National Historical Park, and New York University. The digitization workflows capture a wide range of metadata, from PREMIS events to technical metadata that conforms to AES-57. Having collected a large quantity of data in a systematic, standards-based form, they have been able to analyze the data. The findings have been presented at the Association for Recorded Sound Collections in 2015, at the Joint Technical Symposium in Singapore in 2016, and are published by Indiana University Press.
Brewster Kahle saw George Blood’s presentation at ARSC. George remembers, “We chatted in the hallway after my presentation. He asked about scaling the work to 100,000 discs. We quickly agreed enormous economies were possible when working at that scale. Thus was born the Great 78 Project.”
The first requirement was to maintain the quality standards George Blood had delivered to the demanding specifications of the institutions they already serviced: 400ppi TIFF images with 4 FADGI stars, and 96kHz/24bit digitization following best practices in audiovisual preservation as established by the International Association of Sound and Audiovisual Archives (IASA) TC-03 and TC-04. (Full disclosure: George serves on the IASA Technical Committee.)
One solution to digitize 400,000 78s in 24 months would be to do it the same way George Blood has been doing it, but just buy more equipment and hire more people. This is a bad solution. Over 250 years ago Adam Smith in The Wealth of Nations wrote that by dividing production into many small steps each worker would be more efficient. The worker would gain greater depth of expertise in their step, which would yield higher quality and fewer errors. This idea doesn’t work with 1,000 discs because there isn’t enough time for each person to develop the requisite skills. At 100,000 or more it works very well.
The same thing applies to the hardware. A Keith Monks Record Cleaning Machine costs $5,000. If you digitize 10,000 sides, the hardware costs 50¢ per side ($5,000/10,000). If you digitize 100,000 sides it’s a nickel, and at 400,000 the cost falls to about a penny! That is a savings of more than 95%. If you give the technician who is cleaning the discs more Keith Monks machines, he can clean more discs at the same time. Depending on the condition and size of the discs, the sweet spot is 3 machines. That makes the technician 3x more efficient, reducing the labor cost by more than 60% (even as it increases the cost per disc to 3¢ because there is 3x as much hardware).
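The amortization argument can be checked with a few lines of arithmetic. This is a rough sketch that uses only the $5,000 machine price and the volumes given in the text. The function names are illustrative, and the model ignores maintenance, consumables and wages. Note that the text rounds the results to whole cents, while the unrounded figures at 400,000 sides are 1.25¢ for one machine and 3.75¢ for three.

```python
MACHINE_COST_USD = 5_000  # price of one record cleaning machine, from the text


def hardware_cents_per_side(sides: int, machines: int = 1) -> float:
    """Hardware cost per side in cents when the machines are amortized over `sides`."""
    return machines * MACHINE_COST_USD * 100 / sides


def labor_saving(speedup: float) -> float:
    """Fraction of labor cost saved when a technician is `speedup` times as productive."""
    return 1 - 1 / speedup


if __name__ == "__main__":
    for sides in (10_000, 100_000, 400_000):
        print(f"{sides:>7} sides: {hardware_cents_per_side(sides):.2f} cents per side")
    print(f"3 machines at 400,000 sides: {hardware_cents_per_side(400_000, 3):.2f} cents")
    print(f"labor saving at 3x throughput: {labor_saving(3):.0%}")
```

Tripling throughput saves two thirds of the labor cost per disc, which is the "more than 60%" cited above.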
Time and motion studies are a classic way in which efficiencies are discovered and measured. How far does the disc travel down the hall from cleaning to digitization? How can we reduce the number of times the disc is inserted and removed from the sleeve? When does it make sense to group tasks by disc versus doing a task in batches? Does it make more sense to complete and upload the file sets one at a time, or to deliver them to the Internet Archive in batches on a hard drive? If so, how often?
The discs arrive at George Blood’s facility in VERY large batches. Shellac discs are heavy, making shipping them expensive. Palletizing the discs reduces the handling and means the discs won’t be on conveyor belts when shipped. This simplifies the packing. By filling a truck, the discs travel door to door at significantly reduced cost, in a shorter time, and with much less handling. Shipping just a few pallets means the discs will be loaded and unloaded multiple times as the shipping company maximizes how full the truck is at each leg of the journey.
The discs arrive sorted by genre, re-sleeved, and curated for condition and content by Bob George (from the Archive of Contemporary Music) and his team. A high level de-duping is done at this stage, too. The boxes are labeled, numbered, and a QR code affixed, following the system the Internet Archive developed for book scanning. This means discs flow right into production.
A technician works in lots of “one box”. Although “one box” doesn’t contain a fixed number of 78s, it limits the handling of the boxes and gives a high level structure for how materials move between work stations. A stack of discs is removed from the box. A bar code is affixed for each side. The bar code is structured to carry information while avoiding collisions within the very large number of items in the Internet Archive. It conveys relationships between multiple sides, albums, the images, etc.
The bar code is scanned. Each technician handles one or more tasks. Based on a log in, the workflow database knows who is working and the task to be performed. By scanning the barcode the database logs who performed which task and when. This helps with quality control (QC) feedback, as well as process development, because we know how long each task takes.
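To make the logging idea concrete, here is a minimal sketch of how a scan could be recorded and later used to estimate task times. It is not George Blood's actual workflow database: the class names, field names, task labels and barcode strings are all hypothetical, and the gap between a technician's consecutive scans is used only as a rough proxy for how long a task takes.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class TaskEvent:
    """One barcode scan: who did which task to which item, and when."""
    barcode: str
    technician: str
    task: str
    timestamp: datetime


@dataclass
class WorkflowLog:
    """In-memory stand-in for a workflow database (hypothetical)."""
    events: list = field(default_factory=list)

    def scan(self, barcode: str, technician: str, task: str,
             when: Optional[datetime] = None) -> TaskEvent:
        """Record a scan; the technician and task would come from the login."""
        event = TaskEvent(barcode, technician, task,
                          when or datetime.now(timezone.utc))
        self.events.append(event)
        return event

    def history(self, barcode: str) -> list:
        """All events for one item, useful for QC feedback."""
        return [e for e in self.events if e.barcode == barcode]

    def gaps_seconds(self, technician: str) -> list:
        """Seconds between a technician's consecutive scans."""
        times = sorted(e.timestamp for e in self.events
                       if e.technician == technician)
        return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

A QC reviewer could call `history()` on a problem disc to see who handled it at each step, while `gaps_seconds()` gives a first approximation of throughput per technician.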
The first George Blood technician affixes the barcode, scans the barcode, then cleans the disc, using up to three Keith Monks machines as described above. The workstations and workflow are carefully designed for ergonomics and to keep the sleeve with the barcode with the disc. The technician who performs most of the cleaning is rather tall and the workstation is built around his physique. We’ve made parts of the workstation easily adjustable for when someone else does cleaning. Everyone working on the project is being cross trained. This helps balance the workload and provides cover when someone is absent or when work in one area is stopped. Work may be stopped for maintenance, R&D, reconfiguration such as for ergonomics, etc. This step takes a very different amount of time than the digitization. It would be ideal to take a cleaned disc and put it directly onto a turntable. However, this doesn’t work well in practice for many reasons, such as timing, noise, and physical layout.
Bar coded and cleaned discs are put back in the boxes. The stacking order and organization has been defined to simplify handling at the digitization stage.
A stack of discs is removed from the box and placed into a bin. The bin allows easy access to the discs, and is at eye level for the convenience of the digitization engineer. A disc is selected, and the bar code scanned. This tells the workflow database who is digitizing which disc. The barcode is used to create the file name used for the image and the digitization. The disc is placed on the turntable. The high resolution camera is mounted directly above the spindle. In this way the disc is always centered in the image, always the exact same distance from the camera, and can easily be rotated so the text is straight. The image is captured and cropped to include the label and the lead-out area, which displays the matrix number.
During the 78rpm era there were no standards for speed, stylus size, or record/playback equalization. Within the trade there is broad agreement that optimizing playback requires both knowledge of the documentation that’s available on these parameters for each label over time, and some amount of judgment. There are many reasons why judgment is necessary. One reason is that the disc may be worn from being played many times with the correct stylus size. Better results may come from using a different (“the wrong”) size stylus because it sits in a portion of the groove that is in better condition. But there’s no free lunch. Using a smaller size may mean a noisier transfer as it plays a less cleanly molded part of the disc. Using a larger size may increase tracing distortion, which results from the larger stylus not fitting all the way to the bottom of the smaller grooves of higher frequencies. Another example is a disc that was recorded slower than normal so a longer work would fit within the limited recording time available on a 78rpm side. When played at the “normal” speed, the piece will play too fast. The same may be true of the record/play equalization. Maybe the original engineer or producer deliberately used the “wrong” settings so when the disc was played with the “correct” settings it sounded punchier, or smoother, or warmer, or whatever effect they were trying to create. Now we’re judging between “correct” versus “original intent.” Professionals argue over whether the goal is “correct” or “sounds best”. We also argue over the meaning of correct and sounds best!
Some things you can know, some things you can probably get right but won’t know for sure, and others are simply a matter of taste.
In George Blood’s JTS paper he shows how there is substantial doubt about speed. Many factors interact, including performance practice – what is standard pitch at this time and location, did they play a little flat so the high notes were easier to reach, etc. – the limits of recording and playback technologies of the time – recording turntables might have been weight driven, phase lock loops didn’t exist, etc. – and how these factors interact. As shown from the data he collected and analyzed, while one might be able to make a convincing case for a given side, overall it is not possible to know with certainty what the correct speed is. Therefore, George Blood has elected to transfer all the discs at 78.26. This has been a common solution in the trade for decades. The listener or next user of the files can change the speed based on their priorities and judgment.
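For a listener who does want to change the speed afterward, the adjustment is a simple ratio against the 78.26 rpm transfer speed. The sketch below assumes the user has already decided what the original recording speed was. The 76 rpm value in the usage lines is a made-up example, not a claim about any particular disc, and the function names are illustrative.

```python
import math

TRANSFER_RPM = 78.26  # speed used for every transfer, per the text


def speed_factor(assumed_rpm: float) -> float:
    """Playback-rate multiplier to apply to the delivered file.

    A value below 1 slows the file down (disc was cut slower than 78.26 rpm).
    """
    return assumed_rpm / TRANSFER_RPM


def pitch_shift_semitones(assumed_rpm: float) -> float:
    """Pitch change, in equal-tempered semitones, caused by the correction."""
    return 12 * math.log2(speed_factor(assumed_rpm))


def corrected_duration(seconds: float, assumed_rpm: float) -> float:
    """Running time of the side after the speed correction is applied."""
    return seconds / speed_factor(assumed_rpm)


if __name__ == "__main__":
    rpm = 76.0  # hypothetical recording speed chosen by the listener
    print(f"rate x{speed_factor(rpm):.4f}, "
          f"{pitch_shift_semitones(rpm):+.2f} semitones, "
          f"180 s becomes {corrected_duration(180, rpm):.1f} s")
```

Because the correction is a pure ratio, it can be revised at any time without going back to the disc, which is part of why a single fixed transfer speed is workable.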
A similar set of issues surrounds stylus size. There is substantial evidence to support stylus size selection. This information is the first stop. As each stylus size and shape will sound different, a judgment is made regarding what sounds best. Marcos Suieros has presented at IASA and ARSC on his work in this area. It is cited in George Blood’s paper. Marcos carefully transferred half a dozen different discs in different genres, with different stylus sizes. He had a colleague rename the files to obscure their origin. The files were shared with professionals in the field who each expressed their preference. Two of the participants, Marcos and George, took the test twice. Not only is there no agreement on which stylus sounded best, the distribution of choices is random. Even the two engineers who took the test twice didn’t choose the same stylus size both times!
Unlike speed and equalization, the physical interaction between the stylus and disc means you cannot change anything about the choice later. For this reason George Blood has chosen to deliver 4 different stylus sizes. There are many positive outcomes from this decision. The casual listener can explore the differences and learn. Since the different sizes are captured at the same time with matching signal chains, it becomes possible to edit between the different stylus sizes, choosing the best for each moment in the performance. In this case best may be lower distortion on instrument X, different sound due to inconsistent pressing or wear, lower noise or better noise reduction possible with different sizes, and so on. The feature of multiple stylus sizes captured is unique to the Internet Archive’s Great 78 Project!
For the selection of stylus, the digitization engineer checks the label against a table of known or generally agreed stylus sizes. A stylus of this size is placed in one tonearm and auditioned. A few different sizes (how many is at the discretion of the engineer) are auditioned to find the one that “sounds best”. Three other sizes are in the three other tonearms.
For equalization, the engineer again checks the label against a table of known or generally agreed settings, listens to the preferred stylus size for any tweak judged to make it sound better, and documents the playback equalization choice. While it is possible to un-EQ a sound file, George Blood also delivers the files with no EQ. These generally sound terrible, which is not surprising because discs were never meant to be heard that way. However, it does allow easy re-EQ by the user in the future.
This means George Blood delivers both groove walls of 4 different stylus sizes with and without EQ for a total of 16 channels of audio. The most comprehensive presentation of 78rpm discs ever!
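The count of 16 follows from multiplying the three choices together: 4 stylus sizes, 2 groove walls, and 2 equalization states. As a small illustration, the sketch below enumerates the combinations. The labels are placeholders chosen for this example and are not the project's actual channel or file naming.

```python
from itertools import product

# Placeholder labels; the project's real naming scheme is not described here.
STYLI = ("stylus_1", "stylus_2", "stylus_3", "stylus_4")
GROOVE_WALLS = ("wall_a", "wall_b")
EQ_STATES = ("with_eq", "no_eq")


def channel_layout() -> list:
    """Every stylus / groove wall / EQ combination, one entry per channel."""
    return [f"{stylus}/{wall}/{eq}"
            for stylus, wall, eq in product(STYLI, GROOVE_WALLS, EQ_STATES)]


if __name__ == "__main__":
    channels = channel_layout()
    print(len(channels))
    for name in channels:
        print(name)
```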
Enhancing Metadata for Description and Findability
Transforming these recordings into meaningful components of the World Wide Web has started by enhancing the metadata, linking from these recordings to other resources, and linking from other resources to these recordings.
Metadata has been augmented by librarians and volunteers in a project coordinated by the Internet Archive. People start by writing information into the reviews of the 78s on archive.org, including where they got the information, for instance from DAHR, WorldCat, 45worlds.com, or discogs.com. From there, administrators promote the information into the descriptions or fields, such as genre and release date, and then often promote the volunteers to do it themselves. The Internet Archive has written code to mine recording dates from sites such as 78discography.com. Please help in this effort.
Linking from these recordings to other resources has already started. Acoustic techniques are being used, for instance, to link to MusicBrainz’s AcoustID system, and volunteers are hand matching recordings to Billboard Magazine articles that came out when the discs were released and to subject-specific resources such as Hillbilly music. There are also links to Wikipedia pages for performers and recordings, and links into digitized books on the web. We see this collection growing to be a resource that brings researchers to the breadth of information about the era and customs.
Linking other resources to these recordings, such as Wikipedia and discographies, is just starting, but we hope this proceeds as the collection grows more complete.