Archives and the Future of News

"Of all our national assets, Archives are the most precious; they are the gift of one generation to another and the extent of our care of them marks the extent of our civilization." - Arthur G. Doughty, Dominion Archivist, 1904-1935

At the dawn of the web I was producing a radio show at WILL, the public broadcasting station at the University of Illinois. We were creating about 40 hours of news and other local programming every week, and at some point it occurred to me that the Internet could be a great way to extend its impact. The General Manager mistook my enthusiasm for actual knowledge, and I soon found myself managing the station website.

Like everyone coding websites in the mid-1990s, I learned by doing. I started building websites then and never stopped.

By the late 1990s it became possible to stream audio and video. Internet bandwidth was extremely limited at the time, as most of our audience was still on dial-up. So I downloaded RealProducer software and began encoding RealAudio files at 16 Kbps, and RealVideo files at 128 Kbps with a frame size of 320 by 240 pixels. Posting these on the web for anyone in the world to access on demand seemed like a miracle.

Then Stuff Happened

My first few WILL websites are no longer online, although samples are represented in the Internet Archive’s Wayback Machine. WILL’s RealMedia files no longer play at all because the server died and today nobody downloads RealPlayer. And by today's standards they would sound and look like crap anyway.

Media formats on the web have changed faster than any previous form of media. Real, QuickTime, Windows Media, all gone in a Flash. At some point it dawned on me that unless I saved the original media items from which the web versions were encoded, all the media we published online would soon become obsolete. So I began saving the source files, and I started a database of everything I saved.

I didn’t know it at the time, but that’s when I became an archivist.

Archival Practice

The World Wide Web is the most powerful platform for communication, media, and knowledge ever devised. The ability to link one item to another allows us to organize and present endless bodies of work for exploration by anyone with a networked device. But like another kind of web, the threads of content we weave on our websites are fragile and ephemeral. Technologies will change, servers will go dark, and files will go missing.

The Library of Congress has on display an original Gutenberg Bible printed in 1455. As journalists we aren’t writing bibles, but the stories we create represent the first rough draft of history. We have film from 100 years ago and can still view it.

Gutenberg Bible in display case at the Library of Congress

It turns out that many people love books and film, and there is now a large body of knowledge and practice around preserving them as physical objects. Many books and films have recently been digitized, and archival practices have been extended into the digital realm.

Archivists hold dear certain core principles, such as “always preserve the original materials if you can,” and “if you don’t know what you have, you don’t really have it.” With digital media this is really hard, especially if you have thousands of digital files.

Which was exactly my problem at WILL. Fortunately by this time I had become a true nerd. I started reading up on digital preservation and metadata standards, and connected with communities like the Association of Moving Image Archivists. I became involved in the PBCore metadata project and the American Archive. I took what I learned about archival practice and applied it to the websites I was building. I want to share with you one of the most important things I learned from this work.

A Website is Not an Archive

The web has become the primary publishing point for nonprofit news. If your site is more than a couple of years old, you may have hundreds or even thousands of stories on it. The last website I built at WILL has more than 35 thousand posts.

That website will probably not last for more than a few years, and neither will yours. In their present form on our existing websites, all of our stories and posts are almost certainly doomed.

We can of course migrate the stories, media, and data to a new platform. WordPress and Largo make that relatively easy, and the sites hosted by INN are in good hands. But a website is only the visible tip of the content iceberg. The photos we publish online are resized and compressed. The web audio format today is MP3 or AAC. The web video format is mostly H.264/MP4. But in a few short years today’s web media formats are going to be replaced by the next generation of formats. For accessibility on the next generation of devices we will have to re-encode our online media, ideally from the highest-quality source files.

Comparison of H.264 and H.265 video: compared to today's AVC/H.264, the new HEVC/H.265 maintains video quality while cutting file size and bit rate almost in half - very important in a world of billions of smartphones.

It gets worse: If you’re embedding content hosted by third-party services like SoundCloud, YouTube, or Vimeo, there is no guarantee your media will still be online in the future. Or that the descriptive metadata you uploaded with the media will still be available. The third-party service may go out of business, get acquired by another company, get hit by a rights dispute, change its terms of service, or just delete your content for whatever reason. In the short history of the web, this has happened many times. (Anyone have a Geocities website? No you don’t!)

What To Do About It

Let’s say you publish a story on your website, and you want to make sure it can be preserved no matter what happens to your website. The solution to this is simple but really hard:

1. Save the ingredients in their most raw and unprocessed form. Save the copy in plain-text format. Keep the original photographs in their highest-resolution form. With audio and video, save the original files or tapes. Any captions, credits, tape logs, shot lists, scripts, rights clearances, transcripts, and graphic elements should be saved as well. Find a way to securely store all this content.
2. Record enough information (i.e. metadata) about everything to know what it is and where to find it. This could be as simple as a spreadsheet or a database, depending on the scale of the content and who needs access. You could use an open source asset management system like ResourceSpace, or a system any number of vendors would be happy to sell you.
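
For the "as simple as a spreadsheet" end of that range, here is a minimal sketch in Python of what such an inventory could look like. It is an illustration, not a description of the system I built at WILL: the function names and the column names (relative_path, size_bytes, modified_utc, sha256) are my own assumptions, and a real archive would add descriptive fields such as title, creator, date, and rights. The checksum column is there so you can later confirm that a stored file has not been silently altered or corrupted.

```python
import csv
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(source_dir: Path) -> list[dict]:
    """Describe every file under source_dir: where it is and what it is."""
    rows = []
    for path in sorted(source_dir.rglob("*")):
        if not path.is_file():
            continue  # skip directories
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        rows.append({
            "relative_path": path.relative_to(source_dir).as_posix(),
            "size_bytes": stat.st_size,
            "modified_utc": modified.isoformat(),
            "sha256": sha256_of(path),
        })
    return rows


def write_manifest(rows: list[dict], manifest_path: Path) -> None:
    """Write the manifest as a CSV file that any spreadsheet can open."""
    fields = ["relative_path", "size_bytes", "modified_utc", "sha256"]
    with manifest_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

Calling write_manifest(build_manifest(Path("story-folder")), Path("manifest.csv")) on a hypothetical folder of source files produces one row per file. Keep the manifest somewhere other than the folder it describes, so that losing one does not mean losing both.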

Of course we should keep the story online as long as possible, ideally at the same URL. This becomes more difficult with every website redesign and content migration. Over time, every so-called permalink will almost certainly succumb to link rot.
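
One common way to slow link rot across migrations is to keep a running map from every old address to its current one. The Python sketch below shows the idea only; the paths and the resolve function are hypothetical, and in practice the map would usually live in your web server or CMS redirect settings rather than in a script.

```python
# Hypothetical redirect map: each redesign adds entries, none are removed.
REDIRECTS = {
    "/news/2004/story.html": "/archives/2004/story",  # first migration
    "/archives/2004/story": "/stories/story",         # second migration
}


def resolve(old_path: str, redirects: dict[str, str], max_hops: int = 10) -> str:
    """Follow redirect chains so a link from any era lands on the current path."""
    seen = set()
    current = old_path
    while current in redirects:
        if current in seen or len(seen) >= max_hops:
            raise ValueError(f"redirect loop or chain too long at {current!r}")
        seen.add(current)
        current = redirects[current]
    return current
```

Here resolve("/news/2004/story.html", REDIRECTS) follows both hops and returns "/stories/story", while a path that was never moved comes back unchanged. The map only works if someone keeps adding to it at every migration, which is the active curation this section is about.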

"If you want to save something online, you have to decide to save it," writes Adrienne LaFrance in The Atlantic. "Ephemerality is built into the very architecture of the web, which was intended to be a messaging system, not a library." In other words, if your content has enduring value to you or others, keeping it online over time will require active curation.

Today’s News is Tomorrow’s History

“The written symbol extends infinitely, as regards time and space, the range within which one mind can communicate with another.” - Samuel Butler, Life and Habit, London: Trubner & Co, 1878

Will anything we publish on the web today be accessible 100 years from now? If the stories we publish have value today, this is an important question.

The New York Times has fully digitized articles dating to its founding in 1851. You can explore this yourself in the TimesMachine. Evan Sandhaus, lead architect of semantic platforms at the Times, says the Internet has forced the Times to confront “challenges that are more often encountered in the library space than they are in the online publishing space.”

I’m sorry to tell you this, but we need to become librarians and archivists on top of all the other hats we wear. If what we produce matters beyond today, we need to care for it over time. We need to manage our content for the long haul, and not deceive ourselves into thinking our website is an archive.

Our websites are merely the tip of the iceberg of the archive we should be building.