Archive Not Found

The Challenge of Persistence of Archival Websites

Posted by Jack Brighton on May 10, 2015

Imagine a global library card catalog where you could retrieve any book by simply touching its title. But sometimes you touch the title and get a message saying “we can’t find that book.” This happens often enough that you begin to expect you may not get the exact book you were looking for. This might be OK, because it’s a big card catalog and there are many similar or related books.

Let’s say you are a cultural heritage institution with a collection of unique and important books. You write grants to connect your books with the global catalog so anyone can find and use them. For the first time people begin finding and using your books, and this contributes invaluable primary resources for scholars and citizens worldwide. The funding agencies are happy and you are happy. Unfortunately, the connection of your books to the global catalog is based on technology that soon becomes obsolete. You don’t have funding or technical resources to upgrade. Soon people can no longer find your books, and the money and time that went into making them accessible has become so much water under the bridge.

At this writing there are some 940 million websites on the World Wide Web hosting 4.71 billion indexed web pages. Each web page has a URL that is unique throughout the world, and always points to a specific web resource. Yet the average lifespan of a web page is only about 100 days, so many URLs yield the error message 404 Not Found.

Our global catalog has a lot of missing books, and many of them belong to us.
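For an institution that keeps a list of the URLs it has published, the scale of the 404 problem can at least be measured. What follows is a minimal sketch in Python, not a production link checker. The function names `check_url` and `find_missing` are hypothetical, and the sketch assumes the server answers HEAD requests, which not every server does.

```python
import urllib.error
import urllib.request


def check_url(url, opener=urllib.request.urlopen, timeout=10):
    """Return 'ok', 'not found', or 'error' for a single URL.

    The opener parameter is an assumption of this sketch: it lets a caller
    substitute a stand-in for urlopen when no network is available.
    """
    # HEAD asks for headers only, so the page body is not downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return "ok" if response.status < 400 else "error"
    except urllib.error.HTTPError as exc:
        # 404 Not Found and 410 Gone both mean the resource is missing.
        return "not found" if exc.code in (404, 410) else "error"
    except (urllib.error.URLError, OSError):
        # DNS failures, refused connections, timeouts, and similar problems.
        return "error"


def find_missing(urls, opener=urllib.request.urlopen):
    """Return the URLs from the list that come back as 'not found'."""
    return [url for url in urls if check_url(url, opener=opener) == "not found"]
```

Run on a schedule against a collection's published URLs, something like `find_missing(list_of_urls)` would give an early warning that the books are going missing, before a user or a funder notices.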

Sustainability of Web Presence

For all of its ubiquity in our lives and work, the World Wide Web is barely more than 20 years young. Many of us working in the archival professions have learned a great deal about how Internet technology works, and what it takes to build a website. Increasingly we digitize and post oral histories, film archives, and other audiovisual objects on our institutional and project-based websites. For our institutions themselves, a credible web presence is no longer optional.

During an Unconference session at the 2015 CLIR Cataloging Hidden Special Collections and Archive Pre-Symposium, a roomful of archivists and technologists considered the challenges of maintaining a web presence for grant-funded archival projects after the funding runs out. In a lively discussion, it became clear that many of the same issues are faced by every institution, and in every project.

We are in the business of not only preserving the items and collections in our care, but also preserving access to them in a world where the technologies of access are rapidly changing.

RealVideo, anyone?

I remember publishing my first web video, and marveling that people all over the world could see it. I used a desktop program called RealProducer to encode the video at a size of 360 by 240 pixels and a bitrate of 128 kilobits per second. We hosted these videos on a Microsoft IIS server running Active Server Pages, and used Microsoft FrontPage to add and edit the website content.

Most of these technologies are now gone. The 360 x 240 video that seemed miraculous then is laughable today. Worse, it’s not really even playable.

There was a brief moment when QuickTime for the Web ruled the roost, then Windows Media, then Flash, but those moments are also gone. Today’s web video flavor is H.264/MPEG-4, but this will soon be replaced by a new flavor.

The technology stack, digital media formats, software tools, and design trends of yesterday’s web are gone today, and today’s will be gone tomorrow. The emergence of Responsive Web Design techniques makes most websites designed before 2012 look like refugees from the Wayback Machine. To be sure, this is an inevitable byproduct of innovation, and each new development is a hallmark of qualitative change for the better.

It also means that archivists and their institutions must keep pace with innovation and change in the building blocks of their web presence. For our projects to mean anything beyond a moment of success and perhaps a good paper, the idea of a “sustainable” web presence must become part of the planning and funding for our archival projects, collections, and institutions.

Factors in the Sustainability of Web Presence

So what is required for a sustainable website? Here’s a brief summary of essential elements discussed during our CLIR Unconference session:

IT Infrastructure

Websites run on web servers, which depend on a stack of technologies. Complexity of the stack can vary, but in most cases these days it’ll be a LAMP stack (Linux, Apache, MySQL, PHP) running on a virtual server, or the equivalent Microsoft version. You can easily contract for this with a vendor, or manage the stack yourself on your own hardware or in the cloud on Amazon Web Services. There are trade-offs in a) cost, b) ease of use, and c) level of support. You either need to know a lot about setting up machines and server applications, or pay someone else to do this for you. Paying for hosting is relatively inexpensive for sites with low traffic, and is the easiest part of maintaining a website. For higher-traffic sites the cost will scale.

Web Design and Development Expertise

For most websites hosting archival records and multimedia content, a standard WordPress site just won’t cut it. This means you are designing and developing the website, and these are two different things.

Design is about the structure and layout of the user interface, and understanding what the user experience should be. Design also involves art direction, branding, and production of all the graphic elements, and in most cases knowledge of HTML and CSS. Large design shops break these into separate tasks, and have many people with specific skills.

Development is about coding the website based on the design. This almost always involves setting up a database to accommodate the content model for the website, and the programming needed to add content to the database and display it on web pages. Content Management Systems (CMSs) do much of this work out of the box, but often must be customized based on what the site needs to do for users. Any large web shop will have multiple developers with different strengths.

For archival institutions and projects, you can either have these skills on staff, or contract with an individual or agency. There are good arguments and tradeoffs for either solution:

  • Having web expertise on staff allows active maintenance and enhancement of a website over time. As the web itself continues to evolve, the website can also evolve.
  • Web staff can more easily help archival staff publish and maintain website content, especially multimedia content.
  • Institutional knowledge of how the site is constructed is very important. If you hire out the site to an outside entity, you are less likely to maintain this knowledge.
  • Full-time web design/development skills are in high demand, and the pay scale reflects this.
  • Contracting for web design/development frees the organization from long-term financial commitments to those skills.
  • Agencies with multiple designers and developers are typically able to build very sophisticated websites, in terms of both user interface and functionality.
  • Unless a support contract is in place with the agency, the more sophisticated the site the more difficult it will be for anyone else to maintain it over time.
  • In either case, web staff or agency personnel will eventually come and go. Documentation and institutional knowledge of how the website is constructed and how it works are essential for sustainability.

Changes in web technology and multimedia formats

The stability of HTML, CSS, and JavaScript makes the web itself quite sustainable. The instability of everything else we do on the web makes it break all the time. We can’t rely on today’s web multimedia formats to survive more than a few years. We have film from 100 years ago we can still play. Will any video we publish on the web today be playing in 100 years? 50? Five years maybe, but probably no longer. Most recently, the explosion of mobile devices and browsers changed everything again.

If we are publishing important multimedia content on the web, we must be committed to republishing it in different formats in the future. That commitment requires preserving the highest-resolution digital master of each object. It’s possible we can re-digitize from analog source materials in the future as digitization technologies improve. It’s also likely that analog source materials will degrade, and that costs for digitization in the future will be prohibitive.

Sustainability of web presence for multimedia collections requires an institutional commitment to reformat the collections over time. Digital master files will have to be stored and migrated as storage technologies change. People sometimes ask: what is the best preservation format for digital media? The only answer now is a commitment to manage change.
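One practical piece of managing that change is confirming that master files survive each storage migration unaltered. The sketch below is an illustration in Python under stated assumptions, not a method discussed at the session: it records a SHA-256 checksum for every file in a folder, then compares a later set of checksums against the earlier one. The function names and the single-folder layout are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a file's SHA-256 checksum, reading in chunks so that
    large master files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(old_manifest, new_manifest):
    """List files from the old manifest that are missing from the new
    one or whose checksum no longer matches."""
    return sorted(
        name
        for name, checksum in old_manifest.items()
        if new_manifest.get(name) != checksum
    )
```

The idea would be to save the output of `build_manifest` before a migration, build a second manifest on the new storage, and pass both to `changed_files`. An empty list suggests the masters arrived intact; anything else names the files that need a second look.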

The Cloud is a wonderful thing, but what about climate change?

“We and our affiliates and licensors make no representations or warranties of any kind, whether express, implied, statutory or otherwise regarding the service offerings or the third party content, including any warranty that the service offerings or third party content will be uninterrupted, error free or free of harmful components, or that any content, including your content or the third party content, will be secure or not otherwise lost or damaged.”

From the Amazon Web Services Customer Agreement

Much of the world now runs on Amazon Web Services (AWS), Microsoft Cloud, IBM, and any number of smaller cloud computing services (many of which repackage AWS, etc.). These services provide for any kind of scale at very efficient costs, including backup services, database applications, and a wide range of other services.

Cloud hosting also includes multimedia platforms like YouTube, Vimeo, SoundCloud, and Flickr Commons. These services are extremely helpful to anyone publishing media on the web, and are used by many archivists in various ways.

Do they provide sustainable solutions for archival institutions and projects? Sure, unless:

  • They are purchased by a competitor
  • They change their Terms of Service in a way that no longer supports our requirements
  • They decide to change their business model
  • They go out of business

In the history of the web, this has happened many times. (Anyone have a Geocities website? No, you don’t!)

I think it’s likely that cloud hosting will continue to grow and be important for lots of reasons. But the cloud services of today do not exactly meet the definition of a trusted digital repository.

Social Media is the new broadcasting

If an archival collection has a website that nobody ever visits, does it really exist? Successful websites today are places where content is published, then extends into social media space. Impact with audiences on the web now requires active engagement with them on Twitter, Facebook, Instagram, and a growing and changing cast of social media platforms where people learn about, discuss, and share content. But we can’t effectively farm out social media engagement; it needs to be done by people who know the content and the communities for whom it’s relevant.

If we want to know what our impact is with online audiences, we also need to understand and use analytics. Tools like Google Analytics allow us to see how people are interacting with our website, and guide us to improve it. If funders are interested in the public impact of our work, we now have the means to demonstrate this over time. Effective tracking and use of web and social media analytics requires specific skills we now need in our tool kit.

Questions to and from funders

We say one of our core values is sustainability. But how do we define this? In the context of web presence for archival projects and institutions, our CLIR Unconference discussion raised many questions but few answers. Grants typically discourage use of the funds for hardware and long-term staffing. But sustainability involves ongoing expenses.

The Google Doc notes from our discussion contain many more points of reference, ideas, and questions. But one thing stands out as a point of action: We need a discussion with funders on creative strategies for sustainable web presence beyond the life of the project. There may be ways we can leverage the resources of partner and allied institutions, and provide web services and solutions at a more sustainable scale. We might usefully consider the value of situating these services within the archival community, instead of leaving each project to look outside the community for solutions. We might then be able to incorporate proposals and solutions for more sustainable web presence within the RFP process itself.

Sustainable Access to Archives

“The Opportunity before us is living up to the dream of the Library of Alexandria and then taking it a step further: Universal access to all knowledge. Interestingly, it is now technically doable.”

Brewster Kahle, founder, Internet Archive

Twenty years ago we considered the web a novel thing. We still browsed the physical stacks when we wanted deep knowledge, and I fondly remember the dim lights and narrow aisles crammed with row upon row over floor upon floor of actual books. There were serendipitous discoveries to be made in those stacks, but more often someone else had already checked out my book.

Today we have a different kind of stack, and the challenge is not the scarcity of books but the persistence of the shelves. This is not so much a technical problem as an institutional one. How are we aligned to provide for the persistence of digital media objects? How can we make them accessible forever? And if that isn’t the objective of our archival work, why not?


Image Credits:

“Tumbeasts servers” by Matthew Inman. Licensed under CC BY 3.0 via Wikimedia Commons