Five Tips For Digital Preservation of Web Archives

     

There are two broad concepts for preservation of web content:
<ol>
<li>Preservation in the context of maintaining accessibility of web content in the long term</li>
<li>Preservation in the context of E-Discovery</li>
</ol>
This post is concerned with the first of these, maintaining accessibility of web content in the long term. I've already written about <a title="Where Does a Web Archive Fit Into the eDiscovery Reference Model?" href="http://www.hanzoarchives.com/where-does-a-web-archive-fit-into-the-ediscovery-reference-model/">web and social media preservation for eDiscovery</a>.

<strong>Maintaining accessibility of web content in the long term</strong>

Preservation for web archives is a subset of the activities and ideas defined more generally for digital preservation. <a href="http://en.wikipedia.org/wiki/Digital_preservation">Digital preservation is defined in Wikipedia</a> as:
<blockquote>"the active management of digital information over time to ensure its accessibility."</blockquote>
These activities require constant attention to avoid digital obsolescence caused by, for example, the abandonment of software or encoding technologies, file-format evolution and protocol enhancements. Imagine a web archive of a typical corporate website, captured now, in March 2011. Then imagine accessing the archived content in 10 or 15 years' time. Will the embedded Flash still work? Will the video play? Will the web browser of 2026 tolerate today's poor HTML markup? Will the JavaScript run? If I had to guess, the answer to most of these questions will probably be 'not entirely'.

Web browsing technologies of 2026 are likely to be quite different from those of today, given the rapid and accelerating pace of development. Thanks to the <a href="http://archive.org/">Internet Archive's Web Pioneers Collection</a> we can see evidence of the past evolution of web technologies. Thanks to open standards and the W3C, these websites remain reasonably functional, as their technology is open and still at the core of the web today. However, with the increasing use of proprietary technologies such as Flash, and of dynamic rendering and interactivity using JavaScript, the future is far less certain for today's websites.

So what does this mean for web content, particularly native format web archive content, as provided by Hanzo Archives? What can digital preservation do for us?

Hanzo are focussed on two main processes for active preservation of web content:
<ol>
<li>Migration</li>
<li>Emulation</li>
</ol>
<strong>Migration</strong>

Migration is the transfer of data from one system to another, or conversion of one file-format to another so the resource remains fully accessible and functional.

A web archive example could be as follows. Should an image format become obsolete, a conversion process can be developed, in which files of the original format are converted to a new format. The converter should be devised to avoid loss of image fidelity or functionality.

A web archive can take advantage of such a converter in two ways:
<ol>
<li>In a batch process, convert all instances of the original file format contained in the archive to the new format, and update the metadata records to show that this conversion has taken place; or</li>
<li>At access-time, convert files of the original format on-the-fly to the new format.</li>
</ol>
In either case, the resource remains accessible in the new browser.
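The two approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of Hanzo's implementation: the <code>Record</code> type, the <code>image/x-old</code> and <code>image/x-new</code> MIME types, and the <code>convert_old_to_new</code> function are all hypothetical stand-ins for a real archive record and a real format converter.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical MIME types standing in for an obsolete format and its successor.
OLD_MIME = "image/x-old"
NEW_MIME = "image/x-new"


@dataclass(frozen=True)
class Record:
    """A simplified archive record (an assumption, not a real WARC record)."""
    uri: str
    mime: str
    payload: bytes
    metadata: dict = field(default_factory=dict)


def convert_old_to_new(payload: bytes) -> bytes:
    """Placeholder converter; a real one would re-encode the image."""
    return b"NEW:" + payload


def migrate(record: Record) -> Record:
    """Return a converted copy; the original record is left untouched."""
    return Record(
        uri=record.uri,
        mime=NEW_MIME,
        payload=convert_old_to_new(record.payload),
        # Record where the converted copy came from.
        metadata={**record.metadata, "migrated_from": record.mime},
    )


def batch_migrate(archive: list[Record]) -> list[Record]:
    """Option 1: convert every obsolete record ahead of time.

    Converted copies are appended; the originals stay in the archive.
    """
    derived = [migrate(r) for r in archive if r.mime == OLD_MIME]
    return archive + derived


def fetch(archive: list[Record], uri: str) -> Record:
    """Option 2: convert on the fly when an obsolete record is requested."""
    record = next(r for r in archive if r.uri == uri)
    return migrate(record) if record.mime == OLD_MIME else record
```

In both functions the original record is never modified: the batch process stores converted copies alongside the originals, while the access-time path creates a converted copy only for the duration of the request.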

Hanzo keep a close watch on a number of projects around the world, particularly amongst the digital library community. Should a migration ever be necessary for a customer, we are able to introduce either an access-time migration or a batch process. This would not compromise the integrity or authenticity of the archive content, as the original will never be deleted or changed. We will simply effect the migration using the best-practice techniques we have developed around the WARC standard and within the IIPC community.

<strong>Emulation</strong>

Emulation is the replication of a system's functionality. Emulators are very popular in other contexts, such as gaming, where one can find many emulators of obsolete systems such as DOS, Atari and the Commodore 64, which can be used to play old games on new machines and operating systems.

Hanzo and the preservation community at large believe there is considerable promise in emulation as a preservation strategy for complex media, as it is relatively easy for us to implement compared with migration. This is especially true for proprietary and "dark" file formats and code, where maintaining a licensed copy of the original target environments, together with an emulator or virtual machine to run them, is relatively straightforward.

A web archive example could be as follows. Create a software emulator for a standard PC that can run today's Windows. Provided the emulator is well constructed as portable code, it should be possible to keep it running for years to come. It will then be possible in the future to run today's operating systems and browsers within the emulator to access a web archive and see how that archive content would have looked.

<strong>Hanzo's Preservation Plan</strong>

Preservation is a long-term endeavour: its aim is to keep archived content accessible far into the future. However, today, we don't need to solve the long term ourselves. We only need to consider preservation for the foreseeable future. Provided we can preserve our web archives for this generation, using strategies such as those discussed here, we will be able to hand the problem to the next generation to solve for their own foreseeable future. In this way, today's web archives will be preserved for the long term.

Hanzo captures and preserves web content for many commercial and government institutions around the world. Our preservation plan is to ensure our customers have the means to access their archived content at any time through pragmatic initiatives such as:
<ol>
<li>Keep track of developments in web technologies and file-formats</li>
<li>Ensure we keep sufficient metadata and indexes to be able to identify "at risk" file formats contained in the archive and actively manage them</li>
<li>Ensure we keep virtual machines with images that are representative of key releases of platform software, including operating systems, web browsers, plugins and so on</li>
<li>Work with the web archiving community as a whole, through organisations like the <a href="http://www.netpreserve.org/about-us">International Internet Preservation Consortium</a> of which we are a member, to ensure full collaboration with the major projects and initiatives around the world</li>
<li>Continue to work on open standards-based archive technologies, to ensure our customers receive the full benefit of preservation initiatives worldwide</li>
</ol>
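As an illustration of the second initiative, the sketch below scans a metadata index for "at risk" file formats. It is only a minimal example under stated assumptions: the index is modelled as simple (URI, MIME type) pairs, and the <code>AT_RISK_MIME_TYPES</code> watch list and the <code>at_risk_report</code> function are hypothetical names, not part of any Hanzo product.

```python
from collections import Counter

# Hypothetical watch list; a real one would be maintained by tracking
# developments in web technologies and file formats.
AT_RISK_MIME_TYPES = {"application/x-shockwave-flash", "application/x-director"}


def at_risk_report(index):
    """Count archived resources per at-risk MIME type.

    `index` is an iterable of (uri, mime_type) pairs, a simplified
    stand-in for an archive's metadata index.
    """
    counts = Counter(mime for _uri, mime in index if mime in AT_RISK_MIME_TYPES)
    return dict(counts)


index = [
    ("http://example.com/", "text/html"),
    ("http://example.com/intro.swf", "application/x-shockwave-flash"),
    ("http://example.com/banner.swf", "application/x-shockwave-flash"),
    ("http://example.com/logo.png", "image/png"),
]
print(at_risk_report(index))  # {'application/x-shockwave-flash': 2}
```

A report like this only works if the MIME type (or an equivalent format identifier) was recorded at capture time, which is why keeping sufficient metadata matters.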
Through these initiatives Hanzo will ensure that, in the event of an evolutionary step in technology that adversely affects our customers' archived web content, we'll have the means, knowledge and tools to keep it accessible.

<strong>Five Tips For Digital Preservation of Web Archives</strong>

As a web content owner within a commercial or government organisation you have some pretty major considerations and responsibilities concerning digital preservation. You need to make sure that whichever archive technology you use, you avoid lock-in to proprietary formats, avoid systems that do not adhere to standards, and ensure best practices for web archive content are always followed. So check your options! Here are five tips for digital preservation of web archives.
<ol>
<li>Your web archive system should store content, without modification, in native format</li>
<li>Make sure it collects metadata about your content and files as they are captured from the web, as well as the content itself</li>
<li>Make sure your archive is based on client-side web archiving technology, so that it is independent of publishing platform</li>
<li>Ensure your archive uses ISO 28500 WARC files to store your content, to avoid proprietary lock-in</li>
<li>Do all of the above to ensure you benefit from the global community of archivists and preservation specialists building on the same best practices and foundations</li>
</ol>
You should ensure your web archive satisfies all five of these points, to give you the best possible basis for digital preservation of your web content over the long term.
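Tip 4 refers to ISO 28500 WARC files. As a rough illustration of why an open, plainly framed format helps avoid lock-in, the sketch below writes and re-reads one simplified WARC-style record using only the Python standard library. Treat it as a teaching aid rather than a conforming WARC implementation: it handles a single uncompressed record, the helper names <code>write_record</code> and <code>read_record</code> are hypothetical, and the choice of SHA-256 for the digest is an assumption made for this example.

```python
from __future__ import annotations

import hashlib
import uuid
from datetime import datetime, timezone


def block_digest(payload: bytes) -> str:
    """Label a SHA-256 digest of the record block (algorithm is an assumption)."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def write_record(target_uri: str, payload: bytes,
                 content_type: str = "application/octet-stream") -> bytes:
    """Serialise one simplified WARC-style 'resource' record."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Block-Digest", block_digest(payload)),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # The block is stored byte-for-byte, followed by two CRLFs.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_record(data: bytes) -> tuple[dict, bytes]:
    """Parse the headers and return the block, using Content-Length."""
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = dict(line.split(": ", 1) for line in lines[1:])
    length = int(headers["Content-Length"])
    return headers, rest[:length]


record = write_record("http://example.com/logo.png", b"\x89PNG example bytes")
headers, payload = read_record(record)
# The stored digest lets anyone confirm the content is unmodified.
assert headers["WARC-Block-Digest"] == block_digest(payload)
```

Because the headers are plain text and the content is stored unmodified, a record like this can be read back with a few lines of code in any language, which is the practical meaning of avoiding lock-in.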

We describe how Hanzo Enterprise Web Archiving meets all of the above criteria for digital preservation in our white paper.

<a title="Web Archiving For Compliance White Paper" href="http://www.hanzoarchives.com/resources/web-archiving-for-compliance-white-paper/">Download Web Archiving White Paper</a>
