Open Standards are Important in Commercial Web Archiving


<a href=""><img class="alignright size-full wp-image-151" alt="warc-peek" src="" width="90" height="112" /></a>Standards give a company confidence that its web archive will protect its legal interests over the long term.

The market for web archiving services is developing and maturing rapidly. The recent announcement of <a href="">Reed Technology Web Archiving Services</a> (powered by Iterasi) shows that web archiving is now of interest to the incumbents.

One of the PR pieces written about Reed was ReadWriteBiz's blog post "<a href="">Create Your Own with Reed Tech Web Archiving</a>". The title set me thinking, as most commercial web archives are not at all similar to the Internet Archive. The Internet Archive is an open, publicly accessible, and standards-based web archive. Reed, Iterasi, Hanzo and others are mainly concerned with closed, private commercial web archives.

<strong>What about standards-based?</strong>

Although commercial web archiving is relatively new, the Internet Archive and other memory institutions around the world have long been interested in making sure that the material they gather is stored in a standard way, so that it can be accessed by future generations. This work has largely been carried out under the auspices of the <a href="">International Internet Preservation Consortium</a> (IIPC), an organisation focused on standards and best practices for web archiving. Hanzo has been a member of the IIPC since we started, and our founders were involved for several years before that!

Most significantly, a recent outcome of this organisation's work is the <a href="">Web ARChive file format, or WARC, now standardised as ISO 28500:2009</a>. WARC files are used to store and preserve web archive content in an open way, facilitating best practices, system interoperability, and long-term web content preservation. What is more, several open source tools exist to access and work with this file format.

So why might this be of interest in the field of commercial archiving?
<ul>
<li>Archived data is expected to be retained for a number of years. Using a standard storage format means that the data is safe, and will continue to be accessible, no matter what happens to the company that originally gathered it.</li>
<li>In the event that there are legal queries over the archived content, the use of standards means that third parties can examine the data using standard tools and come to their own conclusions.</li>
</ul>
Standards ensure that a company archiving its websites can have confidence in that archive to protect its legal interests and to preserve its corporate heritage.
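
To give a feel for what "examine the data using standard tools" can mean in practice, here is a minimal sketch in Python that lists the records in a WARC file using only the standard library. It is an illustration, not Hanzo's access tooling: the function names (iter_warc_records, open_warc, make_record) are made up for this example, the sample records are simplified (real WARC records also carry fields such as WARC-Record-ID and WARC-Date), and the parser assumes unfolded, single-line headers.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a binary WARC stream.

    Simplifying assumptions: header lines are not folded across lines,
    and header names are matched exactly as written in the file.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank lines that separate records
        version = line.strip().decode("utf-8")
        if not version.startswith("WARC/"):
            raise ValueError(f"unexpected record start: {version!r}")
        headers = {"version": version}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The body is exactly Content-Length bytes, so binary payloads are safe.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def open_warc(path):
    """Open a plain or gzip-compressed WARC file for binary reading."""
    with open(path, "rb") as f:
        magic = f.read(2)
    # 0x1f 0x8b is the gzip magic number; gzip.open also reads
    # files made of several concatenated gzip members.
    return gzip.open(path, "rb") if magic == b"\x1f\x8b" else open(path, "rb")


def make_record(record_type, target_uri, body):
    """Build a simplified WARC record (hypothetical helper for the demo)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {record_type}\r\n"
        f"WARC-Target-URI: {target_uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Demo on an in-memory sample rather than a real archive.
sample = make_record(
    "response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\nhello"
) + make_record("resource", "http://example.com/logo.png", b"\x89PNG")

for headers, body in iter_warc_records(io.BytesIO(sample)):
    print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

For a file on disk, the same loop would run over open_warc("collection.warc.gz"), where the file name is a placeholder. The point is less the code itself than the fact that a few dozen lines are enough for an independent party to start reading the archive without any vendor software.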

At Hanzo, we aren’t exactly neutral in this conversation. As mentioned, we have been members of the IIPC for a while, and took part in setting the WARC standard. We use WARC to ensure that customers like Coca-Cola preserve their websites (lots of them globally) and their social web content (YouTube, blogs, etc.) as part of their regular corporate archives. Other customers capture their web content and messaging for regulatory compliance. Our use of the standard web archiving file format means these customers are not locked into any proprietary data formats or closed services. They choose Hanzo because of the depth and quality of the archives we capture for them, in native format, and preserved in standards-based web archive files.

Back to the title “Create your own”. On a recent trip to Santa Clara, I handed a customer a USB drive containing their first collection, stored as a set of WARC files, together with a Windows version of our archive access tools. She was able to open it on her PC within two minutes and browse the archive of their website. That’s creating your own!
