Pages of the Web Archive


I noticed this <a href="">blog post today</a>, which is an excellent write-up on some of the challenges we web archivists face, and it highlights what I think is an important point. First off, after reading the post, I must admit to feeling a twinge of pride.

When Hanzo first started back in 2006, we also used the available open source web archiving technology. Soon came the realisation that keeping up with the dynamic web was going to be a colossal challenge. This is when we decided to develop our own software, enabling us to capture this kind of dynamic content.

Today, we know we made the right choice. The web of 2012 often makes the web of 2006 look rather simple. Our software has stood us in good stead. Already on its fourth iteration, there's now little on the web we can't capture (as you might guess, I'm not giving away our secrets as to *how* we capture it). Sometimes playback within the web archive presents us with one or two challenges, but they are quickly remedied. We also have upgrades in the pipeline to tackle future issues.

That’s not what I want to talk about though. The blog post I mentioned illustrates perfectly one of the complexities of the web presenting a challenge for eDiscovery and compliance: the fact that a web page isn’t a “thing.”

When you view a web page, you are pulling many small files across the web in a host of different formats (e.g. HTML, CSS, JavaScript, XML, PNG). Each of these is interpreted, and some, like JavaScript or Flash, are full-blown programs in their own right. We as readers of the web may see web pages like pages in a document, but a web page bears a closer resemblance to an ‘app’ than to a Word document.
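To make the "many small files" point concrete, here is a minimal sketch in Python that lists the external resources a browser would have to fetch for a page. The tag/attribute coverage is deliberately tiny and the sample page is invented for illustration; a real crawler handles far more cases (CSS `url()` references, script-generated requests, and so on).

```python
from html.parser import HTMLParser

class ResourceLister(HTMLParser):
    """Collects the external resources referenced by a page's markup."""

    # Map of tag -> attribute that names the fetched resource.
    RESOURCE_ATTRS = {"img": "src", "script": "src", "link": "href", "iframe": "src"}

    def __init__(self):
        super().__init__()
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.RESOURCE_ATTRS.get(tag)
        if attr_name:
            for name, value in attrs:
                if name == attr_name and value:
                    self.resources.append(value)

# Even this tiny page pulls in three separate files beyond the HTML itself.
page = """
<html><head>
  <link href="style.css" rel="stylesheet">
  <script src="app.js"></script>
</head><body>
  <img src="logo.png">
</body></html>
"""

parser = ResourceLister()
parser.feed(page)
print(parser.resources)  # ['style.css', 'app.js', 'logo.png']
```

Each of those referenced files can in turn reference more, which is exactly why "one page" is never one file.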

This makes the challenge of web archiving for compliance and eDiscovery twofold.
<ul>
<li>First, we need to meet the challenges presented in the blog post; as you've seen, these are challenges Hanzo's web archiving technology regularly solves.</li>
<li>Second, we need to get the web content into powerful eDiscovery tools. We do this using a component called the <a title="Web Archive Connector" href="">Web Archive Connector</a>.</li>
</ul>
Using the Web Archive Connector, we take each page (interesting in itself, as pages are an emergent property of websites, but that is a post for another day) and create a package containing the text of the page, a searchable PDF of the page, and a host of metadata, most importantly links back to the full native-format archive. This package can be loaded into eDiscovery tools, search engines, and other systems. With the Web Archive Connector, you get the best of both worlds: all the salient information can be searched and worked with, and it's backed by the depth of the full archive. Reviewers can also ‘run’ the pages, explore the full user experience, and see the original content with all its nuance and dynamic behaviour.
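The per-page package described above can be sketched as a simple data structure. This is purely illustrative: the field names, the `to_index_record` helper, and the example values are all my own invention, not Hanzo's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class PageReviewPackage:
    """Hypothetical sketch of a per-page package: text, a searchable PDF,
    metadata, and a link back to the full native-format archive."""
    url: str
    extracted_text: str            # full text of the page, for indexing
    searchable_pdf: str            # path to a rendered PDF with a text layer
    metadata: dict = field(default_factory=dict)
    native_archive_link: str = ""  # pointer back to the native-format capture

    def to_index_record(self) -> dict:
        """Flatten the package into a record an indexing tool could ingest."""
        return {
            "url": self.url,
            "text": self.extracted_text,
            "pdf": self.searchable_pdf,
            "native": self.native_archive_link,
            **self.metadata,
        }

# Example usage with made-up values.
pkg = PageReviewPackage(
    url="https://example.com/page",
    extracted_text="Welcome to our site...",
    searchable_pdf="archive/page-0001.pdf",
    metadata={"captured_at": "2012-05-01T12:00:00Z"},
    native_archive_link="warc://crawl-42/record-7",
)
record = pkg.to_index_record()
print(record["native"])  # warc://crawl-42/record-7
```

The key design point is the `native` link: the flattened record is what review tools search, while the link preserves a path back to the full, replayable capture.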
I feel a similar twinge of pride for our Web Archive Connector and web capture capabilities. Together they provide a page view of the web and the full native-format archive, working as one. It's powerful stuff, and I can’t help feeling that we and our customers have only dipped our toes into the pool of possibilities it opens up.
