Website Archiving Prevents Website Extinction


The ReadWriteWeb article "After 20 Years, Is The Website About To Become Extinct?" is fascinating reading for web archivists, because the discipline really is at a turning point in its ability to collect modern websites.

As a web archivist at Hanzo Archives, I deal with increasing complexity and evolving technology every day. Our customers build their websites and social media presence to deliver the best customer experience possible - and I have to archive it all for them, however complex it is.

Website technology and complexity (and by website I include social media - or better still, the social web) can be described on a continuum, from the simplest static sites (remember the '90s?) to sites that approach the complexity of full applications.

<a href="http://www.hanzoarchives.com/wp-content/uploads/2013/05/website-complexity.jpg"><img class="alignright size-full wp-image-215" alt="website-complexity" src="http://www.hanzoarchives.com/wp-content/uploads/2013/05/website-complexity.jpg" width="577" height="183" /></a>

As the article implies, the more complex websites are becoming increasingly "application-like", which has serious implications for archivability. Will it still be possible to archive them? At Hanzo, we think "yes", and here's how.

<strong>First Principles</strong>

When you consider all the details involved, it's easy to conclude that capturing a modern website is pretty overwhelming. The breakthrough is to approach the problem from first principles: when you "view" the web, you're actually viewing a data-stream generated in response to actions on your part, within an application that can turn that data-stream into something meaningful (a web browser turns the data-stream into a viewable web page). At this level, there is no conceptual reason why that data-stream cannot be captured and played back at a later date.
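To make the capture-and-replay idea concrete, here is a minimal sketch - not Hanzo's production tooling, just an illustration - using Python's requests library. Each response data-stream is stored keyed by the request that produced it, and can later be played back without touching the live site; the file name and URLs are placeholders.

```python
import json
import requests

STORE_PATH = "capture_store.json"  # illustrative file name


def capture(urls, store_path=STORE_PATH):
    """Fetch each URL and save the raw response data-stream for later replay."""
    store = {}
    for url in urls:
        resp = requests.get(url)
        store[url] = {
            "status": resp.status_code,
            "headers": dict(resp.headers),
            "body": resp.text,
        }
    with open(store_path, "w") as f:
        json.dump(store, f)


def replay(url, store_path=STORE_PATH):
    """Return the archived response for a URL without contacting the live site."""
    with open(store_path) as f:
        store = json.load(f)
    return store[url]


if __name__ == "__main__":
    capture(["http://example.com/"])
    archived = replay("http://example.com/")
    print(archived["status"], archived["body"][:60])
```

The essential point is the separation of capture from playback: once the data-stream is stored, the original site no longer needs to exist for the archived experience to be reproduced.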

Obviously, when it comes to the detail, it gets pretty hard: recall the phrase "in response to actions on your part". Initiating and recognising responses to actions is a problem for most archivists, because the tools they use - traditional crawlers, for example - have only been able to generate a limited set of actions, such as mouse-clicks. At Hanzo, one of the key capabilities of our in-house tools is to generate a far wider range of actions and inputs to a website, and to recognise and archive the results. Our crawler can model sequences of events of various types and capture the responses. We use this, for example, to capture forms being filled in and the responses they produce, or search results for a predetermined set of values. There are limits, of course: our crawlers don't generate content on the fly, and they don't play games. But if a website can return pre-generated content to a user, we can capture it.
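As a generic illustration of what scripted actions against a site can look like (this is not Hanzo's crawler), a headless browser can be driven through a sequence of events - navigate, fill a form, submit - and the resulting page captured. The example below assumes the Playwright Python package, and the URL, form selectors, and queries are entirely hypothetical.

```python
from playwright.sync_api import sync_playwright

# Hypothetical target and selectors - purely illustrative.
URL = "https://example.com/search"
QUERIES = ["archiving", "compliance", "social media"]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    for query in QUERIES:
        page.goto(URL)
        page.fill("input[name='q']", query)      # fill in the search form
        page.click("button[type='submit']")      # trigger the action
        page.wait_for_load_state("networkidle")  # let the responses arrive
        html = page.content()                    # the rendered result to archive
        with open(f"capture_{query}.html", "w") as f:
            f.write(html)
    browser.close()
```

Running a predetermined set of values through a form in this way is one simple version of "modelling sequences of events": the crawler supplies the inputs, and the archive records whatever pre-generated content the site returns.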

<strong>Moving Forwards</strong>

The above article suggests that archiving will get harder - we certainly aren't going to argue with that. But we would argue that this is exactly what archivists thrive on: capturing the modern web is our job, and we have to capture the most challenging web content and social media in its native form. Our engineers are continually creating ingenious solutions to the wide and varied problems that archiving the web poses. As a result, four generations of technology development later, our crawlers and archiving tools are very different from those used by other archiving institutions, even the national libraries and the Internet Archive (though we remain compatible thanks to standards). Consequently we are already able to capture sites such as those described in the article, and we are continuing to develop new capabilities at web pace.
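The post doesn't name the standard, but the common interchange format among archiving institutions is WARC. As a hedged illustration of that compatibility, the open-source warcio library can write a captured response as a standard WARC record that other tools can read back; the URL and payload below are placeholders.

```python
from io import BytesIO
from warcio.warcwriter import WARCWriter
from warcio.statusandheaders import StatusAndHeaders

with open("example.warc.gz", "wb") as output:
    writer = WARCWriter(output, gzip=True)

    # Placeholder HTTP response headers and body for the captured page.
    http_headers = StatusAndHeaders(
        "200 OK", [("Content-Type", "text/html")], protocol="HTTP/1.0"
    )
    record = writer.create_warc_record(
        "http://example.com/",
        "response",
        payload=BytesIO(b"<html><body>archived page</body></html>"),
        http_headers=http_headers,
    )
    writer.write_record(record)
```

Because the on-disk format is standardised, an archive written this way can be replayed or analysed by tools from entirely different organisations.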

There are going to be some interesting challenges ahead, such as the fading importance of the URL (Chrome is already trying to kill the address box, and we are seeing more and more sites whose content can only be reached via a sequence of clicks rather than a direct URL) and the rising popularity of apps, whose data-streams are interpreted not by a browser but by a custom piece of software. Moreover, the range of inputs a website can accept is growing, and of course the sheer volume of content on the web presents challenges of its own.

At Hanzo we pride ourselves on our ability to capture most things on the web. Looking at the future web through our crystal ball (admittedly a rather inexact science), we think we are up to the challenges ahead. In our view, the complex application-like websites we're capturing for our customers aren't going to go extinct!
