November 30, 2015

Where native format web archives live — forever


How does Hanzo collect and preserve web and social media content?

  • Hanzo collects web and social media content by crawling original websites.

  • Each component that makes up a webpage is downloaded and stored inside a native format web archive. Each component is in the exact form it was captured in, and is stored with a rich collection of metadata, inside WARC files.

  • The WARC files are indexed to enable searching and browsing of content.

  • The WARC files are preserved in WORM-equivalent storage for as long as you want to keep them.


Some Definitions

  • Crawler: A program that reads a website. It captures all the resources that make up each page, such as style sheets, HTML, JavaScript, embedded media and files, and so on.

  • Native Format: Original, forensically-sound data that has not been transformed or altered.

  • WARC: Web ARChive file format. This is the ISO standard for preserving archived web content.


What is a Web Archive File (WARC)?

The Web ARChive file format, or WARC, is defined by ISO 28500.

  • The standard was created by the International Internet Preservation Consortium (IIPC) — an international body of experts in digital preservation, including people from the Internet Archive and the Library of Congress.

  • A WARC file is a container for archived web resources and metadata that provides structure to the data for processing, indexing, and access.

  • It preserves web data collected by a crawler or other program. If the crawler collects web resources without transformation, then a WARC file will preserve them exactly as they were.

  • It contains a host of relevant metadata that allows a forensic examiner to verify the integrity of all that has been captured.


Tell me more about the ISO 28500 WARC standard

ISO 28500:2009 specifies that the WARC file format does the following:

  • Store both the payload content and control information from mainstream Internet application layer protocols, such as the Hypertext Transfer Protocol (HTTP), Domain Name System (DNS), and File Transfer Protocol (FTP)

  • Store arbitrary metadata linked to other stored data (e.g. subject classifier, discovered language, encoding)

  • Support data compression and maintain data record integrity

  • Store all control information from the harvesting protocol (e.g. request headers), not just response information

  • Store the results of data transformations linked to other stored data

  • Store a duplicate detection event linked to other stored data (to reduce storage in the presence of identical or substantially similar resources)

  • Provide extensibility without disruption to existing functionality

  • Support the handling of overly long records by truncation or segmentation when desired.
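
As one illustration of the capabilities listed above, the duplicate detection item can be sketched in a few lines of Python. The header names (WARC-Type, WARC-Payload-Digest, WARC-Refers-To) and the "revisit" record type come from the WARC format; the DuplicateDetector class, its method name, and the placeholder record IDs are hypothetical and exist only for this sketch. A complete revisit record would also carry a WARC-Profile header, which is omitted here.

```python
import hashlib


class DuplicateDetector:
    """Tracks payload digests so repeat captures can be stored as small 'revisit' records."""

    def __init__(self):
        # payload digest -> record ID of the first capture with that payload
        self._first_seen = {}

    def headers_for(self, record_id: str, payload: bytes) -> dict:
        """Return the WARC headers that describe this capture."""
        digest = "sha1:" + hashlib.sha1(payload).hexdigest()
        headers = {
            "WARC-Record-ID": record_id,
            "WARC-Payload-Digest": digest,
        }
        original_id = self._first_seen.get(digest)
        if original_id is None:
            # First time this payload is seen: store it in full.
            self._first_seen[digest] = record_id
            headers["WARC-Type"] = "response"
        else:
            # Identical payload already stored: point back to it instead.
            headers["WARC-Type"] = "revisit"
            headers["WARC-Refers-To"] = original_id
        return headers


# Placeholder record IDs, for illustration only (real IDs are usually UUID URNs).
detector = DuplicateDetector()
first = detector.headers_for("<urn:uuid:0001>", b"<html>same</html>")
second = detector.headers_for("<urn:uuid:0002>", b"<html>same</html>")
```

Here the second capture is described as a revisit that refers to the first record, so the identical payload does not need to be stored twice.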


Who uses WARCs?

  • The world’s largest memory institutions use WARCs, including the following: Internet Archive, Library of Congress, British Library, Bibliothèque nationale de France, California Digital Library, and many others. The standard is developed and maintained by the International Internet Preservation Consortium.

  • Hanzo Archives use WARCs as the primary storage and preservation medium for native format archive content, metadata, and derived data. This is because the ISO 28500 standard is open, stable, and maintained by experts with very long-term commitments, and because its design is exactly suited to our purpose: the preservation of archived web resources in their native format.

  • Other companies are beginning to adopt WARC. Of crucial importance, however, is the way the WARC standard is utilised.

Do you have a case study for WARC usage?

A financial institution that provides mutual funds investing in socially and environmentally responsible companies said it excludes any vendor that cannot store content in WARC format, giving two reasons:

  1. ISO 28500 WARC is the accepted standard for web archives. The company will not risk going with a proprietary or vendor-specific format. The company does not want questions from regulators about using a non-standard approach.

  2. The company’s current vendor, whose product is based on proprietary technology, is ceasing this line of business. This leaves the company with PDFs of its websites, but these offer limited utility, and the archive as a whole cannot be migrated to another system in the future.

This is an important lesson. Going down a non-standard route will lead to vendor lock-in and limited options for the future — a grave risk in a regulatory compliance environment.


So what is actually inside a WARC file?

  • A WARC file is a collection of records.

  • Each record relates to an element of a webpage.

  • Record types include request, response, and metadata.

  • Each record has a digital hash to show it has not been altered in any way.

  • The records are combined and compressed into a file.

  • A full archive of a site will comprise many WARC files.
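
The record structure described above can be sketched with the Python standard library. This is a minimal illustration, not Hanzo's implementation: the header names follow the WARC format, while the helper names (sha1_digest, build_record), the example URL, and the choice of SHA-1 with base32 encoding are assumptions made for the sketch.

```python
import base64
import gzip
import hashlib
import uuid
from datetime import datetime, timezone


def sha1_digest(data: bytes) -> str:
    """Return a labelled digest in the 'sha1:<base32>' form often seen in WARC files."""
    return "sha1:" + base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def build_record(record_type: str, target_uri: str, block: bytes, content_type: str) -> bytes:
    """Serialize one record: version line, named headers, blank line, block, two CRLFs."""
    headers = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("WARC-Block-Digest", sha1_digest(block)),  # lets a reader detect alteration
        ("Content-Type", content_type),
        ("Content-Length", str(len(block))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


# A captured HTTP response, stored byte-for-byte as the record block.
http_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
record = build_record(
    "response", "http://example.com/", http_response,
    "application/http; msgtype=response",
)

# Compress each record separately; concatenated gzip members form a .warc.gz file.
compressed = gzip.compress(record)
```

Compressing record by record, rather than the whole file at once, means a single record can later be read from a known byte offset without decompressing everything before it.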

Figure 1. How to use ISO 28500 WARC files

What can I do with a WARC File?

Crawler software allows you to capture web resources and store them in WARCs for preservation purposes.

  • The Hanzo crawler can capture modern websites and social media, including POSTs, Ajax, embedded rich media, and interactive pages. In a number of tests carried out within EU-supported projects, as well as in rigorous procurement tests, Hanzo’s crawler has consistently been the most comprehensive capture tool available.

Access software allows you to recreate the web content inside the WARC and view it as if you’re viewing the original page.

  • Hanzo access software provides access to WARCs for viewing, inspecting metadata for forensic examination, full text and field searching, and export to third party eDiscovery and review tools. Export targets include Relativity, Concordance, Recommind, Symantec Enterprise Vault and Clearwell, and EDRM, along with a range of other XML/CSV formats.


What open source tools can I use alongside WARCs?

  • WARC Tools: An open source package for reading and writing WARCs, developed by Hanzo as an IIPC project, and with support from Internet Archive.

  • Open Source Wayback: An open source package for viewing archived web content inside WARCs.

  • wget-warc: A simple open source tool for crawling websites and storing them in WARCs.

  • Heritrix: An open source archival crawler for collecting websites and storing them in WARCs.


How do the records get into the WARC?

The crawler requests each part of the webpage from a web server. It records everything it requests and receives into a WARC file.
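
A rough sketch of that recording step follows, assuming the raw bytes of the HTTP request and response are already in hand (no network access is attempted). The WARC-Concurrent-To header is part of the WARC format and links records made during the same exchange; the function names warc_record and record_exchange, and the example URL, are hypothetical.

```python
import uuid
from datetime import datetime, timezone


def warc_record(record_type, target_uri, block, extra=()):
    """Build one WARC record and return (record_id, serialized_bytes)."""
    record_id = f"<urn:uuid:{uuid.uuid4()}>"
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", record_id),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", f"application/http; msgtype={record_type}"),
        ("Content-Length", str(len(block))),
        *extra,
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    return record_id, head.encode("utf-8") + block + b"\r\n\r\n"


def record_exchange(url, raw_request, raw_response):
    """Store both sides of one HTTP exchange, linking the request to its response."""
    response_id, response_rec = warc_record("response", url, raw_response)
    _, request_rec = warc_record(
        "request", url, raw_request,
        extra=[("WARC-Concurrent-To", response_id)],
    )
    return response_id, request_rec + response_rec


raw_request = b"GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"
raw_response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hello</html>"
response_id, warc_bytes = record_exchange("http://example.com/", raw_request, raw_response)
```

Both the request and the response are written unaltered as record blocks, so what was asked for and what came back are preserved side by side.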

Figure 2. Crawling into ISO 28500 WARC files


What are WARC Tools?

WARC Tools help the web development community by providing an open source software library, a set of command line tools, web server plug-ins, and technical documentation for manipulation and management of web archive files, or WARC files. Here is a fuller description of WARC Tools.

  • A collection of tools to look at the data in WARC files and extract and repack the content stored in them.

  • An open source package for reading and writing WARCs.

  • Written by Hanzo as an IIPC project and with support from Internet Archive.

  • Used as a library of code for Hanzo and other archiving technologies.

  • Designed to showcase the ISO standard and to encourage the use of WARC files.

WARC Tools and wget-warc are used by The Archive Team.


How do WARC Tools work?

WARC Tools include commands that do the following:

  • Convert ARCs to WARCs

  • Fix and repair broken WARCs

  • Provide a human readable summary of WARC content

  • Unpack WARC content for delivery into a directory

  • Take requested WARC content and create a smaller WARC with that specific content

  • Create a machine readable index of WARC content

  • Check whether a WARC is compliant with ISO 28500
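
To give a feel for what a summary of WARC content involves, here is a hand-rolled Python sketch that walks the records in a gzip-compressed WARC and counts them by type. It is not the WARC Tools code and does not use its API; the function names iter_records and summarize are hypothetical, and the parser assumes well-formed records that each carry a Content-Length header.

```python
import gzip
import io
from collections import Counter


def iter_records(stream):
    """Yield (headers, block) for each WARC record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if line in (b"\r\n", b"\n"):
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header section
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        block = stream.read(int(headers["Content-Length"]))
        yield headers, block


def summarize(stream):
    """Count records by WARC-Type."""
    return Counter(h.get("WARC-Type", "unknown") for h, _ in iter_records(stream))


def _sample(record_type, block):
    """Build a tiny record for the demonstration below."""
    head = f"WARC/1.0\r\nWARC-Type: {record_type}\r\nContent-Length: {len(block)}\r\n\r\n"
    return head.encode("utf-8") + block + b"\r\n\r\n"


# Two separately compressed records, concatenated like a .warc.gz file.
archive = (gzip.compress(_sample("request", b"GET / HTTP/1.1\r\n\r\n"))
           + gzip.compress(_sample("response", b"HTTP/1.1 200 OK\r\n\r\nhi")))
summary = summarize(gzip.GzipFile(fileobj=io.BytesIO(archive)))
```

For real archives, an established library is a safer choice than a hand-written parser, since it will handle edge cases this sketch ignores.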



Are there alternatives?

  • Many archiving companies use proprietary formats for web archives, such as MIME HTML files. This can lead to vendor lock-in and is a dead-end from a preservation perspective.

  • Some companies capture limited data (such as text content) or make images of webpages, rather than capturing the native content. For all but the simplest websites, all of the user experience is lost by doing this.

  • Some older archiving tools write the original content directly to disk. This loses all the metadata that the WARC format uses to ensure the integrity and authenticity of the content.


How does my website fit in WARC files?

Imagine examining puzzle pieces and how they fit together.

  • Each page on your website consists of tens to hundreds of individual elements.

  • Our crawler interacts with your website, finds all the links and follows them to discover all the website elements.

  • Many HTTP protocol conversations occur.

  • The content (payload) and associated data are captured by the crawler. They are not altered by this process.

  • Our crawler stores all the content unaltered — all the puzzle pieces — including interactions, protocols, and payload, inside WARC records.
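
The link discovery step can be illustrated with a short Python sketch. It only looks at href and src attributes in static HTML, which is far simpler than what a production crawler handling Ajax and interactive pages must do; the class and function names (LinkExtractor, discover) and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from href and src attributes."""

    URL_ATTRS = {"href", "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.URL_ATTRS and value:
                # Resolve relative references against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def discover(base_url, html):
    """Return the URLs referenced by a page, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


page = '<link href="/s.css"><img src="i.png"><a href="b.html">b</a>'
found = discover("http://example.com/a/", page)
```

Each discovered URL would then be queued for its own request, and every resulting exchange stored as WARC records.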


How do I get my website out again?

  • All WARCs are indexed by date, URL, and time-stamp.

  • To browse Hanzo’s archive, you make a request by URL. This is because your website records are in native format.

  • The URL is looked up in the index.

  • Then the appropriate record(s) are located, unpacked, and displayed in your browser.

  • The user experience is a replica of the full original content and functionality of your website.
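
The lookup described above can be sketched as a small in-memory index keyed by URL and timestamp. This is an assumption-laden illustration rather than Hanzo's index design: the CaptureIndex class, the 14-digit timestamp strings, and the file names and offsets are all hypothetical.

```python
import bisect
from collections import defaultdict


class CaptureIndex:
    """Maps each URL to its captures, sorted by timestamp."""

    def __init__(self):
        # url -> sorted list of (timestamp, warc_file, byte_offset)
        self._by_url = defaultdict(list)

    def add(self, url, timestamp, warc_file, offset):
        bisect.insort(self._by_url[url], (timestamp, warc_file, offset))

    def lookup(self, url, timestamp):
        """Return the latest capture of url at or before timestamp, or None."""
        captures = self._by_url.get(url)
        if not captures:
            return None
        timestamps = [capture[0] for capture in captures]
        pos = bisect.bisect_right(timestamps, timestamp)
        return captures[pos - 1] if pos else None


# Timestamps are YYYYMMDDhhmmss strings, so string order matches time order.
index = CaptureIndex()
index.add("http://example.com/", "20150101000000", "a.warc.gz", 0)
index.add("http://example.com/", "20151130000000", "b.warc.gz", 512)
hit = index.lookup("http://example.com/", "20150601000000")
```

The returned file name and byte offset tell the access software where to seek, after which the single record at that position can be decompressed and served to the browser.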


In Summary: The Benefits of WARCs

  • High definition, legally defensible capture of web content.

  • Open standard means no vendor lock-in.

  • Open source tools mean you can work directly with the data should you wish to.

  • Native format archive means you get the full user experience of a website long after the original has changed or gone.


Next Steps

The best way to experience Hanzo Archives’ web and social media archiving solutions is through a live demonstration.

Contact us to arrange your one-on-one, online demonstration today.
