Hopping on Down an Internet Rabbit Trail To Capture Context: How Hanzo’s hops work when preserving a website

March 4, 2020

Today, a tremendous share of your organization’s client communication happens through the representations on your website and social media channels. If you’re subject to regulatory compliance requirements or are involved in potential litigation, then you’re obligated to preserve your online content. It’s simply the only way to demonstrate to a court or a regulatory body exactly what you said to a prospective client.

But it’s called “the web” for a reason: everything on the internet is intricately and extensively interconnected. Every site—even the plain-Jane main landing page for a Google search—has links to other content. Those connections may be as simple as a “contact us” bar at the bottom of the page or as complex as hundreds of links to related internal and external content.

We’ve all fallen down internet rabbit trails at some point, where you start out checking your online bank balance but then click on a tempting link. That link leads to another, then another, and the next thing you know, an hour has passed. The internet is infinite, not to mention infinitely distracting—but the majority of website preservation tools are missing out on most of what’s there.

If your website archiving method is only capturing the front page of your website or only capturing specific enumerated pages within your site, you’re not getting the valuable context that surrounds your content. And that context is what tells your company’s story, interpreting and explaining your messaging.

The solution is simple, but it’s far from easy: you need to capture not just a page but also every page that page links to. These links are what we at Hanzo call “hops.”



How We Define Hops


Hops are the light-green resinous flowers that give beer, especially IPAs, a characteristic bitter flavor. Oh, sorry, were we talking about web preservation?


When it comes to archiving websites, a “hop” is simply a link that we follow. Let’s say you’re trying to preserve the financial forecast information that you communicate to clients via your website. First, of course, you’ll want a dynamic native-format capture of your primary financial page. But that page includes dozens, if not hundreds, of other links. It links to your company’s homepage, to information about individual employees, to a customer service page, and to a variety of online tools that may be publicly accessible or hidden behind login screens. It probably also links to other websites, perhaps through stock tickers or market indices.

To get the complete picture of what your customers actually saw on your financial page, you’ll need to collect not only what was on that page but also what those customers saw on any page they reached by following a link from it. You may also want every page your customers could have reached from that second page, or from a third. That means following all those hops down all those exponentially increasing rabbit trails.
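To make the idea of counting hops concrete, here is a minimal sketch of a hop-limited traversal. It is an illustration only, not Hanzo’s crawler: the `LINKS` graph, the page paths, and the `pages_within_hops` function are all hypothetical, and a real capture would fetch live pages rather than read a dictionary.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "/forecast": ["/home", "/team", "/support"],
    "/home": ["/forecast", "/about"],
    "/team": ["/team/jane"],
    "/support": [],
    "/about": ["/careers"],
    "/team/jane": [],
    "/careers": [],
}


def pages_within_hops(start, max_hops, links=LINKS):
    """Return {page: hop_count} for every page reachable in at most max_hops links."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if hops[page] == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(page, []):
            if target not in hops:  # breadth-first, so the first visit is the shortest route
                hops[target] = hops[page] + 1
                queue.append(target)
    return hops
```

With this toy graph, one hop from `/forecast` reaches three more pages, two hops reach five, and `/careers` only comes into scope at three hops, which is why the hop limit has such a large effect on the size of an archive.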

Note that there are two types of hops: internal and external. Internal hops are links that lead to another page on the original website. If we’re preserving your entire website, we’re going to collect every page of it, so our scope will already include every internal hop.

External hops function exactly the same way, but they’re links to sites with a different domain name. If your blog post links to internal pages but also references a story on another website about what hops are and how they’re used in beer, collecting and preserving that page would require using an external hop.
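The internal/external distinction can be sketched in a few lines. This is an assumption-laden illustration: `classify_hop` is a hypothetical helper, the URLs are made up, and it compares exact hostnames, so a subdomain such as `blog.example.com` would count as external to `www.example.com`. A real scoping decision might treat those as the same site.

```python
from urllib.parse import urljoin, urlparse


def classify_hop(page_url, href):
    """Label a link as 'internal' or 'external' relative to the page it appears on."""
    target = urljoin(page_url, href)  # resolve relative links against the page URL
    page_host = urlparse(page_url).hostname
    target_host = urlparse(target).hostname
    # Assumption: "same site" means an exact hostname match.
    return "internal" if target_host == page_host else "external"
```

For example, a `/contact` link on a blog post resolves to the same host and is an internal hop, while a link to a beer article on another domain is an external hop.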



How Far Do We Hop?


If you’re preserving your website, how many hops do you need? How far does your archivist need to go down every possible rabbit trail?



You’re probably scratching your head, surmising that this is an unanswerable question. How big is a webpage? How big is the internet itself? There’s no single right answer to how far you should go in preserving your site. Rather, you have to look at your specific website and the specific purposes you have in preserving it to determine how far you want to go.

What sort of information are you after? Who needs that information and what will they need to do with it? What factors will that decision-maker consider in evaluating your content?

Say you’re a retailer under contract with a specific supplier. You agreed that you’d offer for sale every one of that supplier’s kitchen appliances, but the supplier is now claiming that you didn’t.

To prove that you did offer the entire universe of that supplier’s wares, you’ll want to capture all of your sales pages displaying those products. That means you want to capture the grid view displaying every type of toaster and blender you sell, showing the supplier’s products listed. You’ll also want to preserve each individual product page, with a full listing of prices and product descriptions, for each of the supplier’s products. You may want to capture internally linked pages about your sales policies, shipping prices, item availability, and product guarantees. When you offer sales or promotional discounts, you’ll need to capture not just the main display announcing the sale but also any fine print shown on that page or hidden away on another.

And what if the individual product pages link to external information about the manufacturer, including its own product descriptions and reviews, or to articles explaining why that supplier’s microwave is the best microwave ever? You may choose to capture all of those pages as well to demonstrate that you were doing more than offering the supplier’s products for sale: you were actively working to promote them.

This scoping effort—determining what website content should be preserved and how in-depth it should be—can be complex, but it’s the only way to ensure that you collect everything you need. Hanzo’s experts are used to customizing archives and tailoring hops to each particular client’s needs.

Taste-Testing—or Running QA Checks on—Your Hops


But setting up a website archive, even with our jointly determined best estimate about how many hops should be included, isn’t enough to ensure that a preservation effort is working optimally. At Hanzo, we’ve mastered native-format web preservation for dynamic, complex, unstructured online data—and added to it a human quality-assurance process.



We know that plenty of things can go wrong with scoping hops, especially when preserving information from essentially infinite social media sites like Facebook. Following a single Facebook or Twitter link could mean collecting every piece of data on a continuously updated, unending newsfeed.

Technology + Humanity = Peace of Mind


So, we have two primary methods for catching these types of errors and ensuring that our website archives capture everything our clients need to establish their case or defense or prove their compliance.

First, during initial scoping we use Google indexing to estimate how many links our preservation effort should collect. Between that and test crawls, we establish a baseline for what to expect and check each capture against it.

Second, we check by hand, with a live person, that our preservation and its hops are working. For small collections of 10 to 20,000 sites, we may check every single link. As the number of hops increases, and with it the number of links, we switch to a sample-based approach, checking a decreasing percentage of links as we move further from the original source.
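One way to picture a sample that thins out with distance is sketched below. Every number here is an assumption made for illustration: the post does not state actual percentages, and `sample_rate`, `pick_links_for_review`, the halving `decay`, and the `full_check_hops` cutoff are hypothetical names and values, not Hanzo’s process.

```python
import random


def sample_rate(hop, full_check_hops=1, decay=0.5):
    """Fraction of links to check by hand at a given hop distance (assumed values)."""
    if hop <= full_check_hops:
        return 1.0  # close to the source: check every link
    return decay ** (hop - full_check_hops)  # halve the rate with each further hop


def pick_links_for_review(links_by_hop, seed=0):
    """Choose a random sample of links at each hop distance for manual review."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    picked = {}
    for hop, links in links_by_hop.items():
        # Always review at least one link when any exist at this distance.
        count = max(1, round(len(links) * sample_rate(hop))) if links else 0
        picked[hop] = rng.sample(links, count)
    return picked
```

Under these assumed settings, eight links at two hops would yield four for review and eight links at three hops would yield two, so reviewer time stays concentrated near the original source.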

Together, these QA methods help us ensure that we’re capturing what we meant to and what our clients need.



Don’t let the context of your content get lost on an internet rabbit trail. When preserving your website for ediscovery or compliance, make sure your archive tool is capturing everything you need, both on your site and externally.

Is your archive tool not hopping as far as it should? Hanzo can help. Contact us.
