Cambridge Analytica Shook Up the Way We View Data Privacy—and Ediscovery Collections

November 7, 2018

After 2018, companies may never view online data collection in the same way again.

Just as the world was turning its attention to the importance of protecting individuals’ data privacy, Facebook stumbled into a series of data misuse scandals that highlighted the weaknesses in how many companies collect website and social media data. Facebook’s losses, however, may ultimately be a gain for ediscovery and regulatory compliance data collections, as they’ve turned a harsh spotlight on the shortcomings of a common collection method.


Facebook and the Terrible, Horrible, No Good, Very Bad Year

Facebook is not having a good year, to put it mildly.

In March 2018, The New York Times, The Guardian, and The Observer reported that the data-analysis firm Cambridge Analytica, along with its related company, Strategic Communication Laboratories, improperly exploited the personal data of at least 50 million Facebook users to influence voter behavior. (Facebook itself puts the number even higher, around 87 million.) That data was originally siphoned off by Aleksandr Kogan through an app that collected information with users’ consent. However, he shared the collected data with Cambridge Analytica without getting permission from the users, in violation of Facebook’s toothless policies.

The fact that Facebook allowed so much supposedly private data to be captured and improperly used to manipulate people in a presidential election “ignited a conflagration that threatens to engulf [Facebook’s] already tattered reputation.” And rightly so: Cambridge Analytica’s misuse of personal data amounted to deliberate psychological manipulation. The company explicitly sought to identify “people who were less resilient psychologically” so that it could “exacerbate [their] distrust and paranoia” leading up to the U.S. presidential election. Nor was any of this helped by Facebook chief executive Mark Zuckerberg’s five-day delay in publicly addressing—and apologizing for—the company’s failures.

Before the Cambridge Analytica news had even died down, a security researcher discovered another breach. This time, it turned out that a third-party app called Nametest had gathered the personal data of 120 million Facebook users and left it exposed for years. Again, Facebook dragged its feet, waiting eight days to even respond to the report.

More recently, Facebook experienced a third data breach that lasted at least 11 days and compromised at least 50 million accounts. What was stolen in that time, and what will be done with that information, remains to be seen. This latest breach, and Facebook’s slow response, has led some to suggest that the social media giant didn’t learn anything from Cambridge Analytica’s misuse of personal data.

But what Facebook has, or hasn’t, learned isn’t the whole story. Starting with the Cambridge Analytica scandal, the legal community has been compelled to take a harder look at how it collects online data—and that has turned out to be an undeniably good thing.


What Cambridge Analytica Revealed About Data Collection

Facebook has long allowed applications to collect data about users, as Kogan and Nametest both did. In fact, Facebook was careful to explain that Kogan, at least, obtained users’ profile data properly and that there was no actual security breach associated with that data.

This isn’t so unusual; most social media platforms allow some level of data collection and manipulation. Many, if not most, applications use APIs, or application programming interfaces, to allow others to access data and build plug-in applications within an existing platform. Unfortunately, while there were rules governing API use in Facebook, “there was also nothing stopping a developer from breaking those rules—and nothing Facebook could do to easily tell if the rules had been violated.” Therefore, APIs could be used to collect data for whatever purpose, good or bad.

On the good end of the spectrum, companies frequently use APIs to collect their own communication data for compliance monitoring and ediscovery data collections. After all, companies that have social media accounts may find that a substantial amount of their customer communications occur over those channels, which means those communications must be preserved for legal purposes. APIs offered an easy way to collect that data in a somewhat usable form. To be clear, the data that most companies need to collect for ediscovery and regulatory compliance is primarily about what they themselves say and do, and how they interact with customers and prospective customers.

But between Facebook’s security lapses and the effective date of the EU’s General Data Protection Regulation (GDPR), the spotlight on data privacy has intensified. With all of the recent attention on how companies gather, mine, and use personal data, many platforms have disabled or limited their API access to control external access to private user data. Facebook announced in April that it would limit the user data that developers could access; it then cut API access further in July. Instagram’s API access was similarly limited, as was Twitter’s.

With this deprecation limiting the ability to download data using APIs, what’s a company to do?


What API Deprecation Meant for Ediscovery

The upshot for corporate data collection is that these changes to API access have closed the back door that many companies used for legitimate data collection.

Of course, despite all of the #deletefacebook pressure, people largely haven’t stopped using social media. According to the Pew Research Center, two of every three people in the U.S. got at least some of their news through social media sites in 2017. And customers continue to research companies online, asking questions and seeking preliminary information. Because customers make so many of their inquiries through an online presence, including a social media presence, businesses cannot ignore those channels.

But API deprecation has driven businesses to investigate other methods, chief among them native-format collection and review. This method uses web crawlers to navigate online information and captures it in WARC (Web ARCHive) format—which, it turns out, is vastly superior to API collection.

Why? Uncontrolled data access wasn’t the only weakness of API collection. APIs have always offered a limited way to capture and save online data. When data is collected with an API, it is transformed into a different (and less useful) format. This capture loses all of the original context and any dynamic content, such as mouse-over capabilities or linked content. Further, API collections aren’t necessarily accurate and complete. Plus, as evidenced by their recent shutdown, APIs are entirely controlled by the external platform, leaving businesses that rely on them for data collection at the platform’s mercy.

Native format avoids each of those shortcomings.
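To make the loss of context concrete, here is a minimal Python sketch. Everything in it is hypothetical: `RAW_HTML` stands in for page markup as a crawler might capture it, and `API_RECORD` stands in for a flattened record of the kind an API export might return. Neither reflects any real platform's markup or API response. The sketch simply lists the links, hover text, and media references that exist in the page but do not appear anywhere in the flattened record.

```python
from html.parser import HTMLParser

# Hypothetical raw page markup, standing in for what a crawler would capture.
RAW_HTML = (
    '<div class="post">'
    '<p>Thanks for reaching out! See our '
    '<a href="https://example.com/returns" title="Return policy">policy</a>.</p>'
    '<video src="https://example.com/demo.mp4"></video>'
    '</div>'
)

# Hypothetical flattened record; not the response format of any real API.
API_RECORD = {"id": "123", "message": "Thanks for reaching out! See our policy."}


class ContextCollector(HTMLParser):
    """Collects link targets, hover (title) text, and media sources from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.tooltips = []
        self.media = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        if attr_map.get("title"):
            self.tooltips.append(attr_map["title"])
        if tag in ("video", "audio", "img") and attr_map.get("src"):
            self.media.append(attr_map["src"])


collector = ContextCollector()
collector.feed(RAW_HTML)
collector.close()

# Anything found in the page but absent from the flattened record is lost context.
flattened = " ".join(str(value) for value in API_RECORD.values())
page_context = collector.links + collector.tooltips + collector.media
missing = [item for item in page_context if item not in flattened]

for item in missing:
    print("Not in API record:", item)
```

In this toy case the message text survives, but the link target, the hover text, and the embedded video do not, which is the kind of gap the paragraph above describes.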


The Advantages of Native-Format Web Collection

Businesses using native-format collection methods can preserve the full dynamic context and content of a website. Imagine navigating an archived, safely preserved social media conversation or company website as if it were live, viewing all comments and reactions, clicking through associated links, sorting and manipulating results, playing back video, and freely exploring interactive components such as drop-down menus and content carousels. These collection and review capabilities have never been possible with APIs. And how better to demonstrate to a court or regulatory agency how your conversation with a customer unfolded than by showing how a user would have experienced your site in real time?

Native-format WARC files are ISO 28500-compliant, meeting the gold standard for web collections. They’re both platform agnostic, so they can be viewed from any system regardless of its hardware configuration, and future proof, ensuring that they’ll work no matter how much software transforms in the years to come.
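For readers curious about what a WARC file actually contains, the following Python sketch assembles a single WARC/1.0 `response` record by hand. It is an illustration of the record layout only, not a description of how Hanzo or any particular crawler works: the function name `build_warc_response_record`, the example URL, and the sample HTTP response are all assumptions made for this example. A real collection tool would also write request and metadata records, payload digests, and usually compressed output.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_response_record(target_uri, http_response, captured_at=None):
    """Return one WARC/1.0 'response' record as bytes.

    target_uri    -- the URL that was captured (str)
    http_response -- the raw HTTP response, status line and headers included (bytes)
    captured_at   -- timezone-aware capture time; defaults to the current UTC time
    """
    if captured_at is None:
        captured_at = datetime.now(timezone.utc)
    # WARC-Date is expressed in UTC.
    captured_at = captured_at.astimezone(timezone.utc)

    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", "<urn:uuid:{}>".format(uuid.uuid4())),
        ("WARC-Date", captured_at.strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        ("Content-Length", str(len(http_response))),
    ]
    header_block = b"WARC/1.0" + CRLF + b"".join(
        "{}: {}".format(name, value).encode("utf-8") + CRLF
        for name, value in headers
    )
    # Layout: header block, blank line, content block, then two CRLFs.
    return header_block + CRLF + http_response + CRLF + CRLF


# Hypothetical captured response; the bytes are stored exactly as received.
http_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><body><p>Hello, customer!</p></body></html>"
)

record = build_warc_response_record(
    "https://example.com/",
    http_response,
    captured_at=datetime(2018, 11, 7, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```

The point to notice is that the HTTP response is kept verbatim inside the record rather than being translated into another structure, which is what allows a captured page to be replayed later.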

While Facebook has undeniably had a bad year, the public attention it has brought to API collection has opened the door for a new—and much better—web preservation method. If you’re ready to learn more, contact Hanzo today.

Flash Survey: Social Media Collections

We'd like to hear from you about your experience with collecting from APIs. Please take this one-minute flash survey to help us identify the biggest pain points with social media collections; you can even earn a $10 Amazon gift card. We'll be happy to share the compiled results with you in the future.
