A Tech Intro to Data Portability

Ross Schulman

June 15, 2018

Recently, OTI hosted an event on online services and data portability. Keynoted by Rep. David N. Cicilline (D-R.I.), the event featured a panel discussion and rigorous debate about whether and to what extent people should be able to take their data out of one online service and upload it into another service of their choice. The conversation dove into some fairly technical details surrounding protocols, APIs, and data formats. Given the technical nature of many of these concepts, we felt it was important to make sure that everyone listening to the discussion was using the same basic definitions of these terms. For anyone interested in this issue who couldn’t make it to the panel, I’m reproducing the script and the slides here in this blog post. If you want to see the original talk, archived video of the whole event is available.

----

Data portability has been a big topic of discussion in the wake of the Cambridge Analytica scandal, with users asking whether they truly control their data or not. It also has been a hot topic because the new EU privacy law—the General Data Protection Regulation or GDPR—requires companies to offer data portability. But what is it? Well, to paraphrase the GDPR...

Data Portability is the ability of a user of an online service to extract an archive of the data they’ve provided to or stored with that service, in a structured, commonly used and machine-readable format, suitable for transfer to a different service of that person’s choosing.

Today, for example, Google Takeout gives you the ability to select which Google services you want to export data from, and lets you choose which format to receive them in. Here, I’m downloading my contacts database in the vCard format.

Twitter, meanwhile, gives you a full archive whenever you ask. The archive, shown in the lower screenshot, contains a convenient human-readable web page of all your exported tweets, as well as copies in two machine-readable formats, CSV and JSON.
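
To make the CSV and JSON point concrete, here is a minimal sketch in Python of the same records written out in both formats and read back. The records and field names are made up for illustration; they are not Twitter's actual archive schema.

```python
import csv
import io
import json

# Hypothetical records: the field names are illustrative only and are
# not taken from any real archive format.
tweets = [
    {"id": "1", "created_at": "2018-06-01", "text": "Hello, world"},
    {"id": "2", "created_at": "2018-06-02", "text": "Portability matters"},
]


def to_json(records):
    """Serialize the records as a JSON array of objects."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "created_at", "text"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of dictionaries."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


# Both copies carry the same information once parsed.
json_copy = json.loads(to_json(tweets))
csv_copy = from_csv(to_csv(tweets))
```

Either copy can be handed to another program without anyone retyping anything, which is the property that matters for portability.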

Facebook’s process is similar to Google’s. You can select what types of data you want to download and what format to receive them in.

It's worth noting that all of these processes, with the possible exception of Google’s, are today much more robust than they were six to eight weeks ago. The GDPR has obviously pushed companies to have a better approach to data portability.

Meanwhile, Google, Microsoft, and other contributors are working on an open source project called the Data Transfer Project that is trying to develop a simple common interface for moving files directly between services. For example, in this demo screenshot, a user is moving their photos directly from Google’s photo service to Microsoft’s photo storage service.

I’ve referenced a few times now that a key feature of effective portability is that the data be in a common machine-readable format, by which I mean...

A machine-readable format is a file format, preferably based on an open and widely used standard, that structures data in such a way as to be easily parsable and modifiable by a range of computer systems, thereby making it easy to move the data between different services. Common examples of widely used open standards for structuring data in a machine readable way include JSON and XML.
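
As a rough illustration of what "easily parsable" means in practice, the sketch below reads one contact record from JSON and from XML using only Python's standard library. The record and its field names are invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical contact, expressed in two machine-readable formats.
json_text = '{"name": "Ada Example", "email": "ada@example.org"}'
xml_text = (
    "<contact>"
    "<name>Ada Example</name>"
    "<email>ada@example.org</email>"
    "</contact>"
)


def contact_from_json(text):
    """Parse a JSON object into a plain dictionary."""
    data = json.loads(text)
    return {"name": data["name"], "email": data["email"]}


def contact_from_xml(text):
    """Parse a <contact> element into the same dictionary shape."""
    root = ET.fromstring(text)
    return {"name": root.findtext("name"), "email": root.findtext("email")}
```

Because both formats are openly documented, a receiving service needs only a few lines like these to take in the data, whichever format the exporting service chose.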

For example, until recently, Facebook’s download-your-data tool only allowed you to download your content in the form of HTML archives optimized for private viewing, rather than in a more structured format suitable for easier transfer to another service. But about a month ago, it also began offering downloads in JSON, presumably as part of its GDPR compliance. Here’s an excerpt from my downloaded data in JSON format:

It’s important to note that what I can download right now from Facebook is strictly my data, meaning the content that I have posted to Facebook. It doesn’t include, for example, photos or other posts in which my friends have tagged me, and it doesn’t include my friends’ contact information, which would let me easily reconnect with them on a different service. This isn’t just Facebook: Twitter similarly lets you export the tweets you authored, but not your mentions or lists of likes or retweets. There are privacy arguments for why some of this is the case, but it also raises competition concerns.

Back to machine-readable formats for a moment: Activity Streams is one example of an open standard for social media activity that uses JSON.

It defines a format for storing items such as posts, likes, comments, etc. in a stream similar to your Facebook feed or Twitter feed. This example shows a “follow”-type action; it represents the fact that Brian here followed Ken.
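
For readers without the slide in front of them, here is a hedged reconstruction of what such a "follow" record might look like, built with Activity Streams 2.0 vocabulary. It is not the exact content of the slide; the structure below is my assumption, and only the names Brian and Ken come from the talk.

```python
import json

# An illustrative "Follow" activity. The "@context" value is the
# Activity Streams 2.0 namespace; everything else is a simplified example.
follow_activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Follow",
    "actor": {"type": "Person", "name": "Brian"},
    "object": {"type": "Person", "name": "Ken"},
}

# Past-tense verbs for the few activity types this sketch knows about.
_VERBS = {"Follow": "followed", "Like": "liked"}


def describe(activity):
    """Turn an activity record into a short human-readable sentence."""
    actor = activity["actor"]["name"]
    target = activity["object"]["name"]
    verb = _VERBS.get(activity["type"], activity["type"].lower())
    return f"{actor} {verb} {target}"


# Round-trip through JSON text, as a receiving service would.
received = json.loads(json.dumps(follow_activity))
summary = describe(received)
```

Any service that understands the shared vocabulary can interpret the record the same way, without knowing which service produced it.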

We call Activity Streams an open standard because it was developed at the World Wide Web Consortium (W3C) and anyone can use it. So far, none of the major commercial social networks offer their downloads using this standard, even though several of them participated in developing it. Instead, it’s mostly used by open source decentralized alternatives like the social network software Mastodon, just one example of the kinds of alternatives that might be able to grow and compete if data portability were widespread.

You may be asking now what it means to describe an internet technology as decentralized.

Decentralized information technology is a technology that relies on open standards such that users can make use of the technology and communicate with others using the technology without having to rely on a single service provider. Email, the web, and (once upon a time) instant messaging are all decentralized technologies.

So, for example, both email and the world wide web are decentralized technologies based on open standards. Anyone can run an email server that talks to other email servers, or send and receive emails from someone using another email service. Anyone can run a web server that serves content to any web browser and can link to content on any other site. Similarly, anyone can run a Mastodon server that hosts social network users, and those servers can easily talk to other Mastodon servers, so users on one server can interact with users on another. In other words, decentralized technologies are easily interoperable.

Interoperability is the ability of different computer systems or software to exchange and make use of information across systems in an ongoing way.

If you think of portability as a one-time copying of all your data, think of interoperability as the ongoing ability to interact across services. In open systems, this is pretty straightforward: it’s very easy for one web site to pull data from another or link to another; it’s very easy to email across services; and back when much of our instant messaging activity was based on the open standard XMPP, it was very easy to chat across different chat services, including Google Chat and Microsoft Messenger, along with many other XMPP servers. However, when it comes to closed platforms like Facebook that have been built on top of the open system of the internet (what some folks might call walled gardens), interoperability, when it exists, is typically accomplished through what we call Application Programming Interfaces, or APIs.

APIs (Application Programming Interfaces) are interfaces between different software applications allowing them to talk to each other and exchange data in a specifically defined way.

APIs are defined ways in which one piece of software allows other programs to interact with it. They can be totally private or completely open, or anywhere in between. They’re often used in open systems with little restriction. One example is a weather API that takes in a zip code and produces a weather report.
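
The weather example can be sketched in a few lines of Python. This is a toy stand-in rather than any real weather service: the function name, the forecast data, and the shape of the response are all hypothetical.

```python
import json

# Hypothetical forecast data standing in for a real weather service.
_FORECASTS = {
    "20005": {"summary": "Sunny", "high_f": 84, "low_f": 67},
}


def get_weather(zip_code):
    """Return a JSON weather report for a five-digit ZIP code.

    The "defined way" of interacting is the contract: a ZIP code string
    goes in, and a JSON document with known fields comes out.
    """
    if not (len(zip_code) == 5 and zip_code.isdigit()):
        return json.dumps({"error": "zip_code must be five digits"})
    forecast = _FORECASTS.get(zip_code)
    if forecast is None:
        return json.dumps({"error": "no forecast available"})
    return json.dumps({"zip_code": zip_code, **forecast})


report = json.loads(get_weather("20005"))
```

A calling program never sees how the forecast is produced. It only needs to know what to send and what it will get back.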

However, they are also used to allow and regulate access to data and users in closed systems. Think of them as windows or doors into the walled garden. For example, many Google services have APIs that allow access to data including calendar items, but they are of course guarded behind authentication systems. Twitter’s API provides a means to search tweets, post new tweets, and manage advertising campaigns. And Facebook has APIs that Facebook apps and connected web sites use to access data about Facebook users. Indeed, it was Graph API v1.0 (the version of the Facebook platform API in use before 2015, which allowed apps to obtain data not only about their users but also about those users’ friends) that led to the Cambridge Analytica controversy in the first place.
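
To show what "guarded behind authentication" can look like, here is a minimal sketch of an API that checks an access token before returning calendar items. The token store, scope name, and function are all hypothetical and do not reflect how Google, Twitter, or Facebook actually implement access control.

```python
# Hypothetical in-memory data standing in for a real service's back end.
TOKENS = {
    "token-abc": {"user": "alice", "scopes": {"calendar.read"}},
    "token-xyz": {"user": "bob", "scopes": set()},
}
CALENDARS = {
    "alice": ["Panel on data portability"],
    "bob": ["Dentist appointment"],
}


class AccessDenied(Exception):
    """Raised when a request is not allowed through the door."""


def list_calendar_items(token, owner):
    """Return the owner's calendar items if the token permits it."""
    grant = TOKENS.get(token)
    if grant is None:
        raise AccessDenied("unknown token")
    if "calendar.read" not in grant["scopes"]:
        raise AccessDenied("token lacks the calendar.read scope")
    if grant["user"] != owner:
        # This sketch only lets a token reach its own user's data.
        raise AccessDenied("token does not cover this user's data")
    return list(CALENDARS.get(owner, []))
```

The last check is the design choice worth noticing: whether a token granted by one user can also reach data about other people is decided by how the API is written.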

All of these definitions lead back to a core question: how, if at all, can we ensure enough portability and interoperability to promote competition and innovation and avoid locking in the dominance of existing platforms, while also adequately protecting privacy?
