Privacy by Design
Meditating on what it would take to design information systems that gave third parties only the information they needed to operate.
I have been using CalyxOS on my phone for a few years now (because in my opinion a de-Googled, hardened Android variant is less of a liability than Apple), and while it manages to strike a decent balance between data protection and functionality, I have recently noticed a matter that, while not terribly consequential, still strikes me as a bit silly.
I should note that I’m not sure whether the behaviour I am about to describe comes from Calyx or from AOSP (the Android Open-Source Project). The exact attribution doesn’t really matter in this instance, beyond knowing where to send a patch.
In the last decade or so, attention turned to studies that showed some evidence that visible light at the higher end of the spectrum (that is to say, blue) was linked to disrupted sleep, and it became fashionable for software vendors to create functionality that at night would shift the hue of the display out of the blue region, such that blue moves to green, and white looks roughly amber. (Whether it genuinely helps with getting to sleep or not, it’s still a nice effect.) If given geographical coordinates, this clever software could calculate the appropriate time of day to enter and exit this state, by matching it to sunrise and sunset that day, wherever you were on the planet. Initially, with the pioneering app called F.lux, you would have to tell it where you were, because ordinary computers don’t (or at least didn’t) have built-in GPS.
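The underlying calculation isn’t exotic. Here is a rough Kotlin sketch using the textbook sunrise equation, not whatever f.lux or Android actually ships; the function name and simplifications are mine, and turning the result into clock times would additionally need the longitude, the time zone, and the equation of time.

```kotlin
import kotlin.math.PI
import kotlin.math.acos
import kotlin.math.cos
import kotlin.math.tan

// Rough illustration only: approximate hours of daylight for a given latitude
// and day of the year, via the standard sunrise equation.
fun approximateDaylightHours(latitudeDeg: Double, dayOfYear: Int): Double {
    // Approximate solar declination (degrees) for that day of the year.
    val declinationDeg = -23.44 * cos(2.0 * PI / 365.0 * (dayOfYear + 10))
    val lat = latitudeDeg * PI / 180.0
    val dec = declinationDeg * PI / 180.0
    // Hour angle at sunrise/sunset; clamping handles polar day and polar night.
    val cosHourAngle = (-tan(lat) * tan(dec)).coerceIn(-1.0, 1.0)
    val hourAngleDeg = acos(cosHourAngle) * 180.0 / PI
    // The sun sweeps 15 degrees of hour angle per hour, symmetric about solar noon.
    return 2.0 * hourAngleDeg / 15.0
}
```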
As the inevitable Sherlocking of useful functionality proceeds apace, we see it show up in operating systems, both on desktops and on smartphones, the latter of which do have built-in geolocation hardware as a rule. So why not make the screen-tinting gizmo glean your location in real time, and save you the step of telling it? Well, because access to different resources on a phone—be it data or sensor hardware—is gated through a rather prudent and sophisticated permission system, and furthermore you have the option to shut a lot of these resources right off.
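On the app side, that gate looks something like the following minimal sketch. These are the real Android permission constants and check call, but the function itself is mine and deliberately simplified: an app that hasn’t been granted coarse or fine location simply gets nothing from the location service.

```kotlin
import android.Manifest
import android.content.Context
import android.content.pm.PackageManager
import androidx.core.content.ContextCompat

// Before an app can ask the location service for anything, it has to hold at
// least one of these two permissions, and the user can revoke them (or shut
// location off globally) at any time.
fun hasAnyLocationPermission(context: Context): Boolean {
    val coarse = ContextCompat.checkSelfPermission(
        context, Manifest.permission.ACCESS_COARSE_LOCATION
    ) == PackageManager.PERMISSION_GRANTED
    val fine = ContextCompat.checkSelfPermission(
        context, Manifest.permission.ACCESS_FINE_LOCATION
    ) == PackageManager.PERMISSION_GRANTED
    return coarse || fine
}
```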
People may or may not be aware that while GPS proper is completely passive and therefore untraceable, it still takes a little while to get a fix. To help speed up the response time, the geolocation capability of a smartphone is augmented by actively querying the phone OS manufacturer over the internet. The net effect is you are leapfrogging over the mobile provider—who gleans roughly where you are as a byproduct of providing your phone service—to notify Google (or Apple, or any app vendor who is interested, which is lots of them), on an ongoing basis, exactly where you are at all times.
This is how I noticed something was up. Because I don’t need my phone to know my location except when I do, I tend to just leave location services off. What I noticed was that the phone’s screen-dimming feature would often be wrong: day mode at night, and night mode in the day, or just stuck on one or the other—whichever I had last manually set it to—for days at a time. What this implies to me is that it won’t change modes unless it has constant access to the geolocation service, because it does the right thing when I finally turn the service back on.
Now, I submit that any software functionality (short of pure scientific applications) that tries to track the position of the sun doesn’t need to do it with any more than to-the-minute accuracy, because a more precise value isn’t going to be very useful. On any day in any given location, you’re bound to have buildings, mountains, clouds, etc., that are going to cause the actual value on the ground to diverge a bit from the calculated one. For this specific application, you could probably even go chunkier than one minute, because it really doesn’t matter that your phone turns the blue channel back on exactly when the sun comes up.
Indeed, you probably want to pin that event to when you are supposed to get up, not to the sun, and when you get up has little to do with your geographic location.
Now, I am currently in Toronto, Canada, which is about 43 degrees north. The earth turns 15 arc-minutes of longitude (a quarter of a degree) per, uh, time-minute. The coordinates that the functionality actually needs to do its job—at this latitude, mind you; it will get tighter the farther you get from the equator—can be anywhere in an ellipse about 20 kilometres across. What that means is, as long as I’m within ten clicks of the centroid of the city of Toronto—which I am—those coordinates will be perfectly adequate to tune this part of the system’s behaviour to minute-granular accuracy.
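If you want to check that arithmetic, a couple of lines will do. This is a back-of-the-envelope sketch, nothing more; the 111 km figure is the rough length of one degree of longitude at the equator, and the function name is mine.

```kotlin
import kotlin.math.PI
import kotlin.math.cos

// The Earth rotates a quarter of a degree of longitude per minute of time, so
// one time-minute of sunrise/sunset error corresponds to an east-west span of
// roughly 111 km * 0.25 * cos(latitude).
fun eastWestSpanPerTimeMinuteKm(latitudeDeg: Double): Double {
    val kmPerDegreeOfLongitudeAtEquator = 111.32
    val degreesPerTimeMinute = 0.25
    return kmPerDegreeOfLongitudeAtEquator * degreesPerTimeMinute *
        cos(latitudeDeg * PI / 180.0)
}

fun main() {
    // Prints roughly 20.4 at Toronto's latitude, i.e. the ~20 km figure above.
    println(eastWestSpanPerTimeMinuteKm(43.0))
}
```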
It just occurred to me that if you were sufficiently far away from the equator, you’d probably want to give this thing artificial timing instructions for a good part of the year anyway.
So that settles the matter of what to tell this software when it asks, but recall that it refuses to work at all unless it has ongoing access to geolocation data. This I consider to be a straight-up design flaw. Unlike its third-party counterparts, this software is part of the OS, so it isn’t like it’s phoning that data home to the mothership. As such, it only needs my coordinates (or as discussed, anywhere within ten kilometres) whenever I make a significant move, and that only happens once in a while. Even if I do venture farther afield, most of the time I’m probably going to be back before any change in configuration would be meaningful. About the only time it is meaningful is when I get off a plane, at which point I only need to flip on my GPS (and only the GPS, not the active location querying) for just a moment, to update my coordinates to the nearest major landmark.
I’m sure this is a benign oversight on the part of whoever wrote this functionality, and it can be easily fixed: just store the last coordinates you saw and continue to use those until the user updates them. Don’t simply refuse to work because you can’t get updated data right now. If the information you get is inaccurate, that’s up to the user to fix, but you don’t need to go on strike. That’s just annoying.
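A hedged sketch of what that fix might look like follows. None of these names exist in AOSP, and the real night-light code is surely structured differently; the point is only to persist the last fix and fall back to it when a fresh one isn’t available, using nothing fancier than SharedPreferences.

```kotlin
import android.content.Context

// Illustrative only: persist whatever coordinates were last observed and keep
// scheduling the tint from those, rather than going on strike when the
// location service is off or has no fix.
data class Coordinates(val lat: Double, val lon: Double)

class LastKnownCoordinates(context: Context) {
    private val prefs = context.getSharedPreferences("night_light", Context.MODE_PRIVATE)

    fun save(c: Coordinates) {
        prefs.edit()
            .putLong("lat", c.lat.toRawBits())
            .putLong("lon", c.lon.toRawBits())
            .apply()
    }

    fun load(): Coordinates? {
        if (!prefs.contains("lat") || !prefs.contains("lon")) return null
        return Coordinates(
            Double.fromBits(prefs.getLong("lat", 0L)),
            Double.fromBits(prefs.getLong("lon", 0L))
        )
    }
}

// Prefer a fresh reading when one happens to be available; otherwise use the
// cached one instead of refusing to schedule the tint at all.
fun coordinatesForTintSchedule(fresh: Coordinates?, cache: LastKnownCoordinates): Coordinates? {
    if (fresh != null) cache.save(fresh)
    return fresh ?: cache.load()
}
```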
I think I’ll save the rant about how much of a pain it is to make minor contributions to big open-source projects for another missive: from downloading mountains of source code, to setting up a build environment I otherwise wouldn’t use, to poring over documentation I’d otherwise never read, to messing with programming languages I’d rather not touch, to round after round of compilation and debugging, to cajoling maintainers into merging my patches.
Assume Every App is Hostile
While I’m pretty confident the screen-tinting thing can be fixed locally (relative to itself), a more systemic approach would be to consider a regime of decoy data sources for untrustworthy apps—which is nearly all of them.
I should note that faking your location in particular is something that Calyx (at least) will let you do, but in my experience it’s annoying to use and not very well represented in the phone’s interface. That is, there’s only “on” and “off”; there’s no “tentatively on” or “on but paranoid about it”. (That’s an additional setting which, to Calyx’s credit, defaults to off, though it isn’t made clear that you’re making a privacy tradeoff.)
Likewise, you can control which apps get access to location services, and you can control whether or not the app in question is permitted the high-resolution signal, but you can’t explicitly set the radius of the low-resolution signal, say, to ten kilometres.
Even if, however, I could control my artificially-hobbled location signal on a per-app basis, I don’t think I’m comfortable allowing apps to take too many readings over time. The thinking here is that some downstream algorithm will draw some conclusion or other about any signal it has enough data for, even if that data is fake. So in addition to feeding the robots garbage, we also want to starve them. In this sense I’d like to be able to establish a quota so that a given app could only take a location reading, say, once a day, instead of however often it likes. What I’d want to do here is indicate to the app that the location service is off most of the time (irrespective of its actual status). From what I can tell, there are event listeners that notify an app when the location service is toggled, and that’s a global thing. Fooling the app—though this is just a barely-educated guess—would probably entail a major overhaul of Android’s internals.
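To make the quota idea concrete, here is a purely hypothetical sketch. Nothing like it exists in Android today, the names are all mine, and enforcing it for real would have to happen inside the OS’s location service rather than in code like this; the readReal hook just stands in for the platform’s actual location plumbing.

```kotlin
// Hypothetical per-app location quota: at most one real reading per period,
// with a stale value (or nothing) handed back the rest of the time.
data class Coordinates(val lat: Double, val lon: Double)

class QuotaedLocationSource(
    private val readReal: () -> Coordinates?,               // stand-in for the real service
    private val quotaMillis: Long = 24L * 60 * 60 * 1000    // one reading per day
) {
    private var lastReadAt = 0L
    private var lastValue: Coordinates? = null

    fun read(now: Long = System.currentTimeMillis()): Coordinates? {
        if (lastValue == null || now - lastReadAt >= quotaMillis) {
            lastReadAt = now
            lastValue = readReal()
        }
        return lastValue  // stale (or decoy) until the quota resets
    }
}
```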
My instinct is that overhauling the location service to assume apps are potentially hostile would be a non-trivial undertaking, and there are many other sensitive subsystems besides. WhatsApp, like every other Facebook property, ploughs a shitload of effort into gleaning every last binary digit of data it can from every orifice you let it get close to. One notable pathology is that it will no longer let you send a message unless you give it access to your contacts. (You can respond to messages other people send you, but you can’t send one fresh without acquiescing.) This makes sense, right? How else are you going to message people? Except WhatsApp keys off of phone numbers, and they removed the ability to just enter one by hand.
Now, of course WhatsApp is going to learn that Alice knows Bob if Alice uses it to send Bob a message: there’s no way for it to route the message otherwise. But, it is in no way essential to the process of Alice messaging Bob that WhatsApp be told that Alice also knows Charlie. Another way to say it is that WhatsApp holds its basic functionality for ransom, payable in data: in order to send a message, you have to submit to letting it scrape your entire Rolodex. This deserves an intervention.
What this situation needs is a way to feed WhatsApp a decoy contact list that only contains the people (i.e., the ones who refuse to use Signal) you talk to on WhatsApp. Again, there is sort of a way to do this on Android (you enable the “work profile”) but it’s clunky and not conducive to a smooth experience.
This may all be for naught, because you can bet your contacts permitted WhatsApp to read their contact lists, which necessarily contain you. Also worth noting: WhatsApp used to deny signups from bogus Twilio numbers, but they have since relented, which implies to me that if they didn’t just stop caring, they are getting access to the real underlying phone number some other way.
Again, even though smartphones’ app permission systems are impressively fine-grained, there is still no way to dictate what an app does with the resource to which it is granted permission. Permission to see your location or contacts plus permission to access the network equals permission to smuggle this information offsite. Both of these parcels of information are small (location being about as small as you can get) and thus can be swept away in milliseconds, are effectively non-repudiable, and impossible to claw back. Whatever the app vendors want with this data, they have it in their possession long before you can assess the consequences. Standard operating procedure should be to treat every mobile app as hostile, and only feed it the bare minimum information it absolutely needs to carry out its function.
You’re Just a Luddite
My motivation for being a stickler about data privacy is not paranoia per se, although let’s consider the risks:
1. The vendor abuses your data, including selling your data to other entities who abuse your data, or
2. the vendor gets hacked, and the hacker definitely abuses your data.
I am less personally interested in scenario #1, because I am less personally interesting, in some measure due to my ongoing prophylactic stance. Furthermore, sold data tends to be aggregated and thus nominally anonymized, but can be selected by buyers in fine enough grain so as to make the anonymization effectively nil. (We see this in the geolocation data around red-state abortion clinics, although, again, that is something that does not personally concern me.)
The whole premise of “anonymized” data is spurious; people deanonymize data for sport. Also when I say “does not personally concern me”, I mean that primarily as a person who is neither a citizen nor resident of the United States. There are enough Americans in the world who are not only more directly affected by that (rather serious) state of affairs, but who also, unlike me, can actually do something about it. For now I’m focusing my attention on the problems with data privacy that hold no matter what jurisdiction you’re in.
What is more worrisome, at least to me, is that while these entities aggregate the data they sell, they absolutely keep everything they know about you, personally, in full detail, in perpetuity. This is a terrifically attractive target not only to hackers, but also to governments, who can use this arrangement to conveniently circumvent any number of civil rights protections, often just for the asking.
Leaving aside the fact that there is a thick dossier on each of us, detailing our most intimate behaviours, out there on a server somewhere just waiting to be abused, my main motivation around this issue is a lot more mundane. Namely, it isn’t clear to me what my data is worth. Aside from hedging against some catastrophic event in the arbitrarily far future that hinged on me at one point clicking Accept, I can’t tell what the cash equivalent is of what I’m getting. I have no way of knowing if I’m getting a good deal or not.
We can think of user data as belonging to two classes:
1. The data that is strictly necessary to carry out the service or is otherwise a byproduct of using it, and
2. everything else.
The first problem is that these two classes get mixed together. You install an app on your phone, and it has free rein to do whatever it wants within its own confines, plus whatever access you grant it outside those confines. This includes coercing you into information-sharing relationships with third (fourth, fifth, sixth…) parties: job prospects, law enforcement, insurance companies, financial services, whatever. The question is what, down to the individual bit, do they absolutely require access to in order to discharge their service, and what happens if any of those third parties get access to it as well?
The answer in a lot of cases is a minuscule chance of an infinitely bad outcome, which can’t (in good faith) be priced. You have to look at the individual elements of content and then imagine what kind of liability those amount to when a) put together and b) aggregated over time. Are you denied a loan because of it? A job? Admission to a school? Insurance? Are you stalked? Imprisoned? Assassinated? Or, are you robbed, defrauded, or otherwise impersonated, and crimes committed in your guise?
Or, is the system working as intended, subverting you into buying things you don’t need (potentially at inflated prices calibrated to your income) and distracting you from the things you care about?
This, again, is what messes with me: I don’t know how to price that. If I bracket out the “minute chance of infinitely bad” stuff, to which the only viable response is don’t give them the data at all, I still have to be able to assess—denominated in dollars—what is the worst thing that can happen if I lose control of this piece of information? How much time/money/lost opportunities/life energy is it going to cost me to repair the damage? This, in turn, hinges on being able to partition the data that is strictly necessary to conduct the service from everything else. Why? Because that’s at least a point of departure for some modicum of public discourse.
It’s important to distinguish between these classes of data because the strictly necessary stuff is associated with my benefit, whereas everything else is strictly for their benefit and can therefore be counted—to the extent that it can be counted—as a pure cost to me.
Still, it’s difficult to imagine entire categories of harms stemming from having data “out there” until somebody actually does something harmful with it. The harms nevertheless always take the same shape: some entity knows something about you, or thinks it knows something about you, and because of this, takes some action that causes you harm. Moreover, you typically only learn about this after the fact. The probabilities and magnitudes of individual harms are all over the place, which means the aggregate impact is impossible to quantify. So how about a different question: Who benefits?
Even this isn’t totally clear. The two ultimate purposes for collecting all this data are marketing (a large subset of which is advertising) and risk management. It is to get you into an economic relationship, or keep you out of one. How (or if) the ultimate consumers of this information benefit is again all over the map, so we can only reasonably look at the proximate ones—the tech platforms themselves.
Consider the process by which these entities conduct their business:
- They make a product 💸
- You use the product 💸
- They stockpile your data 💸
- They sell your data 💰
- They also sell ads 💰
One might be inclined to annotate this list with the cliché “if you aren’t paying you are the product”, but as with various so-called “smart” devices, you are paying and you are the product. So if you want, you can imagine splicing in “you pay for the product 💰” between the first and second bullet.
Developing the product and operating it are pure-cost activities, as is developing the sensorium and storage facility for harvesting your data. The net (again, proximate) value of your data is therefore (your contribution to) all that cost subtracted from what they earn from it. Thing is, even if you can access the company’s financial statements (i.e., it’s on the stock market), there’s no requirement for them to break it down this way. Selling data will just be lumped into “other activities” or something. But that’s beside the point: so many of these companies are in the red and will probably stay there until they disappear in a puff of smoke. The only people who benefit from this activity, then, are the ones who can liquidate their stock before that happens.
Since risk management is mainly a secondary market (modulo things like insurers who make you put a spy dongle in your car), and marketing minus advertising is going to involve tricky things like dynamic pricing (née price discrimination, something only sophisticated entities can take advantage of), let’s focus on advertising itself. Advertisers ultimately pay more for more precise targeting—that’s what drove all this data collection in the first place. The received wisdom goes that you don’t want to waste money advertising to the wrong people, so the more precisely you can target the ads, the better. But is this actually the case? Tim Hwang wrote an entire (short and compelling) book about how it might not be. One reason I recall is that what counts as a billable unit is incredibly tenuous (it reduces to measures like how many pixels of the ad were on the user’s screen for how many milliseconds); another is that online advertising is teeming with fraud. This leaves aside the issue of potential errors in the targeting itself. In other words, once you isolate the actual humans who actually see your message and actually act on it out of the total number of ad impressions you pay for, all that pricey targeting could turn out to be a wash.
I can’t remember if platforms across the board charge more for narrower targeting (probably?) but this is a free newsletter and I don’t feel like exhaustively researching it. I can say, deductively, that untargeted ads are going to be expensive, in part due to waste and in part because they compete with the targeted ones. That is, you have to buy that much more bulk to get the same coverage, a cost that targeting naturally eliminates. Sharper targeting is also going to dramatically shrink the addressable inventory, so there is naturally going to be a Laffer-curve-like dynamic in there that limits how much you spend. So there is definitely value in some targeting, but it isn’t a self-evident proposition that more precise targeting is automatically better.
The most recent development in data monetization is artificial intelligence: businesses collect oceans of data, burn a bunch of power training models, and then sell access to the models they train. I read this as just them figuring out another way to make money off a resource that is already in their possession. In fact, I’ve written before about how AI hype is a case of the tail wagging the dog—these businesses have data at their disposal, and they have compute that is otherwise under-exploited, so this is just another way to turn that into money. This is why I think a lot of artificial intelligence is rightly characterized as a solution looking for a problem: the models are trained on whatever data happens to be lying around.
This is perhaps an opportunity to conclude on why I am disinclined to participate in what gets called “surveillance capitalism”: the goods aren’t especially good (at least for users; your outlook might be different if you are a non-bagholding shareholder) and the bads are pretty bad. I understand why people say “data is the new oil” on account of it being mined and refined, and I can doubly understand why people like Cory Doctorow say it’s actually like the new plutonium: extremely hazardous except under narrow circumstances, which themselves, if they benefit anybody, only benefit a few people at others’ expense.