How Anonymous Is Your Data?

So, Should You Be Worried That We're on a Fast Track to Mass Privacy Invasions?


One of the biggest big-data challenges for marketers is how to take the vast amounts of customer information accumulated in the offline world and translate it into bits and bytes for use in the world of online advertising. This process of CRM retargeting, as it's sometimes called, marries the age-old practice of customer-relationship management with the new and sometimes creepy technique of retargeting, best known as the process by which ads for things you thought about buying chase you around the web.  

In the CRM version, instead of your browsing history shaping your online-ad experience, it's your purchase history and other business-critical information collected by a particular company that's doing the work. This is a powerful -- and touchy -- business.

Industry standards require marketers to be very diligent about scrubbing the data used in online ads free of personally identifiable information. But Google "data anonymity" and you'll find a vast landscape of skepticism about just how effective data-anonymization practices are. Much of the suspicion comes from an academic community that has long been pondering the feasibility of data anonymity. A 2010 paper called "Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization" concluded that the faith put in anonymization practices was overstated. Its author, Paul Ohm, now works for the Federal Trade Commission, the government agency that late last year asked nine data brokers for more information on what data they're collecting and what they do with it. It could be the prelude to the sort of legal restrictions the online ad industry is trying to fend off through its own self-regulatory efforts, run by bodies like the Network Advertising Initiative and Digital Advertising Alliance.

So, should you be worried that we're on a fast track to mass privacy invasions? It's a reasonable question given the steady flow of headlines about data breaches and hacks, proof that corporate handling of consumer data is far from a zero-risk endeavor. The answer, after surveying the technical processes, the industry practices and the regulations controlling them: not especially.

What a data firm is allowed to do with a particular consumer's data is limited by whether or not the consumer has a relationship with the brand, said Matt Mobley, chief marketing technology officer at Merkle. A relationship in this context is typically defined as having made a transaction. For consumers "in the wild," as Mr. Mobley described mere prospects who have no such relationship, "PII and non-PII will never go together." Prospect data is kept on a separate set of servers, in a different geographic location, from the data on customers who have brand relationships.

For customers who have that relationship, Mr. Mobley said, marketers are "trying to create a map of the identity. They are trying to construct the set of events a consumer will have with a brand, so they can then plan their marketing around it."

This often begins with the use of a third-party data onboarder like LiveRamp, whose role is to create matches between online and offline customer profiles. LiveRamp works with hundreds of websites that require login information -- dating sites, social networks and e-commerce sites -- which send it hashes of that information to be matched against CRM data. Imagine this scenario: When Joe Smith logs into Desperate Hearts, an anonymized hash of his email address is passed to LiveRamp's system, which might find that same hash amid CRM data from First National Savings Trust Bank and conclude that it makes sense to put Joe in a segment that's receiving a credit-card offer from that bank. An anonymous cookie is created and shared with the platform that's activating the data. The industry average for matches is 30% to 40% of a CRM file, according to a paper from LiveRamp and BlueKai.
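Here's a minimal sketch of how that hash-based matching might work. Everything in it -- the addresses, the segment names, the choice of SHA-256 -- is invented for illustration; LiveRamp's actual hashing scheme and matching logic aren't public.

```python
import hashlib

def hash_email(email: str) -> str:
    """Normalize and hash an email so the raw address never leaves its source."""
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

# Hashes a login site might send to an onboarder (addresses invented).
site_hashes = {hash_email("joe.smith@example.com")}

# A bank's CRM file, hashed the same way before any comparison happens.
crm_segments = {
    hash_email("joe.smith@example.com"): "credit-card-offer",
    hash_email("jane.doe@example.com"): "mortgage-refi",
}

# Matching is hash-to-hash; neither side sees the other's raw PII.
for h in site_hashes:
    if h in crm_segments:
        print(f"match -> assign anonymous cookie to segment: {crm_segments[h]}")
```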

Travis May, VP-product at LiveRamp, pointed to what information the company does not use as it creates these matches. The list of no-nos includes online behavioral data describing what kinds of websites a user visits; all such data from publishers is discarded. Nor does LiveRamp store sensitive data, defined by the NAI as Social Security numbers, financial- and insurance-account numbers, precise and real-time geographic information, and medical and health history.
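A toy version of that kind of policy filter might look like the following; the field names and block lists are invented for illustration and aren't LiveRamp's actual schema.

```python
# Block lists modeled loosely on the NAI's sensitive-data definition;
# all field names are invented for this example.
SENSITIVE_FIELDS = {"ssn", "account_number", "precise_geo", "medical_history"}
BEHAVIORAL_FIELDS = {"sites_visited", "browsing_history"}

def sanitize(record: dict) -> dict:
    """Drop sensitive and behavioral fields before a record enters matching."""
    blocked = SENSITIVE_FIELDS | BEHAVIORAL_FIELDS
    return {field: value for field, value in record.items()
            if field not in blocked}

raw = {"email_hash": "9f86d081...", "ssn": "000-00-0000",
       "sites_visited": ["example.com"], "segment": "credit-card-offer"}
print(sanitize(raw))  # only email_hash and segment survive
```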

"One of the key privacy aspects of onboarding is that there is a one-way flow of data: We take offline data and enable it to flow online," said Mr. May. "What we don't do -- and no one that I know of does -- is the opposite, which is to take what people are doing online and tie that to someone's name and address."

This is the fear of reidentification that was captured effectively in the case studies presented in Mr. Ohm's paper. In 2006, Netflix sought ways to improve its recommendation engine by offering anyone the chance to play with its data. But in making 100 million or so records public, it allowed researchers to begin to understand the privacy implications of these massive data dumps. A pair from the University of Texas found that knowledge of the precise ratings given to six obscure movies yielded an 84% chance of reidentification of the person. Knowledge of when a person rated any six movies, regardless of their obscurity, yielded a 99% chance. The same year, Mr. Ohm noted, AOL released a bunch of supposedly anonymized search queries that ended up being anything but nameless when The New York Times was able to tie real people to their actual queries.
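To make the linkage idea concrete, here's a toy sketch with invented ratings. The University of Texas attack was statistical and tolerant of noisy, approximate matches; this exact-match version only illustrates the principle that a few rare data points can act as a fingerprint.

```python
# An "anonymized" ratings table: names removed, records intact (data invented).
anonymized_db = {
    "user_1047": {"Obscure Film A": 5, "Obscure Film B": 2, "Obscure Film C": 4},
    "user_2210": {"Blockbuster X": 3, "Obscure Film B": 5},
}

# What an attacker knows about a target from a public source, e.g. IMDb reviews.
known_ratings = {"Obscure Film A": 5, "Obscure Film C": 4}

# Any record consistent with the known ratings is a re-identification candidate.
for user_id, ratings in anonymized_db.items():
    if all(ratings.get(title) == score for title, score in known_ratings.items()):
        print(f"candidate re-identification: {user_id}")
```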

There are harsh lessons here about the risks of these big-data crowdsourcing experiments -- which, to be clear, are not something most companies in the online-data space engage in. In fact, quite the opposite.

Like LiveRamp, Epsilon emphasizes the one-way flow of data and employs layers of security to ensure anonymity. In working with third parties, Epsilon -- the data giant that endured a 2011 breach of non-PII data -- uses a one-way algorithm to encode the match keys that associate PII with cookies. "Only the partner tokens and the coded client data are passed on to the onboarding partners," wrote a spokeswoman in an email. "The client data does not leave the walls of Epsilon."
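Epsilon hasn't disclosed its algorithm, but a standard way to build that kind of one-way encoding is a keyed hash (HMAC), where the secret key never leaves the data holder. A minimal sketch, with the key and customer ID invented:

```python
import hashlib
import hmac

# The secret key stays inside the data holder's walls; partners see only
# the resulting tokens, which can't be reversed to recover the match key.
SECRET_KEY = b"internal-secret-never-shared"  # invented for the example

def encode_match_key(match_key: str) -> str:
    """One-way encode a match key (e.g., a customer ID) before sharing."""
    return hmac.new(SECRET_KEY, match_key.encode(), hashlib.sha256).hexdigest()

token = encode_match_key("customer-000123")
print(token)  # safe to pass to an onboarding partner alongside a cookie ID
```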

Another layer of Epsilon security is the obfuscation of segments that are passed through the online ad ecosystem's many intermediaries. Said the spokeswoman, "We want to make sure they don't get access to the data by distributing a recognizable data attribute, so we use meaningless names and values in the whole onboarding and online distribution process."
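In code, that obfuscation can be as simple as publishing opaque identifiers and keeping the lookup table private. A sketch, with segment names invented:

```python
import uuid

# Internal, human-readable segments (names invented for the example).
segments = ["high-balance-checking", "auto-loan-prospect"]

# Publish only opaque labels; the mapping back to real names stays private.
opaque_to_segment = {uuid.uuid4().hex: name for name in segments}

for opaque_id in opaque_to_segment:
    print(opaque_id)  # what intermediaries see: no recognizable data attribute
```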

So how well does all this work? It might be easier to tell once the FTC finishes culling information from the data brokers. In the meantime, we have anecdotal and self-reported evidence. For instance, the NAI's 2012 compliance report found its members largely compliant with rules against collecting PII or sensitive data for advertising purposes.

"Nothing is fail-safe, but they're pretty darn effective," said one industry executive. "The issue is not so much whether a researcher at Stanford who has access to hardcore computing power can figure out how to correlate unrelated data points and get back to you. What matters is: Could a business deanonymize this data at scale and use it to target directly to you knowing it's you and use it for things beyond just marketing -- in particular, eligibility" for financial services or health insurance?

Reverse-engineering PII is the Orwellian nightmare vision. Imagine the health insurer denying coverage or charging higher premiums based on a user's online bacon obsession that foretells future clogged arteries. That's the real risk the online-data community seems to understand, even if it might not make economic sense as a business strategy.

"It's very, very, very hard to track data back to you in any scalable way," said the industry executive who wants to remain anonymous. "Academics and anti-advertising advocates can spend three months with a team of grad students unpacking this stuff and the cost of identifying a single individual could be $100,000. That'll never scale."
