Industry standards require marketers to be very diligent about
scrubbing the data used in online ads free of personally
identifiable information. But Google "data anonymity" and you'll
find a vast landscape of skepticism about just how effective
data-anonymization practices are. Much of the suspicion comes from
an academic community that has long been pondering the feasibility
of data anonymity. A 2010 paper called "Broken Promises of Privacy:
Responding to the Surprising Failure of Anonymization" concluded
that the faith put in anonymization practices was overstated. Its
author, Paul Ohm, now works for the Federal Trade Commission, the
government agency that late last year asked nine data brokers for
more information on what data they're collecting and what they do
with it. It could be the prelude to the sort of legal restrictions
the online ad industry is trying to fend off through its own
self-regulatory efforts, run by bodies like the Network Advertising
Initiative and Digital Advertising Alliance.
So, should you be worried that we're on a fast track to mass
invasions of privacy? It's a reasonable question given the steady flow of
headlines about data breaches and hacks, proof that corporate
handling of consumer data is far from a zero-risk endeavor. The
answer, after surveying the technical processes and practices and
the regulations controlling them: not especially.
What a data firm is allowed to do with a particular consumer's
data is limited by whether or not the consumer has a relationship
with the brand, said Matt Mobley, chief marketing technology
officer at Merkle.
Relationship in this context is typically defined as having made a
transaction. For "consumers in the wild," as Mr. Mobley described
those who have no relationship with a brand and are mere prospects,
"PII and non-PII will never go together." Prospect data are kept on a
separate set of servers, in a different geographic location, from the
data on customers who do have brand relationships.
For customers who do have that relationship, Mr. Mobley said,
marketers are "trying to create a map of the identity. They are
trying to construct the set of events a consumer will have with a
brand, so they can then plan their marketing around it."
This often begins with the use of a third-party data onboarder
like LiveRamp, whose role is to create matches between online and
offline customer profiles. LiveRamp works with hundreds of websites
that require login information -- dating sites, social
networks and e-commerce sites -- which send it hashes of that
information, which are then matched against CRM data. Imagine this
scenario: When Joe Smith logs into Desperate Hearts, an anonymized
version of his email address is passed to LiveRamp's system, which
might find that same hashed address amid CRM data from First National
Savings Trust Bank and conclude that it makes sense to put Joe in a
segment that's receiving a credit-card offer from that bank. An
anonymous cookie is then created and shared with the platform that's
activating the data. The industry average for matches is 30% to 40% of a CRM
file, according to a paper from LiveRamp and BlueKai.
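In rough terms, the matching step works something like the sketch below, where the email address, bank and segment names are invented for illustration and SHA-256 stands in for whatever hashing scheme a given onboarder actually uses:

```python
import hashlib

def hash_email(email: str) -> str:
    """Normalize and hash an email so the raw address never leaves the site."""
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# A CRM file from the bank, hashed the same way ahead of time (invented data).
crm_segments = {
    hash_email("joe.smith@example.com"): "first-national-credit-card-offer",
}

# The login site sends only the hash of Joe's email, never the address itself.
login_hash = hash_email("Joe.Smith@example.com")

segment = crm_segments.get(login_hash)
if segment is not None:
    # Match found: an anonymous cookie tied to the segment -- not to Joe's
    # name or email -- is what gets handed to the activation platform.
    print("assign anonymous cookie for segment:", segment)
```

The point of the design is that both sides hash before anything is exchanged, so the match happens on tokens rather than on names or addresses.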
Travis May, VP-product at LiveRamp, pointed to the information the
company does not use as it creates these matches. The list of
no-nos includes online behavioral data describing what kinds of
websites a user visits; all such data from publishers are
discarded. Nor does LiveRamp store sensitive data, defined by
the NAI as Social Security numbers; financial- and insurance-account
numbers; precise, real-time geographic information; and medical
and health history.
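A simplified illustration of that kind of filtering, with invented field names standing in for the NAI's sensitive categories:

```python
# Invented field names standing in for the NAI-defined sensitive categories.
SENSITIVE_FIELDS = {
    "ssn", "financial_account", "insurance_account",
    "precise_geo", "medical_history",
}

def scrub_record(record: dict) -> dict:
    """Drop sensitive fields before a profile record is stored."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}

raw = {"email_hash": "ab12cd34", "ssn": "123-45-6789", "segment": "travel"}
print(scrub_record(raw))  # {'email_hash': 'ab12cd34', 'segment': 'travel'}
```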
"One of the key privacy aspects of onboarding is that there is a
one-way flow of data: We take offline data and enable it to flow
online," said Mr. May. "What we don't do -- and no one that I know
of does -- is the opposite, which is to take what people are doing
online and tie that to someone's name and address."
This is the fear of reidentification that was captured
effectively in the case studies presented in Mr. Ohm's paper. In
2006, Netflix sought ways to improve its recommendation engine by
offering anyone the chance to play with its data. But in making 100
million or so records public, it allowed researchers to begin to
understand the privacy implications of these massive data dumps. A
pair of researchers from the University of Texas found that knowing the
precise ratings given to six obscure movies yielded an 84% chance
of reidentifying the person; knowing when a person rated
any six movies, regardless of their obscurity, raised the odds to
99%. The same year, Mr. Ohm noted, AOL released a batch of
supposedly anonymized search queries that ended up being anything
but nameless when The New York Times was able to tie real people to
their actual queries.
There are harsh lessons here about the risks of these
big-data crowdsourcing experiments, but publicly releasing data sets
is not something most companies in the online-data space engage in.
In fact, it's much the opposite.
Like LiveRamp, Epsilon emphasizes
the one-way flow of data and employs layers of security to ensure
anonymity. In working with third parties, Epsilon, the data giant
that endured a 2011 breach of non-PII, uses a one-way algorithm
that encodes the match keys used to associate PII with
cookies. "Only the partner tokens and the coded client data are
passed on to the onboarding partners," wrote a spokeswoman in an
email. "The client data does not leave the walls of Epsilon."
Another layer of Epsilon security is the obfuscation of segments
that are passed through the online ad ecosystem's many
intermediaries. Said the spokeswoman, "We want to make sure they
don't get access to the data by distributing a recognizable data
attribute, so we use meaningless names and values in the whole
onboarding and online distribution process."
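A bare-bones sketch of that kind of obfuscation, with an invented segment name; the real mapping scheme is Epsilon's own:

```python
import secrets

# The lookup table stays with the data owner; intermediaries only ever
# see the opaque token on the right-hand side.
segment_tokens = {}

def obfuscate(segment_name: str) -> str:
    """Swap a recognizable segment name for a meaningless identifier."""
    if segment_name not in segment_tokens:
        segment_tokens[segment_name] = secrets.token_hex(8)
    return segment_tokens[segment_name]

print(obfuscate("high-income-card-prospects"))  # e.g. '9f2c4e1a7b3d5e60'
```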
So how well does this all work? It might be easier to tell once
the FTC does its information-culling. In the meantime, we have
anecdotal and self-reported evidence. For instance, the NAI's 2012
compliance report found that its members are good about not
collecting PII or sensitive data for advertising purposes.
"Nothing is fail-safe, but they're pretty darn effective," said
one industry executive. "The issue is not so much whether a
researcher at Stanford who has access to hardcore computing power
can figure out how to correlate unrelated data points and get back
to you. What matters is: Could a business deanonymize this data at
scale and use it to target directly to you knowing it's you and use
it for things beyond just marketing -- in particular, eligibility"
for financial services or health insurance?
Reverse-engineering PII is the Orwellian nightmare vision.
Imagine the health insurer denying coverage or charging higher
premiums based on a user's online bacon obsession that foretells
future clogged arteries. That's the real risk that seems to be
understood by the online-data community, even if it might not make
economic sense as a business strategy.
"It's very, very, very hard to track data back to you in any
scalable way," said the industry executive who wants to remain
anonymous. "Academics and anti-advertising advocates can spend
three months with a team of grad students unpacking this stuff and
the cost of identifying a single individual could be $100,000.
That'll never scale."