Marketers and consumers can agree on one thing: anonymity can be a wonderful thing. A marketer who wants to tailor the right message to the right audience doesn't need to know who you are to achieve that goal. Consumers, for their part, believe a marketer shouldn't get too personal. Why? They want assurance that their browsing and shopping behavior on the Internet won't be used to serve them ads that might be embarrassing (diet ads, say) or to exclude them from product or service eligibility. If we could restore anonymity and remove the 'you' from data collection, could we reach a happy medium of protected privacy AND relevant ads? I believe so.
In the offline world we have examples of the successful use of anonymity. One comes from database marketing, where data sharing sometimes occurs at the zip+4 level as a compromise between household-level and ZIP-code-level collection. Sharing at the ZIP level is too coarse; sharing at the household level can be too personal. Zip+4 splits the difference: a zip+4 target is typically a cluster of about seven homes (give or take). Because of laws around financial information, some finance and transaction data sets are 'anonymized' to the zip+4 level before they are shared. An example: people in this zip+4 cluster use dry cleaning services.
The point of all this is that laws in the offline world were constructed to make trade-offs along a spectrum between total anonymity and the very personal. Why didn't we go this way online? After all, most cookie targeting performed by ad servers isn't based on knowing who you actually are (i.e., no name, email, phone number, or other identifier that ties you to a real-world identity).
I can't tell you exactly when the image of anonymity online died, but I can say we lost a lot when we stopped pursuing it. The first setback came when AOL released its search logs. As soon as the logs were public, researchers exploited the fact that some people 'Google' themselves to re-identify individuals. Similarly, when Netflix released its movie-viewing data, "researchers demonstrated that an attacker who knows a nontrivial amount about a target individual subscriber's movie viewing habits can potentially identify the subscriber's record if it is present in the Netflix data set."
After these two events, researchers began publishing pieces about "click-prints" and related concepts, which made it popular to claim that even without ANY data saying who you are (i.e., name, number, email, Social Security number), you COULD theoretically be identified. The ad industry used to draw a distinction between personally identifiable information and anonymous targeting. That distinction is no longer being made: every time Facebook or some other entity with a strong ID does something with your data, privacy advocates discuss it in the same breath as non-personal cookie targeting.
Rather than throwing out the concept of anonymity, why don't we go the opposite way? Why don't we create data sharing techniques that are the online equivalent of zip+4? At BlueKai, we have started to support research on online anonymity, generally pursuing two well-understood anonymity approaches that have been used in other domains. The first, called k-anonymity, has been used extensively in the medical domain. Doctors and researchers need to share data in order to find cures for diseases. Patients, on the other hand, have good reason to want to keep their ailments to themselves, and HIPAA was constructed to protect that right.
In order to satisfy both constraints, researchers use k-anonymity algorithms to de-identify patient records in a way that makes the data safe to share. In particular, k-anonymity is a technique whereby attributes are dropped from records until every individual is indistinguishable from at least k-1 others, so each record resides in a group of at least k people. The result is that no individual is ever shared, only a group of people.
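The suppression step described above can be sketched in a few lines. This is a minimal illustration, not BlueKai's actual algorithm: it simply drops the most specific attribute (rightmost column) until every remaining combination of values appears at least k times. The zip+4 values and attributes are made up.

```python
from collections import Counter

def k_anonymize(records, k):
    """Suppress attributes (right to left) until every remaining
    combination of attribute values appears at least k times."""
    cols = len(records[0])
    while cols > 0:
        groups = Counter(tuple(r[:cols]) for r in records)
        if min(groups.values()) >= k:
            return [tuple(r[:cols]) for r in records]
        cols -= 1  # drop the most specific attribute and retry
    return [() for _ in records]  # full suppression: nothing was safe to keep

# Hypothetical (zip+4, gender, interest) browsing records.
records = [
    ("94107-1234", "M", "travel"),
    ("94107-1234", "M", "travel"),
    ("94107-5678", "F", "travel"),
    ("94107-5678", "F", "diet"),
]
# The "interest" column makes two records unique, so it gets dropped:
print(k_anonymize(records, k=2))
# → [('94107-1234', 'M'), ('94107-1234', 'M'), ('94107-5678', 'F'), ('94107-5678', 'F')]
```

A real implementation would generalize values (e.g., truncate zip+4 to ZIP) rather than only dropping whole columns, but the guarantee is the same: every shared row describes a group of at least k people, never an individual.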
The next technique is called noise injection, which destroys some of the data at the source so that the original can never be accurately reproduced and therefore never accurately tied to an individual. An extreme example of noise injection would be to randomly associate a set of attributes with every cookie regardless of actual behavior (for example, male or female). A variant called smart injection substitutes modeled behavior in place of random noise. To evaluate the impact of noise on the usefulness of targeting, we ran research comparing the conversions that occur when an ad is targeted to a group of people based on:
- randomly generated behavior
- precise travel behaviors or
- smart noise where a precise travel behavior in one city is substituted with a modeled behavior that did not occur but is statistically similar.
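The smart-noise variant in the third bullet can be sketched as follows. This is an illustrative toy, not our production system: with probability p, a precise behavior is swapped for a randomly chosen, statistically similar modeled behavior that did not actually occur. The behavior labels and substitute map are hypothetical.

```python
import random

def inject_noise(behaviors, substitutes, p=0.3, seed=42):
    """With probability p, replace each precise behavior with a
    randomly chosen, statistically similar substitute."""
    rng = random.Random(seed)  # seeded for reproducibility
    noisy = []
    for behavior in behaviors:
        if rng.random() < p:
            noisy.append(rng.choice(substitutes[behavior]))
        else:
            noisy.append(behavior)
    return noisy

# Hypothetical map: a precise travel behavior in one market is paired
# with modeled behaviors in statistically similar markets.
similar = {
    "searched-flights-SFO": ["searched-flights-OAK", "searched-flights-SJC"],
    "booked-hotel-NYC": ["booked-hotel-BOS", "booked-hotel-PHL"],
}
profiles = ["searched-flights-SFO", "booked-hotel-NYC"] * 5
noisy = inject_noise(profiles, similar, p=0.3)
```

Because some recorded behaviors never happened, no observer, including us, can treat any single profile as ground truth about an individual, yet the aggregate targeting signal is largely preserved.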
As part of this study, we found that precise targeting produced 70% more travel conversions than random targeting, but only 10% more than the smart-noise targeting. This suggests it may be possible to eliminate privacy concerns faster than we erode targeting accuracy.
All these techniques trade off accuracy (which the advertiser wants) against the degree of "individual-ness" (which the consumer wants minimized). The point of anonymity research is to provide an extra layer of protection for consumers. If we can agree on a standard of data collection that is accurate enough for the marketer and safe enough for the consumer, I'm more than confident we can reach our happy medium.