How Web's Biggest Sites Leak Personal Data to Google and Facebook

Study Suggests That Data Leaks Are Pervasive, Occurring on 59% of Sites Tested


New research shows that personal information including names and sometimes even email addresses is routinely passed from the biggest sites on the web to third parties such as Google, ComScore and Facebook.

Conducted by researchers at Stanford University, the study shows how personal information is commonly, and often unintentionally, leaked when, for example, a username is included in a URL or a page title after a user registers to use a site. Third parties embedded in that page could receive the URL in a referrer header, the data that tells a website which page linked to it, explained Jonathan Mayer, lead researcher on the project. With the URL comes the user's name, which is often easily deduced from a username or user ID.
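To make the mechanism concrete, here is a minimal sketch of how a username embedded in a page URL can end up in the referrer header of a request to an embedded third party. The URLs, the username and the function names are hypothetical, chosen only for illustration. The sketch assumes the long-standing browser behavior of sending the full page URL, minus the fragment, as the referrer; browsers with stricter referrer policies send less.

```python
from urllib.parse import urlsplit, urlunsplit


def referer_for(page_url):
    """Return the page URL as it would appear in a Referer header.

    Simplified assumption: the full URL is sent, with only the
    fragment (the part after '#') removed.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


def third_party_request_headers(page_url, resource_url):
    """Build the headers a browser might send when the page at
    page_url loads an embedded resource from resource_url."""
    return {
        "Host": urlsplit(resource_url).netloc,
        "Referer": referer_for(page_url),
    }


# Hypothetical profile page whose URL contains the username.
page = "http://example.com/users/jane_doe_1985/settings?tab=profile#top"
# Hypothetical tracking pixel served by an embedded third party.
pixel = "http://tracker.example.net/pixel.gif"

headers = third_party_request_headers(page, pixel)
print(headers["Referer"])
```

In this sketch the third party never asks for the username. It arrives anyway, because the page URL that contains it travels along with the request for the pixel.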

Mr. Mayer and his group looked at 185 of Quantcast's top 250 sites -- sites that allow users to sign in or provide other identifying information, don't require a purchase for sign-up, and that weren't inordinately complex (thus excluding Google, Facebook and Yahoo) -- and used fictitious accounts to create profiles or change user settings. They then examined the referrer headers and other relevant data that resulted from the interactions and searched them for personal information.
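The kind of search described above can be sketched in a few lines. This is not the researchers' tooling, which the article does not describe. It is a simplified illustration that assumes the captured traffic is available as a list of request URLs with their referrer values, and that a plain substring match is enough to spot a test identifier. The function name, the record format and the sample data are all hypothetical.

```python
from urllib.parse import urlsplit, unquote


def find_leaks(first_party_domain, captured_requests, identifiers):
    """Map each third-party host to the test identifiers it received.

    captured_requests: list of dicts with a "url" key and an optional
    "referer" key (hypothetical capture format).
    identifiers: strings planted in the fictitious account, such as
    a username or an email address.
    """
    leaks = {}
    for req in captured_requests:
        host = urlsplit(req["url"]).hostname or ""
        # Naive first-party check: same domain or one of its subdomains.
        if host == first_party_domain or host.endswith("." + first_party_domain):
            continue
        # Look in both the request URL and the referrer, after URL-decoding.
        haystack = unquote(req["url"] + " " + req.get("referer", "")).lower()
        for ident in identifiers:
            if ident.lower() in haystack:
                leaks.setdefault(host, set()).add(ident)
    return leaks


# Hypothetical capture from one page view on example.com.
captured = [
    {"url": "http://example.com/style.css",
     "referer": "http://example.com/users/jane_doe_1985"},
    {"url": "http://tracker.example.net/pixel.gif",
     "referer": "http://example.com/users/jane_doe_1985"},
    {"url": "http://ads.example.org/ad?email=jane%40example.com"},
]

for host, found in sorted(find_leaks("example.com", captured,
                                     ["jane_doe_1985", "jane@example.com"]).items()):
    print(host, sorted(found))
```

A real audit would also need to account for hashed or encoded identifiers and for data sent in cookies or request bodies, which a substring search over URLs and referrers would miss.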

According to their findings, a username or user ID was leaked to third parties on 109 websites, or 59% in their sample, and the top five recipients of leaked information were sites operated by ComScore, Google Analytics, Quantcast, Google's DoubleClick ad platform and Facebook.

Google denied that any personal data is intentionally collected or used. "We've never attempted or wanted to parse out personal information in any URL schema provided by a third party site," a spokesperson told Ad Age.

"Frankly this was common knowledge among many computer scientists who have looked at this space. As you look at URLs, you can see your username put in there." said Mr. Mayer, a graduate student in computer science at Stanford who's also a fellow at the law school's Center for Internet and Society. He said that the project was partially inspired by a recent paper that looked at sign-up and interaction with 120 popular sites and found that 56% leaked some form of private information, while 48% leaked a user identifier. The results were reported in aggregate and didn't look at individual sites.

The Stanford findings reveal that some sites passed personal information to dozens of third parties. The photo-sharing site Photobucket, for example, which embeds usernames in many of its URLs and serves ads on most of its pages, sent the researchers' username or user ID to 31 third parties. But the top two leakiest portals were movie reviews site Rotten Tomatoes and mothers' community CafeMom, which sent test usernames or user IDs to 83 and 59 third-party sites, respectively.

"A lot of this was a function of how dynamic the advertising on the website was," said Mr. Mayer, who observed that sites using multiple ad networks or exchanges seemed to leak user information most widely.

Also included in the report were findings that viewing a local ad on Home Depot's site sent the user's first name and email address to 13 analytics services and data providers. And entering the wrong password on the Wall Street Journal's site brought the test user to a page with their email address embedded in the URL, which was ultimately sent to seven companies. The Wall Street Journal has published widely on the topic of online privacy as part of its What They Know series.

"We were made aware of a bug and have since corrected the issue," said a spokesperson for Dow Jones, which publishes the Wall Street Journal. "We are continuing to audit the site."

According to Mr. Mayer, the findings are relevant to the ongoing debate about do-not-track regulation, since some of its critics contend that tracking is anonymous and thus harmless. He noted that there's a mounting body of evidence showing that information leakage is pervasive.

"The claim we're trying to make isn't about evil websites," he said. "It's about the way the web is today. The web is suffused with identity."
