Hacking Boston With Consumer-Data Firm Dunnhumby
"Keep Calm and Continue Testing." The Harvard student's T-shirt tagline seemed to encapsulate the mood in the frigid room filled with data crunchers. It was just another drizzly Saturday in the MIT neighborhood of Kendall Square in Cambridge, Mass., where mostly young men gazed at their laptops, observing predictive models parsing data representing grocery-store purchases of things like DVDs and milk. It was just another hackathon.
But this time it was sponsored by a consumer-data firm hoping to foster innovations in data analysis, and perhaps get a jump on the competition when it comes to harvesting potential data-science employees. More and more companies with lots of data to play with are sponsoring hack events to tap into fresh analytical talent.
"It smells like math in here," quipped Malcolm Faulds, head of global marketing at Dunnhumby. The 24-year-old consumer-data company was sponsoring the hack, an 11-hour slog pitting small teams of coders against one another in a contest to come up with the most accurate model for predicting the sales success of several grocery items 26 weeks after launch.
The challenge was that the teams -- mostly students and tech entrepreneurs, only around five of them women -- were given data covering only the first 13 weeks of actual sales. Unlike hacks that culminate in new product designs or software applications, this was a consumer-data hack intended to tease out new ways of looking at historical sales data.
My first data hack
"This is my first data hack," said Harvard student William Chen, the Keep Calm T-shirt wearer and a junior studying statistics. Mr. Chen said he joined the hack to try some complicated techniques he's learned in class. He and his teammate, Harvard senior Ye Zhao, a physics and computer-science major, "like to solve puzzles a lot," said Ms. Zhao.
The data they were working with showed product categories like bread or coffee, the number of stores selling the items, number of units sold per week, the number of customers who had purchased the product and the number who had purchased it at least twice. It also revealed customer segments such as "Shoppers on a Budget" and "Family Focused" consumers.
The event was hosted by Hack/Reduce, a nonprofit that's made a home for itself inside the landmark Kendall Boiler and Tank Company Building, a 19th century brick structure. The organization, which gets funding mainly from private sources along with some public dollars, held its first hack in November. It was BYOD: Bring Your Own Dataset.
Dunnhumby supplied the data for its hack -- around 100,000 rows of it. The challenge was "to help predict how well that product is going to do in the future so the suppliers and retailers can adapt their marketing strategy and also their supply model," said Yael Cosset, global CIO of Dunnhumby, noting that CPG brands need to measure product launches as early as possible.
The company intends to run the star algorithms created during the hack on its own large data set to see how they might be incorporated into current forecasting models.
The information provided to the hack participants -- which included both people present at Hack/Reduce and those participating virtually from Moscow and beyond -- was generalized to protect personal consumer privacy and maintain brand anonymity. For instance, the data set included only product types rather than brand names, and grouped purchases by store region rather than specific store. Fifty-two students formed 20 teams on location, but there were 111 teams participating altogether, the remainder joining remotely from Australia, Canada, China, Hungary, Japan, the Netherlands, Portugal, Russia, Singapore, Ukraine and throughout the U.S.
At the end of the hack, which began at 9 a.m. and wound down around 8 p.m., Dunnhumby hoped to better understand shopping behavior and forecast future behaviors. "It's really for us to look for new techniques and models rather than answer a current situation," said Mr. Cosset.
From Russia with algorithms
Around 5 p.m., the room was quiet as people pecked out alterations to code or took swigs from cans of New England staple brew Narragansett.
Michael Rosa, data scientist at True Lens, a Cambridge-area startup providing social-data graphics (a spin on demographics), was among the Narragansett imbibers. He and his team tested a "Random Forest Model," which combines the predictions of many decision trees; a single decision tree, when visualized, resembles a family tree. "This is a common approach and we're tweaking it a little," he said.
Meanwhile, thousands of miles away in Moscow, Dmitry Efimov, a hack participant working remotely, topped the leaderboard, which was displayed on a large screen throughout the day. The Sophomore Olin Hackers had just leapt ahead several places, too.
The contest employed Kaggle, a platform used in a lot of competitive hacks to rank participants in real time. Mr. Efimov's leading score was 0.20076. The scores calculated the "root mean square logarithmic error" of each team's guesses, measuring how much the predictions deviated from the correct answers.
"It is really only useful in relative terms," said Ben Popp, director of engineering at Sqrrl, a young startup that's among the early tenants of the Hack/Reduce boiler building. "I don't know offhand whether .2 is a good or bad value. I just know that if one set of guesses is .2 and one is .3, then the .2 guesses are closer to the actual values," he explained.
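For readers curious about the arithmetic behind the leaderboard, here is a minimal sketch of a root mean square logarithmic error calculation. It assumes the commonly used definition, in which 1 is added to each value before taking the logarithm; the article does not spell out the exact formula the contest used, and the function name and sample numbers are purely illustrative, not contest data.

```python
import math


def rmsle(predicted, actual):
    """Root mean square logarithmic error (assumed standard definition).

    Each value has 1 added before the log is taken, so zero sales
    do not break the calculation. Lower scores mean closer guesses.
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Made-up unit sales for three products, for illustration only.
actual_units = [120, 45, 300]
close_guesses = [110, 50, 280]
rough_guesses = [60, 90, 150]

print(rmsle(close_guesses, actual_units))
print(rmsle(rough_guesses, actual_units))
```

As Mr. Popp suggests, the number matters mainly in comparison: the set of guesses nearer the true sales produces the smaller score.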
Sqrrl is a platform for building secure scalable applications involving hundreds of servers and petabytes of information. The company's founders worked with the National Security Agency on its Apache Accumulo project, a highly secure platform for use by the intelligence and defense communities. Sqrrl is based on that technology.
Mr. Popp formed team UpBenAdam along with Sqrrl CTO Adam Fuchs. By the early evening the two were joking about how their "stupid model" was performing better than a more complex approach. "Adam tried the smartest thing we could think of and I tried the dumbest thing we could think of," said Mr. Popp. The "stupid model," as he put it, looked at sales performance in week 13 and extrapolated from there to determine the sales at the end of the period studied. That approach turned out to be the best of their attempts, so the team submitted it as one of the two final models each team was allowed.
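The article does not say exactly how the team extrapolated, so the following is only a hypothetical sketch of what a baseline of this kind might look like. It assumes weekly unit sales stay at their week-13 level, optionally scaled by a constant week-over-week factor; the function name, the parameter and the sample figures are all invented for illustration and are not the team's code.

```python
def naive_week26_forecast(weekly_units, growth_per_week=1.0):
    """Hypothetical baseline: carry week-13 sales forward to week 26.

    weekly_units holds at least the first 13 weeks of unit sales.
    growth_per_week is an assumed constant multiplier; the default
    of 1.0 means "week 26 will look just like week 13".
    """
    if len(weekly_units) < 13:
        raise ValueError("need at least 13 weeks of sales")
    week13_units = weekly_units[12]  # lists are zero-indexed
    weeks_ahead = 26 - 13
    return week13_units * growth_per_week ** weeks_ahead


# Made-up launch data: strong early sales settling to a steady level.
first_13_weeks = [500, 420, 380, 350, 330, 320, 310, 305, 300, 300, 298, 297, 295]
print(naive_week26_forecast(first_13_weeks))
```

A baseline this simple ignores everything but the latest observation, which is why it serves as a yardstick: a more elaborate model has to beat it to justify its complexity.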
But it was no match for the SidPac team's model. With around 20 minutes left, the group of three MIT grad students cracked the .2 mark, hitting 0.1977 -- a breakthrough that stunned many in the room. The team was named after its members' MIT dorm, Sidney-Pacific, a large building housing nearly 700 students on Pacific Street in Cambridge. SidPac brought home the top prize -- $2,500 and bragging rights.