How We Do Data Science at People Pattern

At People Pattern we do a lot of data science. For a small organization, we have a fairly large data science team: five full-time data scientists and a cadre of interns. We’ve been in business for over a year, and from the start People Pattern has considered data science a core competency.

So we’ve got some opinions about how to make data science work. Here are some of them. These opinions aren’t technical or scientific ideas, though we’ve posted those sorts of things before. Instead, this post is about how we plan and do data science, as a team, on a day-to-day basis.

Want to skip to the best parts? Here’s what you need to know:

  • Empirical evaluation is clutch for data science
  • Keep a lab notebook, you’ll thank us later
  • When tackling a new or challenging problem, always, always, try the dumb thing first
  • Talk to your team members, disseminate and collect ideas to solve your problems

First things first

First, what do we mean by data science? Like “big data” and similar buzzwords, it seems everybody doing data science has a different notion of what data science is or should be.

And we do, too. Or at least, we have a pretty specific domain of data science we operate in. Specifically, we apply machine learning and statistical natural language processing to social media and customer loyalty data to extract insights about our customers’ audiences. More concretely: we predict demographics (age, race, location, gender) from users’ names and descriptions, we predict interests based on social conversations, we discover natural groupings of interests within a community (i.e. personas), we identify key words and phrases that characterize people or interest groups, we predict sentiment, and several other things besides. It’s a lot, but there’s a world of data science applications and techniques we don’t touch at all — like genomics research, or trying to analyze 1 million gigabytes of information per second coming out of the Large Hadron Collider.

We write production code

One more point of background: we’re building a company and a product; it has to work, and it has to work at scale. We are not building a proof of concept or experimenting for its own sake. So, when we build data science-oriented products and services, we focus on interfaces and implementations that will serve our audience dashboard product in a timely fashion. That’s not to say we optimize right out of the gate (premature optimization being the root of all evil), but the ultimate requirement of our data science work is to operate in an online SaaS product, so runtime performance matters.

Data science = greenfield development

Having said all that, we’ve found that data science products (our own at least) are almost by definition greenfield projects. We’re using machine learning or statistical analysis to extract original insight from a relatively new data source (social media data), so we’re working on problems that haven’t been worked on before. There aren’t existing solutions or products for these kinds of things.

Moreover, most of the ideas we try for solving these kinds of problems don’t work — which of course is completely expected. With any new problem, you may try one of a dozen approaches — for us, that might mean treating something as a classification problem, or a clustering problem, or a semi-supervised learning problem, or a latent variable problem, or perhaps simply a search problem. Which approach is best is a matter of exploration and experimentation. As such, this sort of development, trying out new ideas which are as likely as not to get tossed out tomorrow, probably should not be integrated into the main branch of your product’s codebase.
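
To make that concrete, here is a minimal, hypothetical sketch (scikit-learn and toy data, not our actual models or pipelines) of what trying two different framings of the same problem might look like:

```python
# Hypothetical sketch with scikit-learn and toy data; not our actual models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

bios = [
    "coffee lover, mom of two, austin tx",
    "software engineer building distributed systems",
    "gamer | streamer | esports fan",
    "marathon runner and yoga teacher",
]
interest_labels = ["parenting", "tech", "gaming", "fitness"]  # toy labels

X = TfidfVectorizer().fit_transform(bios)

# Framing 1: supervised classification, if labeled examples are available.
classifier = LogisticRegression().fit(X, interest_labels)
print(classifier.predict(X))

# Framing 2: unsupervised clustering, if we only want natural groupings (personas).
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(cluster_ids)
```

Neither framing is obviously right up front; which one survives is exactly the kind of question the prototype is there to answer.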

We embrace prototypes

Thinking of data science products as greenfield projects, it follows that you start with a prototype. And this is good: starting fresh with a clean slate, the scientist or engineer has great freedom to experiment with ways of framing the data science problem and techniques to solve it. Unencumbered by other infrastructure, prototypes allow the data scientist to get insight into the data relatively quickly, produce some graphs or numeric tables characterizing what the problem is like, and try out solutions. A prototype is a good starting point for a simple demo, which helps a data scientist or engineer share his or her ideas with other team members and get some visual insight into the technique being evaluated. And more than anything, starting with a greenfield prototype enables an engineer to pursue what Nathan Marz calls “suffering-oriented programming” in its purest form: “First make it possible. Then make it beautiful. Then make it fast.”

We embrace API microservices

We start from prototypes, but we need to get to heavy-lifting production code reasonably quickly. We’ve pursued various strategies for this in the past, but lately we’ve come to embrace API microservices (especially RESTful Web services). As much as possible, we try to design our APIs around the “Unix philosophy”: “Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features.” There are many reasons to pursue API-driven architecture in software development; see in particular Steve Yegge’s excellent and accidentally-published Google Platforms Rant, and APIs: A Strategy Guide by Jacobson, Brail and Woods (here’s a sample chapter).

For us, developing microservices in isolated code repositories with minimal dependencies actually provides the simplest path from prototype to production. It allows the data scientist and engineer to focus on the essence of the feature being developed, and to hew closely to the best implementation developed in the prototype. Also, in data science and machine learning, the best tool for a given problem may be a library written in Python, or in Go, or C++, or Ruby, or Scala, or Java — we’ve found it easier and more pleasant to build a RESTful Web service in (say) Python around an excellent Python scientific library than to re-implement that library in (say) Java just to integrate with a framework we may already be using in production. Even when we craft our own custom solutions, similar considerations apply: a given algorithm may have a simpler, more elegant, and more maintainable implementation in one language than in another. Using networked API services as the primary interface between pieces of product functionality lets us use the best language, library or utility for the job.
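
As an illustration of the pattern, here is a minimal sketch of a Python microservice wrapping a scikit-learn model behind a single REST endpoint. The model file, route, and field names are hypothetical, not our actual API:

```python
# Minimal sketch of the microservice pattern: a Flask app exposing a
# scikit-learn model behind one endpoint. Model file, route, and field
# names are made up for illustration.
from flask import Flask, jsonify, request
import joblib

app = Flask(__name__)
model = joblib.load("sentiment_model.joblib")  # assumed: a fitted sklearn Pipeline

@app.route("/sentiment", methods=["POST"])
def sentiment():
    # Expects e.g. {"texts": ["love this product", "worst launch ever"]}
    texts = request.get_json()["texts"]
    predictions = model.predict(texts).tolist()
    return jsonify({"predictions": predictions})

if __name__ == "__main__":
    app.run(port=5000)
```

The service does exactly one thing, so whatever runs the rest of the product only needs to speak HTTP and JSON to it, regardless of what language either side is written in.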

Empirical evaluation is clutch

We embrace prototypes, we’re constantly testing and improving our ideas, and we’re constantly moving ideas from prototype to production. This can be risky, of course, so we work hard to evaluate our techniques empirically throughout. That means when we start work on a feature, we curate a data set for evaluation and keep it around to track improvements on the feature throughout its development. When new use cases or edge cases are discovered for a feature, we create new evaluation data sets for those as well.

Fundamentally, lots of factors go into the quality of a data science product, not just code quality: training data, background data, and resources like dictionaries and databases all matter. Monitoring and maintaining data for empirical evaluation lets us track the effect of changes to any of these, not just changes to the code.

For us, these evaluation datasets usually correspond to test sets for our machine learning classifiers. If you’re familiar with classifier evaluation, you may be familiar with cross-validation. Using cross-validation means you can skip the work of curating and maintaining evaluation datasets, but still have a justifiable evaluation of your classifier’s performance.

We don’t like cross-validation. It’s true, it can be useful for checking that your classifier is sane, especially when you’re trying something for the first time. But one of the simplest and best ways to improve a machine-learning classifier is simply to add good annotations to the training dataset. Under cross-validation, the training data and the test data are drawn from the same pool, so adding annotations changes your test set along with your training set, and you can’t reliably see how the additional training data affects the classifier. Better to build evaluation datasets you know and have confidence in, and track improvements on those.
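
Here is a small, hypothetical sketch of the difference (toy synthetic data and scikit-learn): with a fixed held-out evaluation set, the numbers stay comparable as the training data grows, whereas cross-validation re-splits whatever pool you hand it:

```python
# Toy illustration with synthetic data; the point is the workflow, not the model.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Curate a fixed evaluation set once and keep it unchanged.
X_train, X_eval, y_train, y_eval = train_test_split(X, y, test_size=0.25, random_state=0)

# Model trained on the original annotations, scored on the fixed eval set.
small = LogisticRegression(max_iter=1000).fit(X_train[:500], y_train[:500])
print("500 training examples :", accuracy_score(y_eval, small.predict(X_eval)))

# After adding annotations, the eval set is the same, so the numbers are comparable.
large = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("1500 training examples:", accuracy_score(y_eval, large.predict(X_eval)))

# Cross-validation re-splits whatever pool it is given: fine as a sanity check,
# but adding data changes the test folds along with the training folds.
print("5-fold CV             :", cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean())
```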

We keep lab notebooks (← this is the most important point of the blog post)

When developing new features, training and evaluating new classifiers or analyses, we track progress. We track changes in training and evaluation datasets, we track changes in the algorithms and models we use, and we track changes in the overall settings (or hyperparameters) we apply to our models. We try out ideas, see if they work, and throw them out if they don’t. With all this come risks. You might throw out a good idea prematurely. You might run ten experiments in quick succession, leave for the day, and the next morning forget precisely which settings produced the best results.

When doing empirical data science work — in industry or academia — keep a lab notebook. I cannot emphasize this enough. Keep a running document of which experiments you’re running, what the input data is, what the results are, and where you’re archiving any artifacts output by the experimental process.

There may be dozens of technologies to facilitate this — from note-sharing applications to specialized databases for archiving experiments and results — but we started with the simplest thing that could possibly work, and so far it has worked well. We maintain a text document (OK, really Markdown, which is still basically plain text) alongside our experiment-running code in our prototype code repositories. Often the prototype code grows into an API microservice, and we carry the lab notebook along with the project. The simple, free form of plain text encourages the data scientist to ramble extemporaneously about ideas, observations or hunches, which are exactly what we want to capture when developing new machine learning-based applications. What’s more, these are exactly the kinds of things that get lost in more formal results-tracking applications, or even a spreadsheet.

We organize our lab notebooks a little like a blog: entries are organized first and foremost by date, and the content can really range between extemporaneous observations to results tables to links to spreadsheets and graphs and so on.
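
To give a feel for the format, here is an invented entry; every file name and number below is made up purely for illustration:

```markdown
## 2015-03-12

Trying character n-grams (2-4) instead of word unigrams for the age classifier.
(All file names and numbers below are invented, for illustration only.)

- train: names_v3.tsv (48k rows); eval: age_eval_v2.tsv (unchanged)
- char n-grams, C = 1.0 -> macro-F1 0.61 (was 0.58 with unigrams)
- hunch: most of the gain is on short names; check per-bucket errors tomorrow
- artifacts archived under experiments/age/2015-03-12/
```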

Lab notebooks are so great. Without them, it’s genuinely hard for a data scientist to pick up where they left off in an experimental project, even if it’s only been a day or two since they last worked on it. Experimental work often takes time — days or weeks to fully evaluate a new idea — and it’s hard to illustrate progress on these things. Too often product managers have to hear that a feature is “under development” for weeks with little by way of deeper insight. Our lab notebooks help our data scientists track their own progress working through a tough puzzle, and help us explain to our (excellent) non-technical personnel what we’re working on, how we’re approaching it, and what progress we’ve made, even before it’s ready for production.

We do the dumb thing first (← this is the second most important point)

I’ve omitted technical or scientific details about the machine learning techniques we like to use at People Pattern, and solutions we’ve investigated and deployed in house. But I will say this: when starting to tackle a new problem, we have a rule:

Try the dumb thing first.

We’re often surprised at how well a simple technique works compared against heavy-hitting statistical inference, especially for difficult problems and noisy data. First, you have to make sure whatever statistical technique you pursue actually beats the dumb baseline. Second, you may eventually find that the simple technique implemented in a couple of hours works better than, or nearly identically to, the complex model, while being simpler to maintain, easier to explain to customers, and faster in production. If so, the simpler solution is the right solution.
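
As a concrete, if hypothetical, illustration of the rule: a majority-class baseline takes a few lines, and any fancier model has to clear that bar before it earns its complexity:

```python
# "Dumb thing first": a majority-class baseline in a few lines.
# Labels here are placeholders; the point is the bar, not the numbers.
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Predict the most common training label for everything and score it."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(1 for y in eval_labels if y == majority) / len(eval_labels)

train_labels = ["positive"] * 70 + ["negative"] * 30
eval_labels = ["positive"] * 65 + ["negative"] * 35
print("majority baseline accuracy:", majority_baseline_accuracy(train_labels, eval_labels))
```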

In contrast, consider this piece’s comments on the “marks of real data science”: Look for Algorithms, Not Queries, supposedly the “first level of distinction between real and pseudo data science”.

Seriously? No. While we use a lot of machine learning algorithms at People Pattern, very often we find that a query is exactly the right solution for a problem. Real data scientists (to use the article’s obnoxious phrasing) figure out when to use a query and when to develop a machine learning algorithm; or when to use a query to improve a machine learning technique. Fundamentally, results are what matter, not technique.

There’s simply no a priori reason to suppose machine learning algorithms are the right, or correct, or only solution to a problem. Even Google, for one, does not use machine learning to rank search results, despite a large body of research on learning to rank.

We talk shop during our meetings

Finally, we have weekly meetings where we go around and discuss what we’ve been working on. Not to keep managers in the loop or to let PMs urge developers to get features done faster, but because it gives us a chance to talk shop: ask around about different approaches, what libraries and techniques we’ve been experimenting with, how to crowdsource annotations we could use to improve our classifiers, how best to evaluate our ideas, and the like. At its best, data science is a social activity, and we make a point of setting aside time each week to keep it that way.

Wrapping it up

All of this is to outline how we at People Pattern practice data science. Organized like this, it reads a little like a laundry list, but living it day-to-day it all fits together: when we start a new data science problem, we treat it as a greenfield project and set up a prototype. Prototypes are nice in that they allow the data scientist or engineer to focus directly and specifically on the problem being addressed, and when they graduate into production code we often keep them isolated as API microservices. Building production data science products involves loads of experimentation and back-and-forth, and we’d be lost if we didn’t maintain lab notebooks as we go. Before putting effort into a complex solution to a data science problem, we make sure to try the dumb thing first: it helps us understand the problem and establish expectations about the solution. Finally, having a team of data scientists around as we do is huge: we often try a number of approaches to each problem we tackle, and it is invaluable to have others vet and improve those approaches, as well as suggest new ones.

Request a trial of the People Pattern platform here.