People Pattern Swaps Big Data Stories at Data Day Texas

In early January, Austin played host to hundreds of data geeks from around the United States for the Third Annual Data Day Texas at The University of Texas’ AT&T Executive Education and Conference Center. Data Day Texas is a day packed with talks and networking, exploring the latest innovations in infrastructure, tools and methods for data storage, access and analysis.

Paco Nathan kicked off the morning with a recap of the year in data science, which included a nice historical perspective on machine learning. This is an important sub-message for today’s data-oriented folks: machine learning wasn’t always the way things were done, and it took a while to prove itself; the computing power, the data and the algorithms all had to mature before learning-based approaches could dominate the previous rule-based systems. There were cultural factors as well: in natural language processing, for example, the Bell Labs group made major breakthroughs using machine learning for speech recognition in the early 1980s, but it took a while for the text processing crowd to catch on. (That was in no small part due to DARPA telling text-oriented NLP researchers that they needed to team up with machine learning types if they wanted to keep their funding.) By the mid-1990s, after the field figured out that support vector machines and logistic regression (maxent) were amazingly useful, there was roughly a decade of re-conceiving many previously rule-based systems as learning-based systems. There weren’t many open source packages in the 90s, so you usually had to dive in and write your own. My OpenNLP Maxent package (old site here) was probably the first commercially friendly open source (LGPL-licensed) maxent package, and for those of us who were working in this space before 2000, it is incredible to see today’s plethora of polished, high-quality open source packages for doing all sorts of machine learning. The barrier to entry is ridiculously low compared to what it was, and that is a very good thing indeed!

Dean Wampler talked about Scala’s role in the big data world today, with an emphasis on the ease of development that Scala libraries like Scalding and Spark provide over the Java API for Hadoop. A key point for Dean is that these tools let programmers essentially write scripts for performing distributed computations, in contrast to the big programs full of special tricks and incantations that Java Hadoop programming tends to require. This argument resonated with me: I taught a course with Matt Lease on MapReduce in 2011 and let students write their code in Java or Scala. At that time, the Scala APIs for Hadoop were less well developed and Spark was still a little-known project, so it is exciting to see Scalding presented as a mature library and the recent surge of momentum behind Spark. It is worth noting that Python is a great alternative as well: it has the fantastic scikit-learn package (nothing of comparable quality and machine learning coverage exists in the JVM world, to my knowledge), and PySpark is maturing and provides a similarly scriptable interface for distributed computation.
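To give a flavor of that script-like style, here is a minimal word count sketch using Spark’s Scala API. It is not from Dean’s talk; the object name and the input and output paths are made up, and a real job would run against a cluster rather than in local mode. The whole computation is a handful of chained transformations, compared to the separate mapper, reducer and driver classes a Java MapReduce version would need.

```scala
// Minimal sketch (not from the talk) of a "script-like" distributed computation:
// word count over text files using Spark's Scala API. Paths are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("word-count-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("hdfs:///data/books/*.txt")  // hypothetical input path
      .flatMap(_.split("\\s+"))                           // tokenize on whitespace
      .map(word => (word, 1))                             // emit (word, 1) pairs
      .reduceByKey(_ + _)                                 // sum counts per word

    counts.saveAsTextFile("hdfs:///output/word-counts")   // hypothetical output path
    sc.stop()
  }
}
```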

Rob Munro, CEO and founder of Idibon, talked about the importance of humans in the loop in machine learning for natural language processing applications. By being clever with your machine learning model, you can sometimes get a bump of a few points in performance over a less clever model, but by using humans wisely, you can often get massive improvements that dwarf the model-based ones. This is key to Idibon’s business, which seeks to enable information extraction and NLP for a large set of the world’s languages. It also accords with my academic research on NLP for low-resource languages: for example, my students Dan Garrette and Jason Mielens and I showed in Garrette et al. (2013) that just four hours of annotating words with their parts of speech was sufficient to reach 90% accuracy on the part-of-speech tagging task (and, well, it required some clever techniques too).

Ted Dunning is a veteran of the big data world who also happened to publish papers in computational linguistics in the early 1990s (example here). He focused on the very interesting and tricky problem of debugging and modeling (e.g. fraud detection) when you cannot have access to the data your client needs you to analyze. His solution centered on randomly generating datasets so that bugs and problems can be diagnosed without access to the original data; the core principle is to create random data sets that break the same way as the true data. Data scientists often create generative models for analysis and prediction, but this was an interesting take on defining generative models in a context where you never get to see the actual data and have only a simple signal the client can iteratively provide to let you know whether you are on the right track. I told Ted that his talk gave me a new angle for motivating my students to do more error analysis in their papers: if you actually CAN look at the data (as one can with academic datasets), you’d be crazy not to, since the problem is so much harder when all you have is a basic response (e.g. accuracy on some development set).
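As a rough illustration of the idea (my own sketch, not Ted’s actual approach or tooling): suppose the client reports that their pipeline chokes on a small fraction of zero-amount transactions. You can generate synthetic data with that same property and debug against it without ever seeing the real records. The field names, distributions and rates below are entirely made up.

```scala
// Hypothetical sketch: build a synthetic dataset that "breaks the same way"
// as the client's real data, here by injecting a small fraction of
// zero-amount transactions into otherwise log-normally distributed amounts.
import scala.util.Random

object SyntheticTransactions {
  case class Txn(id: Long, userId: Int, amount: Double)

  def generate(n: Int, zeroRate: Double, seed: Long = 42L): Seq[Txn] = {
    val rng = new Random(seed)
    (1 to n).map { i =>
      val amount =
        if (rng.nextDouble() < zeroRate) 0.0              // the failure-inducing case
        else math.exp(3.0 + 1.2 * rng.nextGaussian())     // plausible "normal" amounts
      Txn(i.toLong, rng.nextInt(10000), amount)
    }
  }

  def main(args: Array[String]): Unit = {
    val txns = generate(n = 100000, zeroRate = 0.001)
    // Feed txns through the same pipeline the client runs; if it fails here,
    // the bug can be diagnosed without ever seeing the client's data.
    println(s"Generated ${txns.size} transactions, ${txns.count(_.amount == 0.0)} with zero amount")
  }
}
```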

Sandy Ryza gave a talk on “Spark says the darndest things”, which was basically a set of recommendations for deciphering the long stack traces Spark delivers to programmers when things aren’t going right. The errors often have no clear connection to the underlying problem, so it’s handy when someone can give you road signs to navigate them when you hit them yourself. It also reminded me of my regret about switching from Python to Scala for teaching classes for new programmers at UT Austin. Scala is great, but it was nowhere near as friendly as Python for newcomers. I once had a student say “Scala hates me!” after it had given her a night full of long stack traces. That ease for new programmers explains a lot of Python’s ascendance in areas of academia, like biology, that have only recently come to require programming.

Julie Evans and Chris Johnson (who was a student in my 2011 MapReduce course, huzzah!) gave talks on data pipelines at Stripe and Spotify, respectively. Julie gave a very honest account of things that didn’t work well, or not as well as hoped, but that ultimately came together. At the end, she said something like “as a developer you bang your head at things and never feel quite satisfied with the solutions, but then you turn around and your coworkers think that what you can get done is magic”. It’s a nice take on a problem pretty much all developers struggle with: you know it could be much better if only you could do X and/or Y, or if there were a library that supported Z, and so on; but you still, hopefully, end up creating great stuff that amazes others. Chris focused on Spotify’s transition from a Python stack to a Scala stack for data pipelines. They are using Scalding and Spark, and he gave recommendations for tooling around them that is useful for the music recommendation problem. For example, Parquet is a columnar big data format that is especially useful when you need to access only a few attributes of a data item (e.g. attributes of a song), don’t want to shuffle the entire set of attributes around just to get at them, and are looking for better data compression. (Julie also discussed using Parquet at Stripe.)
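As a small, hypothetical illustration of the columnar payoff (not code from either talk, and using a Spark DataFrame API rather than whatever their pipelines actually look like): selecting two attributes from a Parquet file reads and decompresses only those two columns, rather than deserializing every field of every record. The path and column names are invented.

```scala
// Hypothetical sketch: read only a couple of columns from a Parquet file.
// With a columnar format, the untouched attributes never leave disk.
import org.apache.spark.sql.SparkSession

object ParquetColumnSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-column-sketch")
      .master("local[*]")
      .getOrCreate()

    // Only "trackId" and "playCount" are read; other song attributes are skipped.
    val plays = spark.read.parquet("hdfs:///data/plays.parquet")
      .select("trackId", "playCount")

    plays.groupBy("trackId").sum("playCount").show(10)

    spark.stop()
  }
}
```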

In addition to me, People Pattern was represented at the event by Charles Mims, Senior Development Operations Engineer, and Joey Frazee, Director of Engineering. Having three of us at the conference allowed us to spread out across different talks and capture a fairly wide swathe of the topics. That was especially handy because there were no breaks during the day. The organizers did this for the right reasons (lots of great talks, and trying to ensure they could all be presented), but it made things feel rushed, and it meant you had to consciously skip a session in order to chat with someone. I would have preferred 45-minute sessions, which would leave time for some breaks.

In sum: there were lots of great, interesting and useful talks! Kudos to Lynn Bender and the others who organize the event each year. It really shines a spotlight on Austin and helps our data-oriented community build and maintain momentum. A major improvement for future years would be to bring in more female speakers. (I’m sure the organizers would love recommendations for women to invite next year.) And though the day itself offered only a few opportunities for hallway conversations, it wrapped up with a happy hour that gave everyone a chance to talk. I didn’t catch up with everyone I’d hoped to, but I had a number of great conversations. Best of all, I ended up having some excellent barbecue downtown with a great group, Austin style.

Request a demo of the People Pattern platform here.