Highlights from #ApacheCon Europe

Last week I was fortunate to attend ApacheCon Europe for the first time. The event took place in scenic, historic Budapest over four days. This was the first ApacheCon to adopt a new format in which the conference was divided into two parts: ApacheCon Big Data, focused on the many Apache projects relevant to big data professionals, and ApacheCon Core, focused on other Apache projects and on discussions about the Apache Foundation, the Apache community, and the Apache Way. The next ApacheCon North America, in Vancouver this coming May, is set to use this format as well.

In this blog post I’ll refer readers to some of the most interesting talks I attended and provide color commentary:

Large-Scale Stream Processing in the Hadoop Ecosystem – Gyula Fóra, SICS and Márton Balassi, Hungarian Academy of Sciences

http://events.linuxfoundation.org/sites/events/files/slides/OpenSourceStreaming_0.pdf

This presentation is a great survey of four Apache stream processing platforms: Storm, Samza, Spark Streaming, and Flink. It breaks down the similarities and differences among the platforms along multiple dimensions and provides code examples from each. These slides are a great read for anyone deciding which streaming platform best suits their use case.

One-Click Hadoop Clusters – Anywhere (Using Docker) – Janos Matyas, Hortonworks

http://events.linuxfoundation.org/sites/events/files/slides/Cloudbreak%20-%20Budapest.pdf

This presentation summarizes the tech stack used to build Cloudbreak and Periscope, products now within the Hortonworks family that use Docker containers to spin up and to monitor and scale Hadoop clusters, respectively.

Architecture of Flink’s Streaming Runtime – Robert Metzger

http://events.linuxfoundation.org/sites/events/files/slides/ACEU15-FlinkArchv3.pdf

This presentation digs deep into the architecture and capabilities of Apache Flink, describing in detail how Flink is able to deliver low end-to-end latency, high throughput, fault-tolerance, and exactly-once message delivery (no small feat). Batch processing and higher-level abstractions and libraries are discussed as well. By the way, the schedule for Flink Forward in Berlin next week ( http://flink-forward.org/ ) looks excellent – consider going if you have the opportunity.

What’s New With Apache Tika? – Nick Burch

http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika.pdf

Nick walked everyone through the purpose, history, and near future of Apache Tika, a tool I’ve been using for several years to derive metadata from files on the web. In summary, Tika continues to add support for new file formats, now supports performing OCR on images to extract text (sweet…), and a Hadoop processing mode is in the works. Additionally, the talk (and the slides) contains some great advice, starting on slide 55, for those adding Tika to their solutions.

Being Ready for Apache Kafka: Today’s Ecosystem and Future Roadmap – Michael Noll, Confluent

http://events.linuxfoundation.org/sites/events/files/slides/Apache%20Big%20Data%20-%20Being%20Ready%20for%20Apache%20Kafka%20-%20Final.pdf

It was a completely full house for this talk. Michael discussed changes and new features in the upcoming Kafka 0.9.0 release, as well as some additional Kafka ecosystem projects. In 0.9.0, producers and consumers will no longer need ZooKeeper connections, and the revamped Java consumer is finally ready to ship. New features include Copycat, a tool for copying sets of documents between supported data repositories via Kafka, and Kafka Streams, a stream processing framework akin to Storm or Samza but embeddable into JVM code and transparently coupled to Kafka. It was great to learn about the Confluent Schema Registry, since methods for describing and publishing data schemas were central to my own talk at ApacheCon Europe.

Netflix: Integrating Spark at Petabyte Scale – Cheolsoo Park, Netflix and Ashwin Shankar, Netflix

http://events.linuxfoundation.org/sites/events/files/slides/Netflix%20Integrating%20Spark%20at%20Petabyte%20Scale.pdf

As with quite a few technologies, Netflix Engineering is pushing Apache Spark to its limits under real-world conditions. This presentation walked the audience through current use cases for Spark at Netflix and a number of the scaling challenges and bugs encountered (and overcome) along the way.

Integrating Fully-Managed Data Streaming Services with Apache Samza – Renato Marroquin, ETH Zurich

Although support for multiple message-passing technologies was a founding principle of Samza, Kafka support is what works “out of the box.” Renato discussed his experience actually hooking Samza streams up to Amazon Kinesis, diving deep into the architectures and assumptions of Samza, Kafka, and Kinesis along the way.

Finally, a plug for the talk that I gave on Monday:

Overcoming the Many-to-Many Data Mapping Mess With Apache Streams – Steve Blackmon, People Pattern

http://events.linuxfoundation.org/sites/events/files/slides/Apache%20streams%20Budapest%202015_0.pdf

These days we have the tools and resources to collect and wrangle data at unprecedented scale, yet we remain plagued by compatibility gaps and semantic nuances with every new source we invite into our domain. Despite the decades-long best efforts of well-meaning folks, data integration remains a many-to-many problem.

Apache Streams (incubating) is an open-source real-time reference implementation for the Activity Streams specification. Streams contains libraries and patterns for specifying, publishing, and interlinking schemas, and assists with conversion of activities and objects between the representation, format, and encoding preferred by supported data providers, processors, and indexes.
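To make the spec concrete, here is a minimal, hand-written sketch of an Activity Streams 1.0 activity of the kind such conversions target. The field names (published, verb, actor, object, objectType, id, displayName, content) come from the Activity Streams 1.0 JSON specification; the values and identifier scheme are hypothetical, and this is illustrative only, not output from the Streams library itself.

```python
import json

# A minimal Activity Streams 1.0 activity: an actor performs a verb on an object.
# Values and id format are hypothetical; field names follow the 1.0 JSON spec.
activity = {
    "published": "2015-10-02T12:00:00Z",
    "verb": "post",
    "actor": {
        "objectType": "person",
        "id": "id:example:people:12345",
        "displayName": "Example User",
    },
    "object": {
        "objectType": "note",
        "id": "id:example:posts:67890",
        "content": "Greetings from #ApacheCon Europe!",
    },
}

print(json.dumps(activity, indent=2))
```

Because every provider's payload is normalized into this one shape, downstream processors and indexes only need to understand a single schema rather than one per data source.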

In this talk I discuss the proliferation of data modelling and technology choices solution architects face, how the Activity Streams specification set out to make things simpler, how Apache Streams helps teams realize that potential, and techniques for using Apache Streams to tackle data integration problems in a fundamentally scalable manner.

Request a demo of the People Pattern platform here.