STRATA is massive. O’Reilly’s oversold Big Data 2013 conference drew a record 3,000 attendees – a sign that big data had finally arrived. Like big data itself, the sheer number of workshops was staggering and intimidating: 100 sessions across two days, and that’s not even counting Monday’s tutorials. It’s impossible to see it all, but in the spirit of data science, one way to deal with that complexity is to use narrative as a form for communicating interpretation.
I will share the things that stood out to me from the conference, sacrificing comprehensiveness for storytelling. To take it in in its entirety, you’ll just have to go next time!
The conference began on Tuesday morning, opened by Edd Dumbill and Alistair Croll. In a lightning round of 5-minute talks, 10 speakers from industry gave different takes and announcements for technology.
Each announcement shined a different light on how companies were approaching the big data space, and how it is evolving. Here are three highlights:
Data Infrastructure Vendor – Moving to a central role in enterprise:
Mike Olson of Cloudera talked about the release of Cloudera 5, and the new conception of Cloudera’s offering as an Enterprise Data Hub. Behind the market positioning was the implication that Hadoop was maturing as a technology, and that mere storage was no longer enough. As Andrew Brust wrote for ZDNet following the keynote:
Here’s the gist: Cloudera is now referring to its Hadoop offering as an “Enterprise Data Hub.” In doing so, the company is staking a claim that Hadoop isn’t just a data scientist sandbox anymore.
In fact, and in the spirit of NYC DataWeek, we might characterize Cloudera’s take as Hadoop being the Ellis Island of data, an intake center if you will, where said data can be cleansed, shaped, aggregated, queried, indexed and searched, before heading elsewhere. And in some companies, Hadoop isn’t just Ellis Island, it’s Manhattan — where data comes to reside, and get monetized.
Big Data End User – Moving beyond infrastructure through analytics to impact:
Ken Rudin, head of FB Analytics, further hammered this theme home when he took the stage, saying that the end goal of data analysis is business relevance, not just data storage. He went on to talk about infrastructure choices being driven by use case: FB actually started with Hadoop but adopted SQL to address different types of query needs.
On the subject of organization, he said that FB adopts an embedded model, where at least one analyst sits with each product/engineering team to help shape product decisions. This ensures that Analytics can meet its end goal, which is not just reactive reporting or insight – it is impact: moving a metric, improving a product, or changing user behavior.
It was striking how data-driven and data-literate FB aims to be – every employee goes through DataCamp, a two-week course. And what does FB look for in an analyst? 50% statistics/analysis, 50% business savvy.
For a short 5 minutes, Ken’s talk packed a punch and drew one of the highest spikes in Twitter response.
Untapped reservoirs – Moving towards new sources of big data:
Following the metaphor of Hadoop and NoSQL technologies as ‘data lakes’, an exciting new area of development is the source of their streams. For FB and many other adopters of big data technologies, the incoming stream is mostly events and web logs – records of how people interacted with the web product. As big data is adopted more widely by other industries and sectors, the sources of that information are changing as well.
Championing the potential behind this change was Michael Chui, Principal at the McKinsey Global Institute. He concluded Tuesday’s keynote session with a thoughtful look at how Open Data (machine-readable data) – especially from the government – could generate up to 3 trillion dollars of value across just 7 sectors. Borrowing terminology from economics, Michael talked about the “liquidity” of data, as controlled by factors such as accessibility, machine-readability, cost, and rights of re-use. “The more eye-balls, the bigger the multiplier effect on data usage.” In this vision of the future, the government has a substantial role to play – both in making more data available to the public, and in setting the rules for how sharing will be done.
Intrigued by the implications of big data, I followed the train of thought to the session led by Robert Kirkpatrick, Director of UN Global Pulse, on Data Philanthropy: Private Public Sector Big Data Relationships. Started out of the executive office of the UN Secretary-General, UN Global Pulse is an initiative to take advantage of digital data sources and real-time analytics to predict emerging crises and understand human well-being around the world. Its projects so far range from “Digital Signals and Access to Finance in Kenya” to “Understanding Women and Employment in Indonesia,” and it is in the midst of setting up Pulse Labs in different areas around the world.
Also on the panel were Mark Leiter, Chief Strategy Officer at Nielsen; Patrick Morrissey, VP of Marketing at SiftData; and Elizabeth Breese, PhD, a social scientist at Crimson Hexagon. The session was a showcase of the motivation behind data philanthropy and the different use cases it might be applied to.
As Mark described it, Nielsen is an information-collection company that gathers data on consumer behavior in 100 countries – what people are watching (TV) and buying. Under a program called Nielsen Cares, the company gives away millions of dollars’ worth of data every year, for several reasons:
- in service of social good,
- as a way to encourage innovation and learning (Nielsen’s clients use the data they provide in specific ways, but working with Global Pulse has allowed them to find additional uses for their data),
- as a way to strengthen relationships with their partners.
Patrick, who came from Salesforce’s management team, talked about the 1-1-1 model that Salesforce adopted to give back to the community, pledging 1% of its time, equity, and product to nonprofit causes. He also mentioned that SiftData recently formed a partnership with Tumblr for access to its firehose – unlimited access to the fifth most visited site in the US.
Elizabeth, who works as a social scientist at the text-sentiment-analysis startup Crimson Hexagon, described work the company did in conjunction with Global Pulse – drawing on the tweet stream in Indonesia to correlate tweets with food-price inflation.
Robert ended the session on a provocative note: weather forecasting, which we access for free and take for granted today, was actually a massive undertaking by the public and private sectors – spurring innovation in modeling, supercomputing, and large-scale data analysis. “Can you imagine what it would be like to create a socio-economic weather report?” he asked. To do that, “we still need to demonstrate use cases, and find forward thinking, progressive partners.”
With my mind lit up by the amazing potential of data science, I went deeper into the conference to dig into the infrastructure that undergirds big data and its analysis.
Big Data Infrastructure
Eddie Satterly of Splunk delivered the best overview I found at the conference in his talk “Big Data Architectural Patterns”. In it, Eddie traced the evolution of database management systems: from Relational Database Management Systems (RDBMS) to Cassandra to Hadoop. In the beginning was the relational database, which worked in conjunction with an Enterprise Data Warehouse (EDW) and a set of predefined Extract, Transform and Load (ETL) operations to turn data from its sources (user traffic logs, etc.) into operational data that business intelligence tools could sit on top of.
However, engineering the ETL process took many man-hours, and once in place it was expensive to change. This meant that many analyses were constrained by decisions made at the beginning, and could not flexibly adopt new sources of data. System architects began to look for alternatives, exploring the trade-off of giving up parts of a database’s ACID guarantees (Atomicity, Consistency, Isolation, Durability) in exchange for faster write speed or the ability to store unstructured data. An entire movement called NoSQL was born, representative of which was the Cassandra project – a database organized as a ‘ring of nodes’ instead of a central server, which lets users “tune” the desired consistency-versus-speed trade-off.
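The “tunable” part of that trade-off boils down to simple quorum arithmetic, common to Dynamo-style stores like Cassandra: if the number of replicas that must acknowledge a read (R) plus the number that must acknowledge a write (W) exceeds the replication factor (N), the read and write sets overlap, so reads are guaranteed to see the latest write. A minimal sketch of the rule (illustrative code, not Cassandra’s actual API):

```python
# Dynamo-style tunable consistency: a read is guaranteed to see the latest
# acknowledged write whenever the read and write quorums overlap, i.e. R + W > N.
def is_strongly_consistent(n_replicas, write_acks, read_acks):
    return read_acks + write_acks > n_replicas

# With a replication factor of 3:
print(is_strongly_consistent(3, 2, 2))  # QUORUM writes + QUORUM reads -> True
print(is_strongly_consistent(3, 1, 1))  # ONE/ONE favors speed over consistency -> False
```

Lowering R or W buys latency and availability at the cost of possibly stale reads – exactly the dial Cassandra exposes.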
Hadoop enters the picture with another take on the paradigm altogether. Instead of moving the data through the computations in the ETL cycle – often ending up with multiple, slightly divergent versions of the processed data – Hadoop lets one store the data in a permanent place and bring the computation to the data. This is done through two components: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model. The distributed file system allows one to use commodity hardware (instead of the specialized hardware required for distributed computing on relational databases) and solves the problem of server failure at the software level. MapReduce allows one to write queries against data in a distributed architecture and take advantage of parallelization. Combined, they create a new way for companies to collect, store, and query information at a larger scale.
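The MapReduce model itself is small enough to sketch on one machine: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. This toy word count (my own illustration, not Hadoop code – the real framework distributes each phase across the HDFS cluster) shows the shape of it:

```python
from collections import defaultdict

# Map phase: emit a (key, value) pair per word in a record.
def map_phase(record):
    for word in record.split():
        yield (word.lower(), 1)

# Shuffle: group all emitted values by key.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate each key's values independently (hence parallelizable).
def reduce_phase(key, values):
    return (key, sum(values))

records = ["big data is big", "data moves to computation"]
pairs = [kv for r in records for kv in map_phase(r)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"])   # 2
print(counts["data"])  # 2
```

Because each reduce call touches only one key’s values, the work parallelizes naturally across nodes – the property that lets Hadoop bring computation to where the data lives.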
Big Data Visualization
With all of this data in hand, the other tricky question is how to present it to people in an intuitive fashion. Amazingly, Sean Kandel (former graduate student of data viz wizard Jeff Heer and now co-founder of startup Trifacta) was present giving a workshop. The central principle of his presentation on visualization was that “perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records.” Following his advice, instead of presenting the granular data points of the topics he presented, I binned the topics for more efficient representation:
Process: Bin -> Aggregate -> (Smooth) -> Plot
The first step in visualizing large data sets is to choose the appropriate level at which to bin results, followed by calculating aggregate statistics (such as mean and variance). In plotting, it is important to choose the right kind of representation for the dimensionality: 1D plots are still best represented by position or length encoding, whereas 2D plots can use area and color. Area, as it turns out, is better than color for magnitude estimation. Color has other perceptual nuances too, such as the difference between mathematical linearity and perceptual linearity: linearly scaled transparency does not look linear to the human eye; cube-root scaling is much more effective.
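The bin-and-aggregate step can be sketched in a few lines. This is my own minimal illustration on synthetic data (the bin width and the choice of statistics are illustrative, not from the talk): rather than handing all the raw records to the plotting layer, we keep one count and one mean per bin, so the cost of drawing scales with the chosen resolution, not the number of records.

```python
import math
import random

# Bin values into fixed-width buckets and keep only per-bin aggregates.
def bin_aggregate(values, bin_width):
    bins = {}
    for v in values:
        b = math.floor(v / bin_width) * bin_width  # left edge of v's bin
        count, total = bins.get(b, (0, 0.0))
        bins[b] = (count + 1, total + v)
    return {b: {"count": c, "mean": t / c} for b, (c, t) in bins.items()}

random.seed(0)
data = [random.gauss(50, 10) for _ in range(100_000)]  # 100k synthetic records
summary = bin_aggregate(data, bin_width=5)

# The plot now needs only a few dozen bins, not 100,000 points.
print(len(summary) < 100)  # True
```

Smoothing, when needed, is then applied to the per-bin aggregates before plotting, which is far cheaper than smoothing the raw records.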
Components: Select, Navigate and Query
In order to create interactive plots on top of large data, it is necessary to pre-compute aggregates. Drawing on the technique behind Google Maps, Sean showed that pre-aggregation drastically reduces the number of bins one needs to sum across when displaying interactive visualizations. As an eye-opening example, if the data is represented as data cubes, a 5-dimensional cube might have 2.3 billion bins, while 13 pre-computed 3D cubes would have only 17.6 million.
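The arithmetic behind that saving is simple: a full cube’s bin count is the *product* of the bin counts along every dimension, so each added dimension multiplies the total, while a set of low-dimensional projection cubes only grows additively. A sketch with illustrative numbers – 75 bins per dimension, chosen so the 5-D total lands near the 2.3 billion figure; these are not the exact cubes from Sean’s example, which used different per-dimension resolutions:

```python
from itertools import combinations

BINS_PER_DIM = 75  # illustrative resolution per dimension

# One full 5-D cube: bins multiply across all five dimensions.
full_cube = BINS_PER_DIM ** 5
print(f"{full_cube:,}")  # 2,373,046,875 (~2.3 billion)

# All 3-D projections of 5 dimensions: C(5,3) = 10 cubes of 75^3 bins each.
n_projections = len(list(combinations(range(5), 3)))
projected = n_projections * BINS_PER_DIM ** 3
print(f"{projected:,}")  # 4,218,750 -- orders of magnitude smaller
```

Any 3-D view the user can navigate to is answerable from one of the small pre-computed cubes, so interaction never touches the full-resolution product.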
To understand the future of big data, it is useful to remember how new everything was and how fast it all developed. To gain a historical perspective, the last session I attended at STRATA 2013 was a retrospective look at the change in Internet usage over time, presented by Amie Elcan, Principal Architect at CenturyLink. CenturyLink is a landline Internet provider that services 6 million homes, and Amie’s job was to analyze usage patterns and identify ways to increase efficiency.
Having taught as a Teaching Fellow at Singularity University, I was no stranger to exponential curves, but it was still humbling to see how quickly Internet bandwidth usage has skyrocketed over the last 10 years – the growth from 2011 to 2012 equaled all of the previous growth up to 2011. Internet bandwidth consumption is doubling every 24 months, and a relatively small group of commercial content providers – a few ‘Hyper Giants’ – largely dominates it. Video downloads alone account for 70% of Internet traffic.
Amie noted that as the amount of information measured grows, it is no longer sufficient to measure just the volume of bits transferred – to get a better sense of Internet consumption, we need more sophisticated tools to measure the data.
Drawing this back into the bigger theme of STRATA, the arena of big data is as diverse and multi-faceted as the data we are trying to collect. As our world becomes increasingly quantified, we need to incubate new use cases. The rapid development of Big Data and the Hadoop infrastructure is a start – now that we can reliably collect, store, and interact with information at scale, we need to start asking ourselves the harder questions: What should we measure? How? What will we do with this new information? Going back to Ken Rudin’s keynote – big data is ultimately not about the infrastructure, but about the questions we can answer and the new applications we can build. The road ahead is bright for data, as we move from just considering its volume to considering its use. Here comes data science!
You can find the presentations from the speakers here:
Ken Rudin – http://www.youtube.com/watch?v=RJFwsZwTBgg
Michael Chui – http://www.youtube.com/watch?v=iJofKo4PO38