First mentioned on Edge.org, and now percolating through the information research ecosystem, is a new field called “Computational Social Science.” Another way I would describe it is “Experimental Social Science.” What does that mean? Well, with the advent of the information age, and the rich data collection facilitated through digital interfaces, we now have enough data to tackle previously intractable or unmeasurable social phenomena. Instead of relying on our intuition or a limited set of personal experiences, we can now ask and experimentally answer questions about how human beings behave at large scale – a tremendously exciting capability with some counter-intuitive findings, as you will see below.
Today, I am going to review the first part of a paper co-authored by Cameron Marlow, who founded and used to lead the Data Science team at Facebook.
His paper, published in PNAS, is called Structural Diversity in Social Contagion.
In particular, I am going to review the part concerned with the first experiment in the paper, which we will call the Recruitment Question:
“How does one’s contact neighborhood impact the decision to respond to an invitation to join FB?”
For the sake of the readers, I’m going to give you the actionable punchline before going into the analysis:
If you want to convert someone on FB, have 4-5 people who do not know each other, but who all know the target, send them an invitation.
Now, for the analysis….
Briefly, two definitions:
- Contact Neighborhood – the people (nodes) that are connected to a person of interest (target)
- For the Recruitment Question – Contact Neighborhood is derived as the set of people who have all imported the email address of the target
- Structural Diversity – the complexity and number of connected components of the neighborhood. For the Recruitment Question, we are going to focus on the number of connected components.
- The number of connected components is highly correlated with the acceptance rate of the invitation: the higher the number of separate connected components with ties to the target, the higher the rate of acceptance. This is a very important graph, so pay attention! Note that as the number of connected components increases, recruitment success spikes up tremendously!
- The size of the contact neighborhood, after controlling for the number of connected components, is actually negatively correlated with recruitment success.
- Controlling for demographics by restricting to homogeneous neighborhoods maintains this trend, so it is not the ‘cultural diversity’ of the contact neighborhood, but the ‘connected component diversity’ of the neighborhood that is responsible for this type of conversion.
- Using co-tagging of pictures to infer connections finds further confirmation of this:
i. If two disconnected nodes have been co-tagged in a picture, the recruitment success is about the same as if the two nodes are connected
ii. If two connected nodes are co-tagged, (an indicator of increased connection strength) then the recruitment success drops even further.
Take Away Lesson:
In studying recruitment, it is not the number or size of the contact neighborhood, but the number of diverse endorsements (i.e., the number of connected components) that is critical for recruitment conversion.
What it means practically:
If you want something to be adopted by a user, instead of relying on reaching as many people as possible, reach as many different types/groups of people as possible. In this example, the connected component count is actually a proxy for the underlying topology of distinct social groups, each connected internally but not to the others.
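To make the ‘connected component’ idea concrete, here is a minimal sketch (plain Python, with invented example names) that counts the distinct social groups among a target’s inviters. The edges are friendships within the neighborhood; the target is excluded, since by definition it connects to everyone.

```python
from collections import defaultdict, deque

def connected_components(nodes, edges):
    """Count connected components among the target's contacts."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), 0
    for node in nodes:
        if node in seen:
            continue
        components += 1          # found a new social group
        queue = deque([node])
        seen.add(node)
        while queue:             # BFS to absorb the whole group
            for nxt in adj[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return components

# Four inviters: Alice and Bob know each other; Carol and Dan know no one else.
print(connected_components(
    ["alice", "bob", "carol", "dan"],
    [("alice", "bob")]))  # 3 distinct social groups endorsing the target
```

Per the paper’s result, the invitation from this neighborhood (3 components) should convert better than one from four mutual friends (1 component).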
Food for thought:
Q: How does this generalize for other recruitment processes that are not Facebook?
Full Paper can be found here: http://cameronmarlow.com/media/ugander-structural-2012a.pdf
Like this? Leave a comment/like if you want me to write Pt2, which focuses on user engagement.
A very clear look at the current banking system, and how bitcoin fits into the model.
Twitter went mad last week because somebody had transferred almost $150m in a single Bitcoin transaction. This tweet was typical:
There was much comment about how expensive or difficult this would have been in the regular banking system – and this could well be true. But it also highlighted another point: in my experience, almost nobody actually understands how payment systems work. That is: if you “wire” funds to a supplier or “make a payment” to a friend, how does the money get from your account to theirs?
In this article, I hope to change this situation by giving a very simple, but hopefully not oversimplified, survey of the landscape.
First, let’s establish some common ground
Perhaps the most important thing we need to realise about bank deposits is that they are liabilities…
STRATA is massive. O’Reilly’s oversold Big Data 2013 conference had a record 3,000 people attending – suggesting that big data was finally here. Like big data itself, the sheer number of workshops offered was staggering and intimidating: 100 sessions across two days, and that’s not even counting Monday’s tutorials. It’s impossible to see it all, but, applying the mindset of data science, one way to deal with complexity is to use narrative as a form of interpretation.
I will share the things that stood out to me from the conference, and sacrifice comprehensiveness for story-telling. To get it in its entirety, you’ll just have to go next time!
The conference began on Tuesday morning, opened by Edd Dumbill and Alistair Croll. In a lightning round of 5-minute talks, 10 speakers from industry gave their different takes and announcements for technology.
Each announcement shined a different light on how companies were approaching the big data space, and how it is evolving. Here are three highlights:
Data Infrastructure Vendor – Moving to a central role in enterprise:
Mike Olson of Cloudera talked about the release of Cloudera 5, and the new conception of Cloudera’s offering as an Enterprise Data Hub. Behind the market positioning was the implication that Hadoop was maturing as a technology, and that mere storage was no longer enough. As Andrew Brust wrote for ZDNet following the keynote:
Here’s the gist: Cloudera is now referring to its Hadoop offering as an “Enterprise Data Hub.” In doing so, the company is staking a claim that Hadoop isn’t just a data scientist sandbox anymore.
In fact, and in the spirit of NYC DataWeek, we might characterize Cloudera’s take as Hadoop being the Ellis Island of data, an intake center if you will, where said data can be cleansed, shaped, aggregated, queried, indexed and searched, before heading elsewhere. And in some companies, Hadoop isn’t just Ellis Island, it’s Manhattan — where data comes to reside, and get monetized.
Big Data End User – Moving beyond infrastructure through analytics to impact:
Ken Rudin, head of FB Analytics, further hammered this theme home when he took the stage, saying that the end goal of data analysis was business relevance, and not just data storage. He went on to talk about infrastructure choices being decided by use case: FB actually started from Hadoop but adopted SQL to address different types of query needs.
On the subject of organization, he said that FB adopts a mixed model, where at least one analyst sits with each product/engineering team to help make product decisions. This ensures that Analytics is able to meet its end goal, which is not just to do reactive reporting or provide insight – it is to make an impact: move a metric, improve a product, or change user behavior.
It was striking how data-driven and data-literate FB aimed to be – every employee goes through a two-week course of DataCamp. And what does FB look for in an analyst? 50% statistics/analysis, 50% business savvy.
For a short 5 minutes, Ken’s talk packed a punch and drew one of the highest spikes in Twitter response.
Untapped reservoirs – Moving towards new sources of big data:
Following the metaphor of Hadoop and NoSQL technologies as ‘data lakes’, an exciting new area of development is the source of their streams. For FB and many other adopters of big data technologies, the incoming stream is mostly events and web logs – records of how people interacted with the web product. As big data becomes more widely adopted by other industries and sectors, the incoming information is changing to other sources as well.
Championing the potential behind this change was Michael Chui, Principal at the McKinsey Global Institute. He concluded Tuesday’s keynote session with a thoughtful look at how Open Data (machine-readable data) – especially from the government – could generate up to 3 trillion dollars of value across just 7 sectors. Applying terminology from economics, Michael talked about the “liquidity” of data, as controlled by factors such as accessibility, machine-readability, cost, and rights of re-usage. “The more eye-balls, the bigger the multiplier effect on data usage.” In this vision of the future, the government has a substantial role to play – both in making more data available to the public, and in setting the rules for how sharing will be done.
Intrigued by the implications of big data, I followed the train of thought to the session led by Robert Kirkpatrick, Director of UN Global Pulse, on Data Philanthropy: Private Public Sector Big Data Relationships. Started out of the executive office of the UN Secretary-General, UN Global Pulse is an initiative to take advantage of digital data sources and real-time analytics to predict emerging crises and understand human well-being around the world. They have done several projects so far, ranging from “Digital Signals and Access to Finance in Kenya” to “Understanding Women and Employment in Indonesia,” and are in the midst of setting up Pulse Labs in different regions around the world.
Also on the panel were Mark Leiter, Chief Strategy Officer at Nielsen; Patrick Morrissey, VP of Marketing at SiftData; and Elizabeth Breese, PhD, a social scientist at Crimson Hexagon. The session was a showcase of the motivation behind data philanthropy and the different use cases it might be applied to.
As Mark described, Nielsen is an information company that collects data about consumer behavior from 100 countries – what people are watching (TV) and what they are buying. Under a program called Nielsen Cares, the company gives away millions of dollars’ worth of data every year, for several reasons:
- in service of social good,
- as a way to encourage innovation and learning (Nielsen’s clients use the data they provide in specific ways, but working with Global Pulse has allowed them to find additional uses for their data),
- as a way to strengthen relationships with their partners.
Patrick, who came from management at Salesforce, talked about the 1-1-1 model that Salesforce adopted to give back to the community, pledging 1% of time, equity, and product for non-profit causes. He also mentioned that SiftData recently formed a partnership for access to Tumblr’s firehose – unlimited access to the fifth most-visited site in the US.
Elizabeth, who works as a social scientist at the text-sentiment analysis startup Crimson Hexagon, described a project the company did in conjunction with Global Pulse – drawing upon the tweetstream in Indonesia to correlate tweets with food-price inflation.
Robert ended the session on the provocative note that weather forecasting, which today we access for free and take for granted, was actually a massive undertaking by the public and private sectors – spurring innovation in modeling, super-computing, and large-scale data analysis. “Can you imagine what it would be like to create a socio-economic weather report?” he asked. To do that, “we still need to demonstrate use cases, and find forward thinking, progressive partners.”
With my mind lit up by the amazing potential of data science, I headed deeper into the conference to dig into the infrastructure that undergirds big data and its analysis.
Big Data Infrastructure
Eddie Satterly of Splunk delivered the best overview I found at the conference in his talk “Big Data Architectural Patterns”. In it, Eddie traced the evolution of database management systems: from Relational Database Management Systems (RDBMS) to Cassandra to Hadoop. In the beginning was the relational database – which worked in conjunction with an Enterprise Data Warehouse (EDW) and a set of predefined Extract, Transform and Load (ETL) operations to turn data from its sources (user traffic logs, etc.) into operational data that business intelligence tools could sit on top of.
However, engineering the ETL process took many man-hours, and once in place it was expensive to change. This meant that many analyses were constrained by decisions made at the beginning, and were not flexible enough to adopt new sources of data. System architects began to look for alternatives, exploring the trade-off of giving up parts of the database’s ACID guarantees (Atomicity, Consistency, Isolation, Durability) in exchange for faster write speed or the ability to store unstructured data. An entire movement called NoSQL was born; representative of it was the Cassandra project – a database organized as a ‘ring of nodes’ instead of a central server, which allows users to “tune” the desired consistency-versus-speed trade-off.
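The “tunable” part follows the classic Dynamo-style quorum rule (which Cassandra exposes through its consistency levels, though the API names here are not Cassandra’s): with N replicas, a read quorum R and a write quorum W must overlap in at least one replica – i.e. R + W > N – for a read to be guaranteed to see the latest write. A tiny sketch of the rule:

```python
def is_strongly_consistent(n_replicas, read_quorum, write_quorum):
    """Dynamo-style quorum rule: the read set and write set must
    overlap in at least one replica (R + W > N), so every read
    touches at least one replica holding the latest write."""
    return read_quorum + write_quorum > n_replicas

# With N=3 replicas: quorum reads + quorum writes (2 + 2) overlap,
# while ONE + ONE (1 + 1) trades consistency away for speed.
print(is_strongly_consistent(3, 2, 2))  # True
print(is_strongly_consistent(3, 1, 1))  # False
```

Lowering R and W buys latency and availability; raising them buys consistency – that is the dial Cassandra hands its users.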
Hadoop enters the picture with another take on the paradigm altogether. Instead of moving the data through the computations in the ETL cycle – often ending up with multiple versions of the processed data, each differing from the others by a small amount – Hadoop allows one to store the data in a permanent place, and then bring the computation to the data. This is done through two components: the Hadoop Distributed File System (HDFS) and the Hadoop MapReduce programming model. The distributed file system lets one use commodity hardware (instead of the specialized hardware required for distributed computing on relational databases) and solves the problem of server failure at the software level. MapReduce lets one write queries against data in a distributed architecture and take advantage of parallelization. Combined, Hadoop creates a new way for companies to collect, store, and query information at a larger scale.
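The MapReduce model itself fits in a few lines. Here is a single-machine sketch of the classic word-count job (in real Hadoop, the shuffle step is performed by the framework across many nodes, and the map and reduce phases run in parallel):

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (key, value) pair for every word
    for word in document.split():
        yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key (the framework does this in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: fold each key's values into a single result
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data big ideas", "big data pipelines"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 3, 'data': 2, 'ideas': 1, 'pipelines': 1}
```

Because each mapper reads only its local chunk of the file and each reducer handles only its own keys, the same three functions scale out across a cluster without changing the logic.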
Big Data Visualization
With all of this data in hand, the other tricky question is how to present it to people in an intuitive fashion. Amazingly, Sean Kandel (former graduate student of data viz wizard Jeff Heer, and now co-founder of the startup Trifacta) was present, giving a workshop. The central principle of his presentation on visualization was that “perceptual and interactive scalability should be limited by the chosen resolution of the visualized data, not the number of records.” Following his advice, instead of presenting every granular point from his talk, I binned the topics for more efficient representation:
Process: Bin -> Aggregate -> (Smooth) -> Plot
The first step in visualizing large data sets is to choose the appropriate level at which to bin results, followed by calculation of aggregate statistics (such as mean and variance). In plotting, it is important to choose the right kind of representation depending on the dimension: 1D plots are still best represented by position or length encoding, whereas 2D plots can use area and color. Area, as it turns out, is better than color for magnitude estimation. Color has other perceptual nuances too, such as the difference between mathematical linearity and perceptual linearity. Surprisingly, linear scaling of transparency is not perceptually linear to the human eye; cube-root scaling is much more effective.
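A minimal sketch of the Bin -> Aggregate -> Plot pipeline on synthetic data (not from Sean’s talk): 100,000 records collapse into a few dozen bins before anything is drawn, so the rendering cost depends on the chosen resolution, not the record count.

```python
import math
import random

random.seed(0)
values = [random.gauss(50, 15) for _ in range(100_000)]  # synthetic records

# Bin: map each record to a bucket at the chosen resolution
bin_width = 5
counts = {}
for v in values:
    b = math.floor(v / bin_width) * bin_width
    counts[b] = counts.get(b, 0) + 1

# Aggregate + Plot: the display now depends on ~30 bins, not 100k points
for b in sorted(counts):
    bar = "#" * (counts[b] * 200 // len(values))  # crude text histogram
    print(f"{b:4d} {bar}")
```

The same idea extends to 2D (heatmap bins) and to the smoothing step, which just averages neighboring bins before plotting.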
Components: Select, Navigate and Query
In order to create interactive plots on top of large data, it is necessary to pre-compute aggregates. Drawing upon the technique behind Google Maps, Sean showed that pre-aggregation drastically reduces the number of bins one needs to sum across when displaying interactive visualizations. As an eye-opening example, if data is represented in the form of data cubes, a 5-dimensional cube might have 2.3 billion bins, while 13 pre-computed 3D cubes have only 17.6 million bins.
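The arithmetic behind the savings is simple: a full cube’s bin count is the product of every dimension’s bin count, so projecting down to low-dimensional cubes shrinks it by orders of magnitude. A sketch with purely hypothetical per-dimension bin counts (not the numbers from the talk):

```python
from math import prod
from itertools import combinations

# Hypothetical bin counts for 5 dimensions (illustrative only)
bins = {"lat": 300, "lon": 300, "month": 12, "hour": 24, "category": 20}

# Full 5-D cube: one bin for every 5-way combination
full_cube = prod(bins.values())

# All C(5,3) = 10 pre-computed 3-D cubes combined
three_d_cubes = sum(prod(bins[d] for d in combo)
                    for combo in combinations(bins, 3))

print(f"full 5-D cube:     {full_cube:,} bins")       # 518,400,000
print(f"all ten 3-D cubes: {three_d_cubes:,} bins")   # 5,650,560
```

Even keeping every possible 3D projection costs roughly 1% of the full cube here, which is why interactive tools pre-compute the low-dimensional cubes a user can actually view.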
To understand the future of big data, it is useful to remember how new everything was and how fast it all developed. To gain a historical perspective, the last session I attended at STRATA 2013 was a retrospective look at the change in Internet usage over time, presented by Amie Elcan, Principal Architect at CenturyLink. CenturyLink is a landline Internet provider that services 6 million homes, and Amie’s job was to analyze usage patterns and identify ways to increase efficiency.
Having taught as a Teaching Fellow at Singularity University, I was no stranger to exponential curves, but it was still humbling to see how quickly Internet bandwidth usage has skyrocketed over the last 10 years – the growth from 2011 to 2012 was equal to all of the previous growth up to 2011. Internet bandwidth consumption is doubling every 24 months, and it is largely dominated by a few ‘Hyper Giants’ – a relatively small group of commercial content providers. Video downloads alone account for 70% of Internet traffic.
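That “the latest period’s growth equals all prior growth” pattern is exactly what a doubling series looks like: each period ends at twice the previous total, so the newest increase matches the entire history before it. A quick sanity check with synthetic numbers (not CenturyLink’s data):

```python
# Synthetic doubling series: traffic doubles every period
traffic = [1.0]
for _ in range(10):
    traffic.append(traffic[-1] * 2)

# The newest period's increase equals the whole prior total (2x - x == x)
latest_increase = traffic[-1] - traffic[-2]
print(latest_increase == traffic[-2])  # True
```

So a single year matching all prior growth is not an anomaly; it is the signature of sustained exponential growth.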
Amie notes that as the amount of information measured grows, it is no longer sufficient to measure just the volume of bits transferred; to get a better sense of Internet consumption, we need more sophisticated tools to measure the data.
Drawing this back to the bigger theme of STRATA: the arena of big data is as diverse and multi-faceted as the data we are trying to collect. As our world becomes increasingly quantified, we need to incubate new use cases. The rapid development of big data and the Hadoop infrastructure is a start – now that we can reliably collect, store, and interact with information at scale, we need to start asking ourselves the harder questions: What should we measure? How? What will we do with this new information? Going back to Ken Rudin’s keynote – big data is ultimately not about the infrastructure, but about the questions we can answer and the new applications we can build. The road ahead is bright for data, as we move from considering just its volume to its use. Here comes data science!
You can find the presentations from the speakers here:
Ken Rudin – http://www.youtube.com/watch?v=RJFwsZwTBgg
Michael Chui – http://www.youtube.com/watch?v=iJofKo4PO38
Carl and I had a chance to sit in on an hour of Q&A with DJ Patil at Greylock recently, alongside the Princeton TigerTrek. Beneath the mantle of Data Science fame, I was surprised and pleased to find a very authentic person – a huge practitioner of Carol Dweck‘s growth mindset, who is open about his own struggles and, despite all of his successes, remains humble. At the same time, there is also a fierce streak of determination – and the belief in change that accompanies all great entrepreneurs. A lot of what he said resonated with me, and I hope, with you:
Start with needs that you identify at a gut level. Use data to refine those feelings into concrete product decisions.
Even the term “Data Scientist” itself was born out of a need and decided by data. DJ and Jeff Hammerbacher shared a mutual problem: HR was asking for a common term to describe the people they wanted to hire, as they respectively built out LinkedIn’s and Facebook’s data science teams. After some thinking and clever A/B testing, they found that the term ‘data scientist’ attracted the right profile of people – those with strong analytical skills.
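An A/B test like that boils down to comparing two response rates. Here is a sketch using a standard two-proportion z-test with made-up numbers (the actual titles tested and their data are not public):

```python
import math

def two_proportion_z(hits_a, n_a, hits_b, n_b):
    """z-statistic for comparing two response rates (pooled variance)."""
    p_a, p_b = hits_a / n_a, hits_b / n_b
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical: which job title draws more applicants per posting view?
z = two_proportion_z(hits_a=180, n_a=5000,   # title variant A
                     hits_b=120, n_b=5000)   # title variant B
print(f"z = {z:.2f}")  # |z| > 1.96 means the gap is unlikely to be chance
```

With numbers like these, variant A’s higher rate clears the usual 5% significance bar, which is the kind of evidence that would settle a naming debate with data rather than opinion.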
Ask for Help.
This is something that I struggle with. Often I want to hide away and study it on my own, but I’ve come to appreciate that often the best way to learn is from other people. DJ talks about not being afraid of asking for help – “the worst thing that someone can say is: ‘You should know this by now?’ – to which one can just respond: ‘That’s why I’m asking now!’” It takes courage and humility to say: “I’m struggling with X.”
Reality is always malleable, but some parts are more malleable than others. Place yourself in areas of ambiguity and chaos.
That is where you have the most leeway in defining reality. Like the title of Amy Jo Martin’s book: “Renegades Write the Rules.” Axioms:
There are no such things as rules, just guidelines.
There are no such things as no, just yes’s that have not been said.
////////////////////// Some snippets of thought:
On Finding Direction:
When starting a company – ask the smartest 100 people around you – “what do you struggle with?”
Establish a goal. Pursue the goal. Be relentless in the pursuit of the goal.
Use the inverse function: decide where you want to be in 20 years, and work backwards from there. E.g., if you want to be CEO of a Fortune 10 company, you want to be X in 15 years. … This allows you to look for unconventional roles that might get you closer to your goal, instead of directionless prestige projection-pursuiting.
If you embark on a doctorate program, you must finish. Otherwise, you will leave irreversible psychological scars.
On civil service:
“I was raised on government funds, and feel a responsibility to give back to my community.” DJ has a passion for education, and in the long term, wants to go back and reform government.
On Change in Companies:
Surprisingly, DJ says that it was an incredible struggle to get LinkedIn and Facebook to be data-driven. Why was adoption so hard? Inertia and people’s own agendas. Fortunately, DJ had previously spent time working in academic and government agencies, so he was ready to deal with bureaucracy. Where other entrepreneurial types would have thrown up their hands and given up in the face of the red tape and patience required to get buy-in, DJ hunkered down and worked to convince stakeholders. One of his favorite responses when met with rejection was: “Okay, I hear that you are telling me ‘No’. Can I ask you, as a personal favor, if we could just set that aside for 5 minutes, and have a conversation about ‘How’? I am not asking you to commit to anything, just humor me.”
The ability to deal with and resolve conflict is also augmented by a novel definition of the word itself. As DJ says, “Conflict is when someone is in the way of your objective. Most people have the same objective, but tension arises when people are out of alignment.”
On Dealing with Others:
In difficult moments, when someone has wronged you, the best way to respond is to let the other party know the impact their action had on you – and then embrace the silence.
On where to work:
Ask yourself: Is this going to be a place where you are going to learn an amazing amount? Who is your boss, and will they help you learn?
On the best education for a data scientist:
Mathematics (Probability, not necessarily statistics), Computer Science and something in the humanities, like literature or theatre. In order to understand product, you have to have a deep grounding in something human.
The idea is that you are constantly growing the organization so that you can fire yourself.
You are only as good as the people you can hire.
How to attract the best people? Passion. Authentic passion in what you are doing.
When DJ hires a new person, he sits down with them to make a list of expectations. At the top of the list is visibility into their career goals. The best way to keep talent is not to make it hard for them to move, but to show them that you are sincerely interested in their development.
Perhaps we are looking at talent wrong – instead of looking at talent as an innate thing, you can grow talent by growing yourself, and those around you.
Tonight I had a chance to hear Tom and David Kelley of IDEO and d.school fame speak at the Kepler bookstore in Menlo Park. The topic of the night? Something that could not have been more timely – their new book on building and sustaining creative confidence.
I took some notes from the conversation (paraphrased in bold), and made an effort to make the points as concise and actionable as I can in plain text underneath. If any of these speak to you – give your feedback in comments!
1. Everyone can be creative. It is not a genetic gift; it is learned through cultivation and practice. David says: people think that if you come out of the womb and can’t draw, it’s all over for you – whereas if you are born and can’t play the piano off the bat, that’s okay.
If you don’t consider yourself a creative type right now, that’s okay! It’s been beaten out of you by traditional systems and definitions, but don’t worry – it can be restored by the tips below.
2. Practice involves guided mastery – someone more experienced holding you by the hand and showing you step by step. To improve yourself at something – find a mentor to teach you incrementally; human beings are excellent imitators.
3. The kind of productive creativity that David/Tom/IDEO/d.school practice is called Design Thinking. The central idea is to empathize with human beings. Look at what is meaningful for people. This is a huge untapped opportunity – people start new businesses around new technology and new business models, but new businesses built around human needs will change everything.
This is a huge untapped business opportunity. Ask yourself carefully what people need – validate that need – and build it.
4. To understand people – you have to understand their motivations – and the easiest way to do that is to ask: “Why?” normal response… “Why?” deeper response… “Why?” somewhere they have not gone before… “Why?” …
4.5 To get things started – begin by asking: “How can we ….”
It assumes the thing can be done, and that we are in this together – all in the first three words!
5. It’s okay to do the same thing over and over again – that is how you get better.
6. The grade system in schools is a terrible motivation system. It teaches you to be good at predicting what the professor wants. It does not encourage you to fail.
I propose a more productive definition of failure: when things don’t happen the way you want them to because you stopped working on them. That is the only definition of failure that matters. I am so happy to be on leave from school at the moment – Brown has great classes, but being tested in the school of life is a whole other level of awesome.
7. In school you are taught Problem Solving. In the world, you also need Design Thinking – deciding what problems are important to solve, and sometimes even rephrasing the problems that clients give to you.
Again, always ask why. Check to make sure that the problem is actually a problem for the people who will “benefit” from it.
My own takeaways:
1. I need to get into a d.school project/class.
2. Its better to learn things when you need them in service of a bigger project.
3. I need to get my hands dirty more.
Final surprise – on my way back from the talk, I rode into the middle of a film scene! The crew was filming excerpts from James Franco‘s book Palo Alto. I asked, “How can I help?”, and one of the producers introduced me to the head of the Art team. I’ll be waking up at 8 am tomorrow to start making things. Woo!