Of all the datasets that could be delivered to your desk, the most difficult one to work with might be that big dataset. Between its massive size, its exponential growth even as you work on it, and the variety of data types present, big data presents many issues that make it difficult to turn data into action. In this presentation, you will learn how to take thousands of variables and billions of records and turn them into usable and actionable results, just as you would with any traditional research dataset.
10. @PeanutLabsMedia @LoveStats
Which survey topics do
you enjoy answering?
Which candidate for Prime
Minister did you vote for?
How concerned are you about
immigration policies?
Which survey lengths generate
the lowest data quality?
Which products do high
value customers purchase?
Which health issues are associated
with the lowest income households?
57% of people prefer Star
Wars to Star Trek
N=394, MOE=5%, http://www.correlated.org/724
But among those who like
ketchup on eggs, 58% prefer
Star Trek
Thank you!
Annie Pettit
Chief Research Officer
annie@peanutlabs.com
ca.linkedin.com/in/AnniePettit/
facebook.com/AnniePettit
twitter.com/LoveStats
Jonathan Cheriff
Director of Sales & Marketing
jonathan.cheriff@peanutlabs.com
Find PeanutLabs on
LinkedIn Facebook Twitter YouTube
DIY SURVEY PROGRAMMING · DIY POLLING · DIY SAMPLE
Editor’s Notes
Every one of us has big data back at the office. For me, it’s terabytes of research panel data, demographics, survey clicks and completes, data quality flags. For you, it might be eye movements or purchase data or website trackers. Whatever your big data is, many of us have the same problem: how to take terabytes of data just sitting around and make it actionable. Fortunately, I’m going to give you some basic tips based on my own personal experience. Feel my pain!
The most important thing you need to make big data actionable is a captain who believes in its potential. Your captain needs to make sure her team knows that big data is important and that it can make the ship run better. She also needs to make space at the table: create opportunities for big data to shine while the team is evaluating other sources of data. Your captain also needs to ensure that the science officers get the time they require to do the work. It’s nice to say you think big data is cool, but if you’re only going to give someone an hour to work on it, then you’re not really leading. Just like running a survey project, big data takes weeks of hard work. And big projects take years.
The next big step is to get permission from the engineers to access all the relevant data, not just a tiny, instantly out-of-date, convenient-to-download subset. The engineers simply can’t let anyone wander through variables containing email addresses, phone numbers, or health and illness information, or they’d be fired. They need to take appropriate steps to ensure that only those people who absolutely need to access pieces of data can access them. And the easiest way to do that is to just give you a little piece to play with offline. Also, the science officers must have a place on the server where they can play. It needs to be a safe place where there are no concerns about deleting essential data. It needs to be a sandbox where the scientists can read and write data, and create complicated new tables and new variables. Simply seeing data isn’t enough. They need to manipulate the data.
Then, once the captain has given the science officers permission to take the time to do the work, and the engineers have created a safe sandbox where there is no fear of taking down the company because someone used the SQL DROP command by mistake, the science officers can get to work and analyze the data. And that’s the fun part.
That’s where the science officers discover things like this. Does anyone know what this number is? GIVEAWAY
You see, data by itself is never insightful, not survey data, not focus group data. Cool numbers don’t just appear out of data. A few people here happened to know why 47 is an interesting number because they had background information that no one else had. Seems kind of unfair, doesn’t it? Well, it’s not. It’s a simple problem.
The number 47 recurs in dialogue and on computer screens throughout Star Trek, a running joke traced to writer Joe Menosky, along with a tongue-in-cheek mathematical “proof” that all numbers are equal to 47 and that 47 occurs with greater frequency than other numbers.
Here is the problem. This is how we run every survey and focus group. We spend time carefully identifying the problem we’re experiencing. We talk about the history of Star Trek, who was involved with it, what they contributed to it. We discuss the kinds of questions we want to ask, the kinds of answers we’re hoping to find. Based on all of that, we create a measurement tool. Maybe a questionnaire, maybe a discussion guide. Then, we decide how we want to collect data and who we’ll collect it from. Then, finally, weeks or months after first identifying the problem, we analyze the data. In the survey world, we never jump instantly into analyzing data without going through the hoops first. We don’t just KNOW what 47 means. So let’s talk about how these stages happen with big data.
First of all, we can’t just lay down a gauntlet of data. The captain can’t say, here’s the data, find something cool. We need to focus on a problem, one problem. We need to focus on the worst problem, the one data question, the one KPI that really matters to our business. We need to focus. With the captain, you need to choose one KPI. Think about it carefully. Figure out what matters, what’s relevant. You can’t start with data. You have to start with a problem.
Then, choose an actionable goal. It needs to be measurable, definable. For many of us looking at research panel databases, one of those problems might be improving survey satisfaction. That seems like a good goal. But it’s not. There’s nothing actionable or measurable in it. HOW do you know if you’ve improved satisfaction?
That’s what really matters. You need goals that can be measured using the data you have at hand. How much did data quality scores increase? How much did screenouts drop? You need to create a very specific goal.
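To make this concrete, a goal like “improve satisfaction” can be restated as metrics you can actually compute from the data at hand. A minimal sketch, assuming hypothetical field names (`status`, `dq_score`) that stand in for whatever your own panel database records:

```python
# Hypothetical respondent records; field names and values are illustrative only.
before = [
    {"status": "complete", "dq_score": 0.90},
    {"status": "screenout", "dq_score": None},
    {"status": "complete", "dq_score": 0.70},
    {"status": "screenout", "dq_score": None},
]
after = [
    {"status": "complete", "dq_score": 0.95},
    {"status": "complete", "dq_score": 0.85},
    {"status": "complete", "dq_score": 0.90},
    {"status": "screenout", "dq_score": None},
]

def kpis(records):
    """Two measurable goals: mean data-quality score and screenout rate."""
    scores = [r["dq_score"] for r in records if r["dq_score"] is not None]
    return {
        "mean_dq": sum(scores) / len(scores),
        "screenout_rate": sum(r["status"] == "screenout" for r in records) / len(records),
    }
```

Comparing `kpis(before)` to `kpis(after)` answers the “how do you know” question with numbers rather than a feeling.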
Now think back again to survey research. You took a lot of time coming up with the questions to put on your survey. You were careful about which words you chose and what the possible answers should be. You spent hours, maybe days or weeks, fine-tuning every question to capture answers that would be as valid as possible. And while you fine-tuned those questions, you argued with yourself and your colleagues and your clients about all the possible outcomes. You argued about bias and skews. You refined your goals and hypotheses along the way. You spent hours thinking carefully about all aspects of the plan.
You really need to do the same with big data. You need to spend the hours and days thinking about possible questions to ask, problems that might arise, hypotheses to evaluate and change. You need to think about what kinds of questions can be asked of the data you have and what kinds of answers won’t come out of that data. Spend the same amount of time thinking about your big data questions as you would writing your survey questions. So let’s say we’ve finished asking all those questions. What’s next?
Well, if we go back to survey data, we run a whole lot of little tests. What are the maximum values for every variable? What are the minimums? What are the outliers? What are the 5th and 95th percentiles? Is it normal? Is it full of numbers even though it’s labeled text? We take the time to understand the ins and outs of every single variable.
So we need to do the same thing with big data. The only problem here is that you’ve got thousands of variables to work with. And chances are almost none of them have names that mean anything. These are variable names from some of the SQL tables I work with. I don’t even know what all of them are. Your science officers must be given the necessary time to review all the potentially relevant variables, figure out what they mean, and decide which ones are relevant. Then, they have to do all the standard basic analyses. What are the maximums? What are the minimums? What are the outliers? What are the 5th and 95th percentiles? Is it normal? Is it text or numeric? And then, only then, this is what you find.
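Those basic checks can be scripted once and run over every variable. A minimal sketch, using a made-up variable name and made-up values; the point is that a single pass surfaces the outlier, the stray text, and the missing value all at once:

```python
import statistics

def profile(name, values):
    """Basic sanity checks for one variable: range, tails, type, missingness."""
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    cuts = statistics.quantiles(numeric, n=20)  # 5%, 10%, ..., 95% cut points
    return {
        "name": name,
        "n_missing": len(values) - len(present),
        "all_numeric": len(numeric) == len(present),  # False -> text snuck in
        "min": min(numeric),
        "max": max(numeric),
        "p05": cuts[0],
        "p95": cuts[-1],
    }

# Hypothetical column from an unlabeled SQL table; values are invented.
report = profile("var_00417", [3, 7, None, 2, "n/a", 9, 5, 4, 8, 6, 1, 300])
```

Here the profile immediately flags one missing value, one non-numeric entry, and a maximum of 300 sitting far above the 95th percentile.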
You find a whole lot of data that you didn’t want to find. You find mistakes, miscodes, blanks, and more. But this is okay. Because as captains of your ship, as leaders in your company, you have given your science officers the time they must have to deal with data quality. You know there will be errors in the data, and you need to give your science officers the time to find those errors and figure out how to deal with them. This is not bad data. This is data. You find mistakes in your survey data all the time, so this is not the time to cancel all big data projects because you found a mistake.
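What “dealing with data quality” looks like in practice can be sketched as an audit pass: count the error types (blanks, miscodes, duplicates, impossible values) before deciding how to treat them. All field names and allowed codes here are invented for illustration:

```python
# Hypothetical cleaning pass; allowed codes and fields are made up.
ALLOWED_GENDER = {"m", "f", "other", "prefer_not"}

rows = [
    {"id": 1, "gender": "m",  "age": 34},
    {"id": 2, "gender": "",   "age": 29},   # blank
    {"id": 3, "gender": "99", "age": 41},   # miscode
    {"id": 3, "gender": "f",  "age": 41},   # duplicate id
    {"id": 4, "gender": "f",  "age": -1},   # impossible value
]

def audit(rows):
    """Tally error types rather than silently discarding rows."""
    seen = set()
    issues = {"blank": 0, "miscode": 0, "duplicate": 0, "range": 0}
    for r in rows:
        if r["id"] in seen:
            issues["duplicate"] += 1
        seen.add(r["id"])
        if r["gender"] == "":
            issues["blank"] += 1
        elif r["gender"] not in ALLOWED_GENDER:
            issues["miscode"] += 1
        if not (0 <= r["age"] <= 120):
            issues["range"] += 1
    return issues
```

Counting first, fixing second, keeps the decision (impute, recode, drop) explicit instead of accidental.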
Stand up if you like ketchup on your eggs. Sit down if you don’t like Star Trek. GIVEAWAY
So now your science officers understand what the variables are and what the data looks like. They’re charting this and that and everything. Correlating every variable with every other variable. Making 3D pie charts and giant tables. They’re charting everything because they’re curious. And you know, along the way they’re going to learn things like this. They’re going to come across strange and weird correlations, and you’re going to get excited about those correlations and want to act on them right away. Maybe even publish some of them.
Liking ketchup on eggs causes you to like Star Trek
But all along, you must remember the science. Any good science officer knows that correlation does not mean causation. Liking ketchup on eggs does not mean you like Star Trek. A good science officer is smart enough to consider Type I and Type II errors. They know that a correlation in a tiny subset of 100 people is probably not meaningful. They know to think about main effects and interactions. There is so much on the mind of a science officer while they are working that it pretty much feels like this.
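The 100-person caution can be made concrete with a simulation sketch: correlate a purely random target against 1,000 equally random variables, and at a 0.05 significance threshold, roughly 5% of them will look “significant” by chance alone. Everything here is simulated noise, no real panel data involved:

```python
import math
import random

random.seed(42)
n, k = 100, 1000   # 100 respondents, 1000 unrelated variables

def pearson_r(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

likes_trek = [random.gauss(0, 1) for _ in range(n)]   # target is pure noise
false_hits = 0
for _ in range(k):
    v = [random.gauss(0, 1) for _ in range(n)]        # so is every predictor
    r = pearson_r(v, likes_trek)
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
    if abs(t) > 1.984:   # two-tailed critical t for df = 98, alpha = 0.05
        false_hits += 1
# Expect roughly 5% of k, i.e. around 50 "discoveries" from pure noise.
```

This is exactly why a strange correlation found while “charting everything” deserves suspicion before it deserves a press release.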
It feels like shooting at everything. And really, during the exploration phase that has to happen with every big data project, this is what your science officers are doing. Shooting at everything. Seeing what sticks. Which incentives work better, and where? Which data quality processes work better? Which demographic groups respond better to which types of surveys? And you know what, this is all okay. But it needs to be in context. You can’t do this forever. At some point…
This is probably my biggest problem on the way to getting actionable data. I get so engrossed with thousands of variables and hundreds of tables and terabytes of data that I don’t realize it’s Thursday and my report is due tomorrow. As with any research project, you must draw a line. A deadline. Results must be timely. You have to budget a specific number of hours to play with the data and a specific time when the playing is finished and you will answer the specific questions at hand. Of course, at the time, that seems impossible. That’s why you were shooting everywhere. But if you step back for a moment, think about all the shots you took, and then focus on the few that were actually promising, then you can answer the question.
And don’t worry if you didn’t find an insight. Insight truly is a rare gem that appears extremely infrequently. You’re lucky to find a single insight in a project, and chances are you’ll only find one every few projects. That does not make big data a failure. But once you’ve drawn the line and come up with the results…
The next step is to draw solid conclusions. Come up with solid decisions. Come up with hypotheses that you can act on in the real world and evaluate in the data. Run a test-control experiment and see if you can replicate the finding. A single stunning finding is what gets published in journals. But if all your replications fail, then you know that single stunning finding was just a Type I or Type II error. You must replicate, either with holdback samples or by retesting.
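A holdback-sample replication can be sketched in a few lines: split the records, measure the “finding” (here, the Star Trek rate among ketchup lovers) in the exploration half, then check whether it reappears in the untouched holdout. The data below are simulated with no true relationship, so any striking exploration-phase rate should drift back toward 50% in the holdout:

```python
import random

random.seed(7)
n = 2000
# Simulated panel: ketchup preference and Star Trek liking are independent.
ketchup = [random.random() < 0.5 for _ in range(n)]
trek = [random.random() < 0.5 for _ in range(n)]

explore = range(0, n // 2)    # where the "finding" was discovered
holdout = range(n // 2, n)    # untouched data, used only to replicate

def trek_rate_among_ketchup(idx):
    """Share of ketchup-on-eggs fans who also like Star Trek."""
    sel = [i for i in idx if ketchup[i]]
    return sum(trek[i] for i in sel) / len(sel)

rate_explore = trek_rate_among_ketchup(explore)
rate_holdout = trek_rate_among_ketchup(holdout)
# If the exploration-phase rate does not reappear in the holdout,
# the "finding" was likely a Type I error.
```

Holding back half the data costs nothing up front and is the cheapest replication you will ever run.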
Even better, converge data sources. There is more to space than Star Trek. There’s also Star Wars. Maybe even Battlestar Galactica, Stargate SG-1, Firefly, Lost in Space, Buck Rogers. What you see in one show might be relevant for understanding another. The way you understand time travel in one might help you understand the other.
Well, converging data sources helps with big data too. Actionable big data does not mean that it must function perfectly by itself. It can be actionable by informing other pieces of data, by corroborating other methods. When data works together, insights will come more easily. The way you understand survey lengths as recorded in a SQL database might help you understand what is happening in a focus group.
Big data is not a magical methodology that spits out actionable insights to the person in front of it. It needs the same kind of attention that we apply to every methodology. It needs a captain who will promote it and give people the time to do it. It needs science officers who have enough statistical knowledge to pick it apart and understand all the intricacies. And it really needs people to respect the possibilities. So boldly explore that data where no one has explored before.
Thank you
Veil Nebula Supernova Remnant: NASA’s Hubble Space Telescope has unveiled the Veil Nebula, the remnant of a star that exploded about 8,000 years ago.