@mrogati
The Mission: hottest industries
The data: + date joined LinkedIn
The Question: hottest industries. Hotness(X) = year-over-year growth of people in industry X on LinkedIn
The data
The Question: hottest industries. Hotness(X) = year-over-year growth of job starters (replacing "people") in industry X on LinkedIn
Externa-lies
The Question: hottest industries. Hotness(X) = part year-over-part year growth of net job starters in a big enough industry X on LinkedIn
Dirty data, dirty lies
Dirty data, dirty lies: [chart: # profiles by # jobs on LinkedIn profile; hypothetical data]
Dirty data, dirty lies: check flags, categories, dates, …
Norma-lies
The Question: hottest industries. Hotness(X) = part year-over-part year growth of normalized net job starters, minus noise, in a big enough industry X on LinkedIn
Norma-lies
Truth by omission: Internet, Real Estate, Financial Services
… and the data scientist
@mrogati

Lies, damned lies and the data scientist 2011 strata summit

Editor's Notes

  1. Lies, Damned Lies and the Data Scientist. By: Monica Rogati, data scientist at LinkedIn. Data lies, but it lies because we let it. So let’s not let it. Let’s ask the right questions.
  2. I’m going to talk about how to ask the right question by showing you a deceptively simple exercise that LinkedIn data scientists go through. The question is: what are the hottest industries this year, according to the LinkedIn data? There’s one small detail I’m not specifying: the definition of “hot”. That definition plays a major part in asking the right questions.
  3. So let’s take a look at the data. On LinkedIn, we have over 120M people, their industry, and the year they joined.
  4. … so the first attempt at defining “hot” might be: let’s look at the YOY growth of an industry and look at the top 3. That idea is not so hot. At best, it’s only an indicator of LinkedIn’s penetration in an industry; at worst, it’s actually a contrarian indicator, because it shows people who might want to transition OUT of that industry.
  5. … so the first attempt at defining “hot” might be: let’s look at the YOY growth of an industry and look at the top 3. That idea is not so hot. At best, it’s only an indicator of LinkedIn’s penetration in an industry; at worst, it’s actually a contrarian indicator, because it shows people who might want to transition OUT of that industry.
  6. The next piece of data we can look at is the individual positions people list on their profiles – they have a start date and an industry, so you can see what industry people are flowing into in a given year. Much better.
  7. You run the numbers… Wait a second!! Is consulting really the hottest industry? Hmm.. I think the data is trying to lie to us. We need to take into account churn & promotions – and we do that by looking at the NET inflow of people into an industry: people coming IN minus people coming out.
  8. There, that should be much better. The next external factor that might come into play is seasonality. If we’re doing this analysis in the summer, it looks like there are a lot fewer teachers and accountants, and a lot more summer interns, compared to last year! So ideally, we want to compare the same time period in each year to take out seasonal effects.
  9. OK … done, let’s take another look. Are the Mining and Metals and Dairy industries really the hottest industries this year? Or are they just very small industries on LinkedIn, where it’s much easier to grow off of a small base? You can get around this by making separate categories for industries of different sizes, ignoring industries below a certain size, or otherwise accounting for that effect.
  10. Now, we’re done: got seasonality, thresholding, net inflow – this has to be the right question. Well, almost. We assumed the data is clean. And it’s not.
  11. For example, there are a lot of fake accounts that we’ve immediately closed, but they’re still there in the database. If you don’t check for that flag, you have this army of Darth Vaders boosting up the defense and space industry.
  12. Including the tail of a distribution might not make sense – do we want people who have 200 positions listed on their profile? They might throw off your data.
  13. We need to put the data under a microscope and understand what each flag, category and date means. OK, now that we’ve accounted for external factors and taken out the noise, are we ready to see some industry growth charts?!
  14. Hm, OK, we plot the YOY growth and we get something that looks like this: a spaghetti chart that mostly shows industries moving in unison, an effect of the broader economic conditions (see that dip in 2001 and 2009). If we want to actually focus on differences between industries instead of what they have in common, we need to scale or normalize those numbers, for example by dividing the net # of people coming into an industry by the TOTAL number of people who started jobs that year. This also has the nice property that it accounts for website growth.
  15. OK, this MUST be it, right? The data stopped lying and we can actually see some real trends. Wild swings around 2000 for Internet and telecommunications, and there’s definitely something going on w/ real estate there. It still looks like spaghetti, it’s hard to understand and explain, and it’s not exactly telling a story. To tell the story, we need to make some hard decisions and pick only a couple of those lines, clean things up, and let that story shine.
  16. Nice! I’ve picked 3 industries. When the line is above zero, that industry is growing; below zero, it’s shrinking. So the Internet is taking off in ’94, booming in ’99, then there’s a huge dip in 2001. Real Estate is growing steadily, it’s picking up in 2002, and it’s sinking in 2008, and so are financial services. This is all coming from aggregating data on people’s public LinkedIn profiles! This is the kind of story that gets people excited about the insights in the LinkedIn data, but it wouldn’t have been possible if we hadn’t asked the right questions.
  17. So let’s have some fun with the method I’ve just described: let’s take a look at the growth of analytics and data science jobs over the past few years. Whoa! That rapid growth in the past 3 years is even more impressive when we realize that this is all properly normalized, not just the count of people with those titles on LinkedIn.
  18. So next time you look at your data, don’t let it lie to you – account for external factors, take out the noise, and ask the right questions.
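
The steps in notes 7 to 14 (net inflow, same-period comparison, size threshold, cleaning, normalization) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used at LinkedIn: the record layout (`member`, `industry`, `start`, `end`), the function names, the default thresholds, and the choice to measure "growth" as the difference between two normalized net inflows are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import date

# Hypothetical record layout (an assumption, not LinkedIn's schema):
# {"member": str, "industry": str, "start": date, "end": date or None}

def clean(positions, closed_accounts, max_positions=50):
    """Drop positions from closed (fake) accounts and from extreme-tail profiles."""
    per_member = Counter(p["member"] for p in positions)
    return [
        p for p in positions
        if p["member"] not in closed_accounts
        and per_member[p["member"]] <= max_positions
    ]

def in_window(d, year, cutoff):
    """True if d falls between Jan 1 and the (month, day) cutoff of `year`."""
    return d is not None and d.year == year and (d.month, d.day) <= cutoff

def normalized_net_inflow(positions, year, cutoff):
    """Per industry: (starters - leavers) / all job starters in the same window."""
    net = defaultdict(int)
    total_starts = 0
    for p in positions:
        if in_window(p["start"], year, cutoff):
            net[p["industry"]] += 1
            total_starts += 1
        if in_window(p["end"], year, cutoff):
            net[p["industry"]] -= 1
    if total_starts == 0:
        return {}
    return {industry: n / total_starts for industry, n in net.items()}

def hotness(positions, closed_accounts, year,
            cutoff=(6, 30), min_size=3, max_positions=50):
    """Rank big-enough industries by part-year-over-part-year change in
    normalized net inflow (difference used here as an assumed growth measure)."""
    rows = clean(positions, closed_accounts, max_positions)
    size = Counter(p["industry"] for p in rows)  # crude industry size
    this_year = normalized_net_inflow(rows, year, cutoff)
    last_year = normalized_net_inflow(rows, year - 1, cutoff)
    scores = {
        industry: this_year.get(industry, 0.0) - last_year.get(industry, 0.0)
        for industry in size
        if size[industry] >= min_size  # ignore industries growing off a tiny base
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `hotness(positions, closed_accounts, 2011)` returns `(industry, score)` pairs, hottest first. Comparing January through the same cutoff date in both years is one simple way to take out seasonality, and dividing by total job starters is the normalization described in note 14.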