SlideShare uma empresa Scribd logo
1 de 38
HANDLING BIGGER DATA
What to do if your data’s too big
Data nerding
Your 5-7 things
❑ Bigger data
❑ Much bigger data
❑ Much bigger data storage
❑ Bigger data science teams
BIGGER DATA
Or, ‘data that’s a bit too big’
3
First, don’t panic
Computer storage
250Gb Internal hard drive. (hopefully)
permanent storage. The place you’re
storing photos, data etc
16Gb RAM. Temporary
storage. The place
read_csv loads your
dataset into.
2Tb External hard
drive. A handy place
to keep bigger
datafiles.
Gigabytes, Terabytes etc.
Name Size in bytes Contains (roughly)
Byte 1 1 character (‘a’, ‘1’ etc)
Kilobyte 1,000 Half a printed page
Megabyte 1,000,000 1 novella. 5Mb = complete works of Shakespeare
Gigabyte 1,000,000,000 1 high-fidelity symphony recording; 10m of shelved books
Terabyte 1,000,000,000,000 All the x-ray films in a large hospital; 10 = library of
congress collection. 2.6 = Panama Papers leak
Petabyte 1,000,000,000,000,000 2 = all US academic libraries; 10= 1 hour’s output from
SKA telescope
Exabyte 1,000,000,000,000,000,000 5 = all words ever spoken by humans
Zettabyte 1,000,000,000,000,000,000,000
Yottabyte 1,000,000,000,000,000,000,000,000 Current storage capacity of the Internet
Things to Try: Too Big
❑Read data in ‘chunks’
csv_chunks = pandas.read_csv(‘myfile.csv’, chunksize = 10000)
❑ Divide and conquer in your code:
csv_chunks = pandas.read_csv(‘myfile.csv’, skiprows=10000, chunksize = 10000)
❑Use parallel processing
❑ E.g the Dask library
Things to try: Too Slow
❑Use %timeit to find where the speed problems are
❑Use compiled python, (e.g. the Numba library)
❑Use C code (via Cython)
8
MUCH BIGGER DATA
Or, ‘What if it really doesn’t fit?’
9
Volume, Velocity, Variety
Much Faster Datastreams
Twitter firehose:
❑ Firehose averages 6,000 tweets per second
❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan)
❑ Twitter public streams = 1% of Firehose steam
Google index (2013):
❑ 30 trillion unique pages on the internet
❑ Google index = 100 petabytes (100 million gigabytes)
❑ 100 billion web searches a month
❑ Search returned in about ⅛ second
Distributed systems
❑ Store the data on multiple ‘servers’:
❑ Big idea: Distributed file systems
❑ Replicate data (server hardware breaks more often than you think)
❑ Do the processing on multiple servers:
❑ Lots of code does the same thing to different pieces of data
❑ Big idea: Map/Reduce
Parallel Processors
❑Laptop: 4 cores, 16 GB RAM, 256 GB disk
❑Workstation: 24 cores, 1 TB RAM
❑Clusters: as big as you can imagine…
13
Distributed filesystems
Your typical rack server...
Map/Reduce: Crowdsourcing for computers
Distributed Programming Platforms
Hadoop
❑ HDFS: distributed filesystem
❑ MapReduce engine: processing
Spark
❑ In-memory processing
❑ Because moving data around is the biggest bottleneck
Typical (Current) Ecosystem
HDFS
Spark
Python
R
SQL
Tableau
Publisher
Data warehouse
Anaconda comes with this…
Parallel Python Libraries
❑ Dask
❑ Datasets look like NumpyArrays, Pandas DataFrames
❑ df.groupby(df.index).value.mean()
❑ Direct access into HDFS, S3 etc
❑ PySpark
❑ Also has DataFrames
❑ Connects to Spark
20
MUCH BIGGER DATA
STORAGE
Or, ‘Where do we put all this stuff?’
2
1
SQL Databases
❑ Row/column tables
❑ Keys
❑ SQL query language
❑ Joins etc (like Pandas)
ETL (Extract - Transform - Load)
❑ Extract
❑ Extract data from multiple sources
❑ Transform
❑ Convert data into database formats (e.g. sql)
❑ Load
❑ Load data into database
Data warehouses
NoSql Databases
❑ Not forced into row/column
❑ Lots of different types
❑ Key/value: can add feature without rewriting
tables
❑ Graph: stores nodes and edges
❑ Column: useful if you have a lot more reads
than writes
❑ Document: general-purpose. MongoDb is
commonly used.
Data Lakes
BIGGER DATA SCIENCE
TEAMS
Or, ‘Who does this stuff?’
2
7
Big Data Work
❑ Data Science
❑ Data Analysis
❑ Data Engineering
❑ Data Strategy
Big Data Science Teams
❑ Usually seen:
❑ Project manager
❑ Business analysts
❑ Data Scientists / Analysts: insight from data
❑ Data Engineers / Developers: data flow implementation, production systems
❑ Sometimes seen:
❑ Data Architect: data flow design
❑ User Experience / User Interface developer / Visual designer
Data Strategy
❑ Why should data be important here?
❑ Which business questions does this place have?
❑ What data does/could this place have access to?
❑ How much data work is already here?
❑ Who has the data science gene?
❑ What needs to change to make this place data-driven?
❑ People (training, culture)
❑ Processes
❑ Technologies (data access, storage, analysis tools)
❑ Data
Data Analysis
❑ What are the statistics of this dataset?
❑ E.g. which pages are popular
❑ Usually on already-formatted data, e.g. google analytics results
Data Science
❑ Ask an interesting question
❑ Get the data
❑ Explore the data
❑ Model the data
❑ Communicate and visualize your results
Data Engineering
❑ Big data storage
❑ SQL, NoSQL
❑ warehouses, lakes
❑ Cloud computing architectures
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ Big data analytics
❑ Distributed programming
platforms
❑ Privacy / security
❑ Uptime
❑ Maintenance
❑ etc.
EXERCISES
Or, ‘Trying some of this out’
3
4
Exercises
❑ Use pandas read_csv() to read a datafile in in chunks
LEARNING MORE
Or, ‘books’
3
6
READING
3
7
“Books are a
uniquely portable
magic” – Stephen
King
THANK YOU
sjterp@thoughtworks.com

Mais conteúdo relacionado

Mais procurados

Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
Lewis Crawford
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
Vignesh Prajapati
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
Natalino Busa
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Natalino Busa
 

Mais procurados (20)

Digital Contact's big data presentation to the University of Kent
Digital Contact's big data presentation to the University of KentDigital Contact's big data presentation to the University of Kent
Digital Contact's big data presentation to the University of Kent
 
Intro to Python for Data Science
Intro to Python for Data ScienceIntro to Python for Data Science
Intro to Python for Data Science
 
Big Data And Hadoop
Big Data And HadoopBig Data And Hadoop
Big Data And Hadoop
 
Introduction to Python for Data Science
Introduction to Python for Data ScienceIntroduction to Python for Data Science
Introduction to Python for Data Science
 
eScience: A Transformed Scientific Method
eScience: A Transformed Scientific MethodeScience: A Transformed Scientific Method
eScience: A Transformed Scientific Method
 
Big Tools for Big Data
Big Tools for Big DataBig Tools for Big Data
Big Tools for Big Data
 
Fundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and HadoopFundamentals of big data analytics and Hadoop
Fundamentals of big data analytics and Hadoop
 
Datawarehouse
DatawarehouseDatawarehouse
Datawarehouse
 
Big data processing system
Big data processing systemBig data processing system
Big data processing system
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 
Behind the scenes of data science
Behind the scenes of data scienceBehind the scenes of data science
Behind the scenes of data science
 
Why Hadoop is Useful?
Why Hadoop is Useful?Why Hadoop is Useful?
Why Hadoop is Useful?
 
Session 01 designing and scoping a data science project
Session 01 designing and scoping a data science projectSession 01 designing and scoping a data science project
Session 01 designing and scoping a data science project
 
Hadoop
HadoopHadoop
Hadoop
 
Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.Yo. big data. understanding data science in the era of big data.
Yo. big data. understanding data science in the era of big data.
 
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analyticsBig Data and APIs - a recon tour on how to successfully do Big Data analytics
Big Data and APIs - a recon tour on how to successfully do Big Data analytics
 
Big Data, Baby Steps
Big Data, Baby StepsBig Data, Baby Steps
Big Data, Baby Steps
 
Big data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup GroupBig data Intro - Presentation to OCHackerz Meetup Group
Big data Intro - Presentation to OCHackerz Meetup Group
 
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animationsRoots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
Roots tech 2013 Big Data at Ancestry (3-22-2013) - no animations
 
Top 10 data science technologies
Top 10 data science technologiesTop 10 data science technologies
Top 10 data science technologies
 

Semelhante a Session 10 handling bigger data

Semelhante a Session 10 handling bigger data (20)

Big Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with AzureBig Data Analytics: Finding diamonds in the rough with Azure
Big Data Analytics: Finding diamonds in the rough with Azure
 
Hadoop-2.6.0 Slides
Hadoop-2.6.0 SlidesHadoop-2.6.0 Slides
Hadoop-2.6.0 Slides
 
Big data-denis-rothman
Big data-denis-rothmanBig data-denis-rothman
Big data-denis-rothman
 
Column Stores and Google BigQuery
Column Stores and Google BigQueryColumn Stores and Google BigQuery
Column Stores and Google BigQuery
 
A gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and HadoopA gentle introduction to the world of BigData and Hadoop
A gentle introduction to the world of BigData and Hadoop
 
PUC Masterclass Big Data
PUC Masterclass Big DataPUC Masterclass Big Data
PUC Masterclass Big Data
 
Big Data - An Overview
Big Data -  An OverviewBig Data -  An Overview
Big Data - An Overview
 
Introduction to Big Data & Hadoop
Introduction to Big Data & HadoopIntroduction to Big Data & Hadoop
Introduction to Big Data & Hadoop
 
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
Hadoop Operations Powered By ... Hadoop (Hadoop Summit 2014 Amsterdam)
 
getFamiliarWithHadoop
getFamiliarWithHadoopgetFamiliarWithHadoop
getFamiliarWithHadoop
 
Big Data Technologies - Hadoop
Big Data Technologies - HadoopBig Data Technologies - Hadoop
Big Data Technologies - Hadoop
 
Fast and Scalable Python
Fast and Scalable PythonFast and Scalable Python
Fast and Scalable Python
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 
Terabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practiceTerabyte-scale image similarity search: experience and best practice
Terabyte-scale image similarity search: experience and best practice
 
Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
Hadoop ecosystem framework n hadoop in live environment
Hadoop ecosystem framework  n hadoop in live environmentHadoop ecosystem framework  n hadoop in live environment
Hadoop ecosystem framework n hadoop in live environment
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
Hadoop Technology
Hadoop TechnologyHadoop Technology
Hadoop Technology
 
Data analytics & its Trends
Data analytics & its TrendsData analytics & its Trends
Data analytics & its Trends
 
Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018Big Data Lakes Benchmarking 2018
Big Data Lakes Benchmarking 2018
 

Mais de bodaceacat

Ardrone represent
Ardrone representArdrone represent
Ardrone represent
bodaceacat
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection manager
bodaceacat
 

Mais de bodaceacat (20)

CansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for MisinformationCansecWest2019: Infosec Frameworks for Misinformation
CansecWest2019: Infosec Frameworks for Misinformation
 
2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master2019 11 terp_breuer_disclosure_master
2019 11 terp_breuer_disclosure_master
 
Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019Terp breuer misinfosecframeworks_cansecwest2019
Terp breuer misinfosecframeworks_cansecwest2019
 
Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019Misinfosec frameworks Cansecwest 2019
Misinfosec frameworks Cansecwest 2019
 
Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019Sjterp ds_of_misinfo_feb_2019
Sjterp ds_of_misinfo_feb_2019
 
Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018Practical Influence Operations, presentation at Sofwerx Dec 2018
Practical Influence Operations, presentation at Sofwerx Dec 2018
 
Session 09 learning relationships.pptx
Session 09 learning relationships.pptxSession 09 learning relationships.pptx
Session 09 learning relationships.pptx
 
Session 08 geospatial data
Session 08 geospatial dataSession 08 geospatial data
Session 08 geospatial data
 
Session 07 text data.pptx
Session 07 text data.pptxSession 07 text data.pptx
Session 07 text data.pptx
 
Session 06 machine learning.pptx
Session 06 machine learning.pptxSession 06 machine learning.pptx
Session 06 machine learning.pptx
 
Session 05 cleaning and exploring
Session 05 cleaning and exploringSession 05 cleaning and exploring
Session 05 cleaning and exploring
 
Session 04 communicating results
Session 04 communicating resultsSession 04 communicating results
Session 04 communicating results
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 02 python basics
Session 02 python basicsSession 02 python basics
Session 02 python basics
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Gp technologybuilds july2011
Gp technologybuilds july2011Gp technologybuilds july2011
Gp technologybuilds july2011
 
Ardrone represent
Ardrone representArdrone represent
Ardrone represent
 
Global pulse app connection manager
Global pulse app connection managerGlobal pulse app connection manager
Global pulse app connection manager
 
Un Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian InnovationUn Pulse Camp - Humanitarian Innovation
Un Pulse Camp - Humanitarian Innovation
 
Blue light services
Blue light servicesBlue light services
Blue light services
 

Último

Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
AroojKhan71
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
shambhavirathore45
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
shivangimorya083
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
JohnnyPlasten
 

Último (20)

Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...Determinants of health, dimensions of health, positive health and spectrum of...
Determinants of health, dimensions of health, positive health and spectrum of...
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Session 10 handling bigger data

  • 1. HANDLING BIGGER DATA What to do if your data’s too big Data nerding
  • 2. Your 5-7 things ❑ Bigger data ❑ Much bigger data ❑ Much bigger data storage ❑ Bigger data science teams
  • 3. BIGGER DATA Or, ‘data that’s a bit too big’ 3
  • 5. Computer storage 250Gb Internal hard drive. (hopefully) permanent storage. The place you’re storing photos, data etc 16Gb RAM. Temporary storage. The place read_csv loads your dataset into. 2Tb External hard drive. A handy place to keep bigger datafiles.
  • 6. Gigabytes, Terabytes etc. Name Size in bytes Contains (roughly) Byte 1 1 character (‘a’, ‘1’ etc) Kilobyte 1,000 Half a printed page Megabyte 1,000,000 1 novella. 5Mb = complete works of Shakespeare Gigabyte 1,000,000,000 1 high-fidelity symphony recording; 10m of shelved books Terabyte 1,000,000,000,000 All the x-ray films in a large hospital; 10 = library of congress collection. 2.6 = Panama Papers leak Petabyte 1,000,000,000,000,000 2 = all US academic libraries; 10= 1 hour’s output from SKA telescope Exabyte 1,000,000,000,000,000,000 5 = all words ever spoken by humans Zettabyte 1,000,000,000,000,000,000,000 Yottabyte 1,000,000,000,000,000,000,000,000 Current storage capacity of the Internet
  • 7. Things to Try: Too Big ❑Read data in ‘chunks’ csv_chunks = pandas.read_csv(‘myfile.csv’, chunksize = 10000) ❑ Divide and conquer in your code: csv_chunks = pandas.read_csv(‘myfile.csv’, skiprows=10000, chunksize = 10000) ❑Use parallel processing ❑ E.g the Dask library
  • 8. Things to try: Too Slow ❑Use %timeit to find where the speed problems are ❑Use compiled python, (e.g. the Numba library) ❑Use C code (via Cython) 8
  • 9. MUCH BIGGER DATA Or, ‘What if it really doesn’t fit?’ 9
  • 11. Much Faster Datastreams Twitter firehose: ❑ Firehose averages 6,000 tweets per second ❑ Record is 143,199 tweets in one second (Aug 3rd 2013, Japan) ❑ Twitter public streams = 1% of Firehose steam Google index (2013): ❑ 30 trillion unique pages on the internet ❑ Google index = 100 petabytes (100 million gigabytes) ❑ 100 billion web searches a month ❑ Search returned in about ⅛ second
  • 12. Distributed systems ❑ Store the data on multiple ‘servers’: ❑ Big idea: Distributed file systems ❑ Replicate data (server hardware breaks more often than you think) ❑ Do the processing on multiple servers: ❑ Lots of code does the same thing to different pieces of data ❑ Big idea: Map/Reduce
  • 13. Parallel Processors ❑Laptop: 4 cores, 16 GB RAM, 256 GB disk ❑Workstation: 24 cores, 1 TB RAM ❑Clusters: as big as you can imagine… 13
  • 15. Your typical rack server...
  • 17. Distributed Programming Platforms Hadoop ❑ HDFS: distributed filesystem ❑ MapReduce engine: processing Spark ❑ In-memory processing ❑ Because moving data around is the biggest bottleneck
  • 20. Parallel Python Libraries ❑ Dask ❑ Datasets look like NumpyArrays, Pandas DataFrames ❑ df.groupby(df.index).value.mean() ❑ Direct access into HDFS, S3 etc ❑ PySpark ❑ Also has DataFrames ❑ Connects to Spark 20
  • 21. MUCH BIGGER DATA STORAGE Or, ‘Where do we put all this stuff?’ 2 1
  • 22. SQL Databases ❑ Row/column tables ❑ Keys ❑ SQL query language ❑ Joins etc (like Pandas)
  • 23. ETL (Extract - Transform - Load) ❑ Extract ❑ Extract data from multiple sources ❑ Transform ❑ Convert data into database formats (e.g. sql) ❑ Load ❑ Load data into database
  • 25. NoSql Databases ❑ Not forced into row/column ❑ Lots of different types ❑ Key/value: can add feature without rewriting tables ❑ Graph: stores nodes and edges ❑ Column: useful if you have a lot more reads than writes ❑ Document: general-purpose. MongoDb is commonly used.
  • 27. BIGGER DATA SCIENCE TEAMS Or, ‘Who does this stuff?’ 2 7
  • 28. Big Data Work ❑ Data Science ❑ Data Analysis ❑ Data Engineering ❑ Data Strategy
  • 29. Big Data Science Teams ❑ Usually seen: ❑ Project manager ❑ Business analysts ❑ Data Scientists / Analysts: insight from data ❑ Data Engineers / Developers: data flow implementation, production systems ❑ Sometimes seen: ❑ Data Architect: data flow design ❑ User Experience / User Interface developer / Visual designer
  • 30. Data Strategy ❑ Why should data be important here? ❑ Which business questions does this place have? ❑ What data does/could this place have access to? ❑ How much data work is already here? ❑ Who has the data science gene? ❑ What needs to change to make this place data-driven? ❑ People (training, culture) ❑ Processes ❑ Technologies (data access, storage, analysis tools) ❑ Data
  • 31. Data Analysis ❑ What are the statistics of this dataset? ❑ E.g. which pages are popular ❑ Usually on already-formatted data, e.g. google analytics results
  • 32. Data Science ❑ Ask an interesting question ❑ Get the data ❑ Explore the data ❑ Model the data ❑ Communicate and visualize your results
  • 33. Data Engineering ❑ Big data storage ❑ SQL, NoSQL ❑ warehouses, lakes ❑ Cloud computing architectures ❑ Privacy / security ❑ Uptime ❑ Maintenance ❑ Big data analytics ❑ Distributed programming platforms ❑ Privacy / security ❑ Uptime ❑ Maintenance ❑ etc.
  • 34. EXERCISES Or, ‘Trying some of this out’ 3 4
  • 35. Exercises ❑ Use pandas read_csv() to read a datafile in in chunks
  • 37. READING 3 7 “Books are a uniquely portable magic” – Stephen King