Big Data involves several issues: data collection, storage, computing, analysis, and visualization. Python is a popular scripting language with good code readability, which makes it suitable for fast development. In these slides, the author shares how to address Big Data issues using open source Python tools.
1. When Big Data Meet Python
Jimmy Lai (賴弘哲)
jimmy.lai@oi-sys.com
2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
2012
When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
3. Outline
1. Big Data
a. Concept
b. Technical issues
2. Big Data + Python
a. Related open source tools
b. Example
4. Benefits of Big Data
1. Creating transparency, e.g. http://www.data.gov/
2. Enabling experimentation to discover needs, expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making with automated algorithms
5. Innovating new business models, products, and services
Shortage of deep data-analysis talent.
Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for innovation, competition, and productivity.
5. Initiative from the White House
• (Mar 2012) Big Data Research and Development Initiative, the White House.
• The National Science Foundation encourages education on Big Data.
• The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
6. Big Data Issues
Data sources: User Generated Content, Machine Generated Data
• Collecting
• Storage
• Computing
• Analysis
• Visualization
7. Big Data Techniques
Stage: Collecting
• Crawler
– Collect raw data
– E.g. Heritrix, Nutch
• Scraping
– Parse information from raw data
– E.g. Yahoo! Pipes, Scrapy
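To make the scraping step concrete, here is a minimal sketch of parsing information out of raw HTML. It uses only Python's standard-library html.parser rather than Scrapy, and the LinkExtractor class and RAW_HTML page are hypothetical names invented for this illustration; a real crawler would supply the raw page.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical raw page, standing in for what a crawler would have fetched.
RAW_HTML = """
<html><body>
  <h1>Open data portals</h1>
  <a href="http://www.data.gov/">Data.gov</a>
  <a href="http://scrapy.org/">Scrapy <b>framework</b></a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(RAW_HTML)
links = parser.links
```

The crawler's job ends at RAW_HTML; the scraper turns it into structured records (here, a list of tuples) that the storage stage can take over.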
8. Big Data Techniques
Stage: Storage
• Big Table
– Distributed key-value storage
– E.g. HBase, Cassandra
• NoSQL
– Does not use SQL for manipulation
– Does not use the relational database model
– E.g. MongoDB, Redis, CouchDB
9. Big Data Techniques
Stage: Computing
• Batch
– MapReduce
– E.g. Hadoop
• Real-time
– Stream processing
– E.g. S4, Storm
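The batch MapReduce model can be sketched in a few lines of plain Python. This is an in-process illustration only, not Hadoop or Disco code: the mapper and reducer exchange tab-separated text records in the style of a streaming job, and the built-in sorted() stands in for the shuffle/sort phase that a real cluster performs between the two steps. The function names and sample documents are assumptions of this sketch.

```python
from itertools import groupby


def mapper(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts of consecutive records sharing a key."""
    pairs = (record.split("\t") for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


documents = ["big data meet python", "python meet big data tools"]
# sorted() plays the role of the shuffle/sort phase between map and reduce.
word_counts = dict(reducer(sorted(mapper(documents))))
```

Because the reducer only ever looks at consecutive records with the same key, the same logic keeps working when the sorted records arrive as a stream that is too large to hold in memory.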
10. Big Data Techniques
Stage: Analysis
• Data mining
– Weka
• Machine learning
– scikit-learn
• Natural language processing
– NLTK, Stanford NLP
• Statistics
– R
11. Big Data Techniques
Stage: Visualization
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js
12. Why Python?
• Good code readability for fast development.
• Scripting language: the less code, the more productivity.
• Fast growing among open source communities.
– Commit statistics from ohloh.net
13. When Big Data meet Python
Data sources: User Generated Content, Machine Generated Data
• Collecting
– Scrapy: scraping framework
• Storage (infrastructure)
– PyMongo: Python client for MongoDB
• Computing (infrastructure)
– Hadoop streaming: Linux pipe interface
– Disco: lightweight MapReduce in Python
• Analysis
– Pandas: data analysis/manipulation
– Statsmodels: statistics
– NLTK: natural language processing
– Scikit-learn: machine learning
• Visualization
– Matplotlib: plotting
– NetworkX: graph visualization
14. When Big Data meet Python
Stage: Collecting
Scrapy: web scraping framework
http://scrapy.org/
• Simple and extensible
• Components:
– Scheduler
– Downloader
– Spider (scraper)
– Item pipeline
15. When Big Data meet Python
Stage: Storage
MongoDB: NoSQL database
http://www.mongodb.org/
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
• Auto-sharding
• Replica set
• File storage
• MapReduce aggregation
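What "document-oriented" and "no schema" mean can be shown without a database server. The sketch below is plain Python, not the PyMongo API: the find helper and the posts collection are hypothetical stand-ins that mimic a query-by-example lookup over JSON-like documents. With a running MongoDB, PyMongo's collection.find() accepts a filter document of the same shape.

```python
import json


def find(collection, query):
    """Return the documents whose fields equal every key/value in `query`."""
    return [doc for doc in collection
            if all(doc.get(key) == value for key, value in query.items())]


# No schema: documents in one collection may carry different fields.
posts = [
    {"_id": 1, "author": "jimmy", "title": "When Big Data Meet Python"},
    {"_id": 2, "author": "jimmy", "tags": ["python", "mongodb"]},
    {"_id": 3, "author": "guest", "title": "Hello", "views": 10},
]

by_jimmy = find(posts, {"author": "jimmy"})
# Each document serializes directly to JSON text.
as_json = json.dumps(posts[1], sort_keys=True)
```

Scraped items usually vary in which fields they contain, which is why a schemaless store pairs naturally with the collecting stage.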
16. When Big Data meet Python
Stage: Computing
Disco
http://discoproject.org/
• Distributed computing:
– MapReduce
– Disco distributed file system
• Write code in Python
– Easy and fast to profile
– Easy and fast to debug
17. When Big Data meet Python
Stage: Analysis
Pandas
http://pandas.pydata.org/
• Data analysis library
• Data structures for fast data manipulation
– Slicing
– Indexing
– Subsetting
• Handling missing data
• Aggregation
• Time series
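The operations listed above can be sketched on a small DataFrame. The data is hypothetical (daily page views for two made-up sites), and the calls assume a current pandas release, which may differ in detail from the 2012 version the talk used.

```python
import numpy as np
import pandas as pd

# Hypothetical daily page-view counts; NaN marks a missing measurement.
dates = pd.date_range("2012-08-01", periods=6, freq="D")
views = pd.DataFrame(
    {
        "site": ["a", "b", "a", "b", "a", "b"],
        "views": [100.0, 80.0, np.nan, 90.0, 120.0, np.nan],
    },
    index=dates,
)

first_three = views.iloc[:3]                                # slicing by position
aug_second = views.loc[pd.Timestamp("2012-08-02"), "views"]  # label-based indexing
site_a = views[views["site"] == "a"]                        # boolean subsetting
filled = views["views"].fillna(0.0)                         # handling missing data
mean_by_site = views.groupby("site")["views"].mean()        # aggregation, NaN skipped
two_day_totals = views["views"].resample("2D").sum()        # time series resampling
```

Note that mean() and sum() skip missing values by default, so filling with zero first would change the per-site means.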
18. When Big Data meet Python
Stage: Analysis
Statsmodels
http://statsmodels.sourceforge.net/
• Statistical analysis
• Statistical models
• Fit data with model
• Statistical tests
• Data exploration
• Time series analysis
19. When Big Data meet Python
Stage: Analysis
scikit-learn
http://scikit-learn.org/
• Machine learning algorithms
– Supervised learning
– Unsupervised learning
• Dataset
– Preprocessing
– Feature extraction
• Model
– Selection
– Pipeline
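As one way to see dataset, preprocessing, model, and pipeline working together, here is a small supervised-learning sketch. The choice of the bundled iris dataset, StandardScaler, and LogisticRegression is an assumption made for illustration, and the import paths follow a current scikit-learn release rather than the 2012 one.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)   # bundled example dataset
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Pipeline: a preprocessing step followed by a supervised model.
classifier = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
classifier.fit(X_train, y_train)
accuracy = classifier.score(X_test, y_test)   # mean accuracy on held-out data
```

Wrapping the scaler and the model in one pipeline ensures the scaling parameters are learned from the training split only.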
20. When Big Data meet Python
Stage: Analysis
NLTK: Natural Language Toolkit
http://nltk.org/
• Natural language processing
• Annotated corpora and resources
Information extraction workflow:
1. Sentence segmentation
2. Tokenization
3. POS tagging
4. Named entity recognition
5. Relation recognition
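The first two steps of the workflow can be sketched with NLTK tokenizers that need no downloaded corpora. The regex-based sentence splitter is a simplistic stand-in chosen for this illustration, not NLTK's trained sentence tokenizer, and the sample text is made up. The later steps (POS tagging, named entity and relation recognition) rely on trained models and are left out here.

```python
from nltk.tokenize import RegexpTokenizer, wordpunct_tokenize

text = "Jimmy Lai works at Oxygen Intelligence. He gave this talk in 2012!"

# Step 1, sentence segmentation: a simplistic regex stand-in (an assumption
# of this sketch); it would mis-split abbreviations such as "Dr.".
sentence_splitter = RegexpTokenizer(r"[^.!?]+[.!?]")
sentences = [s.strip() for s in sentence_splitter.tokenize(text)]

# Step 2, tokenization: split each sentence into word and punctuation tokens.
tokens = [wordpunct_tokenize(sentence) for sentence in sentences]
```

Each later stage consumes the output of the previous one, so the token lists produced here are what a POS tagger would receive next.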
21. When Big Data meet Python
Stage: Visualization
Matplotlib
http://matplotlib.sourceforge.net/
• Plotting
– Histograms
– Power spectra
– Bar charts
– Error charts
– Scatter plots
• Full control over plotting details
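Three of the plot types listed above can be drawn in one figure, as in the following sketch. The data is synthetic and seeded, the Agg backend is selected so that no display is required, and the figure is rendered into an in-memory buffer instead of a file; all of these are choices made for this illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(size=500)   # synthetic data for illustration
x = np.arange(5)
heights = np.array([3.0, 5.0, 2.0, 6.0, 4.0])
errors = np.full(5, 0.5)

fig, (ax_hist, ax_bar, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))
counts, bin_edges, _ = ax_hist.hist(samples, bins=20)
ax_hist.set_title("Histogram")
ax_bar.bar(x, heights, yerr=errors, capsize=3)   # bar chart with error bars
ax_bar.set_title("Bar chart")
ax_scatter.scatter(samples[:-1], samples[1:], s=8, alpha=0.6)
ax_scatter.set_title("Scatter plot")
fig.tight_layout()

buffer = io.BytesIO()
fig.savefig(buffer, format="png")   # render the figure as PNG bytes
```

Every element (titles, marker size, error-bar caps) is set through explicit calls on the axes objects, which is where the fine-grained control comes from.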
22. When Big Data meet Python
Stage: Visualization
NetworkX
http://networkx.lanl.gov/
• Graph algorithms and visualization
• Draw graph with layout:
– Circular
– Random
– Spectral
– Spring
– Shell
– Graphviz
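A short sketch of a graph algorithm plus several of the layouts named above follows. The graph itself is hypothetical (it simply links some tools from these slides), and the seed arguments assume a current NetworkX release.

```python
import networkx as nx

# Small hypothetical graph linking some of the tools named in these slides.
graph = nx.Graph()
graph.add_edges_from([
    ("Scrapy", "MongoDB"), ("MongoDB", "Disco"), ("Disco", "Pandas"),
    ("Pandas", "Matplotlib"), ("Pandas", "Scikit-learn"),
])

path = nx.shortest_path(graph, "Scrapy", "Matplotlib")   # a graph algorithm

# Each layout function maps every node to an (x, y) position.
layouts = {
    "circular": nx.circular_layout(graph),
    "random": nx.random_layout(graph, seed=0),
    "spring": nx.spring_layout(graph, seed=0),
    "shell": nx.shell_layout(graph),
}
```

Passing one of these position dictionaries to nx.draw(graph, pos=...) renders the graph through Matplotlib.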
26. Thank you for your attention.
Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer
Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• Company profile: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com