When Big Data Meet Python

                             Jimmy Lai (賴弘哲)
                           jimmy.lai@oi-sys.com
                                2012/08/19
Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python


                          2012
 When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
About Me
• 賴弘哲 (Jimmy Lai)
• Interests: data mining, machine learning,
  natural language processing, distributed
  computing, Python
• LinkedIn profile: http://goo.gl/XTEM5
• Currently at Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索公司),
  working on semantic analysis of big data


Outline
1. Big Data
  a. Concept
  b. Technical issues
2. Big Data + Python
  a. Related open source tools
  b. Example




Benefits of Big Data
1. Creating transparency, e.g. http://www.data.gov/
2. Enabling experimentation to discover needs,
   expose variability, and improve performance
3. Segmenting populations to customize actions
4. Replacing/supporting human decision making
   with automated algorithms
5. Innovating new business models, products, and
   services
Shortage of deep data analytics talent
Source: McKinsey Global Institute (May 2011). Big Data: The next frontier for
innovation, competition, and productivity.

Initiative from the White House
• (Mar 2012) Big Data Research and
  Development Initiative, the White House.
• National Science Foundation encourages
  education on Big Data.
• The government invests in developing state-of-the-art
  technologies, harnessing those technologies, and
  expanding the workforce for Big Data.

Big Data Issues
Data sources: User Generated Content, Machine Generated Data
Stages: Collecting → Storage → Computing → Analysis → Visualization

Big Data Techniques (Collecting)
• Crawler
  – Collect raw data
  – E.g. Heritrix, Nutch
• Scraping
  – Parse information from raw data
  – E.g. Yahoo! Pipes, Scrapy

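To make "parse information from raw data" concrete, here is a minimal sketch that pulls links out of an HTML string using only Python's standard library. It is not how Scrapy or Yahoo! Pipes work internally; the `LinkExtractor` class, the `extract_links` helper, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> tag.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return all (href, text) pairs found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical raw page, standing in for what a crawler would have downloaded.
RAW_PAGE = """
<html><body>
  <h1>Food blog</h1>
  <a href="/post/1">Beef noodle review</a>
  <a href="/post/2">Night market <b>snacks</b></a>
</body></html>
"""

links = extract_links(RAW_PAGE)
print(links)
```

The crawler's job ends at producing something like `RAW_PAGE`; the scraper's job is everything after that.
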
Big Data Techniques (Storage)
• Big Table
  – Distributed key-value storage
  – E.g. HBase, Cassandra
• NoSQL
  – Does not use SQL for manipulation
  – Does not use the relational database model
  – E.g. MongoDB, Redis, CouchDB

Big Data Techniques (Computing)
• Batch
  – MapReduce
  – E.g. Hadoop
• Real-time
  – Stream processing
  – E.g. S4, Storm

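The MapReduce model named above can be sketched in a single process with plain Python. This is only an illustration of the map, shuffle, and reduce phases on a word count; it is not Hadoop's API, and the function names and sample documents are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce step: sum all counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(lines):
    # Map: apply the mapper to every input record.
    pairs = [pair for line in lines for pair in map_words(line)]
    # Shuffle: sort by key so equal keys are adjacent, then group them.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce: one call per distinct key.
    return dict(
        reduce_counts(word, (count for _, count in group))
        for word, group in grouped
    )


# Hypothetical input records.
documents = ["big data meet python", "python loves data"]
print(run_mapreduce(documents))
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the structure of the user code stays the same.
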
Big Data Techniques (Analysis)
• Data mining
  – Weka
• Machine learning
  – scikit-learn
• Natural language processing
  – NLTK, Stanford NLP
• Statistics
  – R

Big Data Techniques (Visualization)
• Abstract
• Interactive
• E.g. Processing, Gephi, D3.js

Why Python?
• Good code readability for fast development.
• Scripting language: the less code, the more productivity.
• Fast growing among open source communities.
  – Commit statistics from ohloh.net

When Big Data meet Python
Data sources: User Generated Content, Machine Generated Data
Diagram label: Infrastructure
• Collecting
  – Scrapy: scraping framework
• Storage
  – PyMongo: Python client for MongoDB
• Computing
  – Hadoop streaming: Linux pipe interface
  – Disco: lightweight MapReduce in Python
• Analysis
  – Pandas: data analysis/manipulation
  – Statsmodels: statistics
  – NLTK: natural language processing
  – Scikit-learn: machine learning
• Visualization
  – Matplotlib: plotting
  – NetworkX: graph visualization

When Big Data meet Python (Collecting)
Scrapy: web scraping framework, http://scrapy.org/
• Simple and extensible
• Components:
  • Scheduler
  • Downloader
  • Spider (Scraper)
  • Item pipeline

When Big Data meet Python (Storage)
MongoDB: NoSQL database, http://www.mongodb.org/
• PyMongo: client for Python
• Document (JSON)-oriented
• No schema
• Scalable
  • Auto-sharding
  • Replica set
• File storage
• MapReduce aggregation

When Big Data meet Python (Computing)
Disco: http://discoproject.org/
• Distributed computing:
  – MapReduce
  – Disco distributed file system
• Write code in Python
  – Easy/fast to profile
  – Easy/fast to debug

When Big Data meet Python (Analysis)
pandas: http://pandas.pydata.org/
• Data analysis library
• Data structures for fast data manipulation
  – Slicing
  – Indexing
  – Subsetting
• Handling missing data
• Aggregation
• Time series

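The pandas features listed above can be tried on a tiny table. The following is a sketch using current pandas calls; the review data and column names are hypothetical and chosen only to exercise slicing, subsetting, missing data, aggregation, and time series resampling.

```python
import numpy as np
import pandas as pd

# Hypothetical review data; the column names are made up for illustration.
reviews = pd.DataFrame(
    {
        "restaurant": ["A", "A", "B", "B", "C"],
        "rating": [4.0, np.nan, 3.0, 5.0, np.nan],
    },
    index=pd.date_range("2012-08-01", periods=5, freq="D"),
)

# Slicing / indexing: label-based slice on the DatetimeIndex (both ends inclusive).
first_three_days = reviews.loc["2012-08-01":"2012-08-03"]

# Subsetting: a boolean mask keeps only rows for restaurant "B".
only_b = reviews[reviews["restaurant"] == "B"]

# Handling missing data: drop rows with no rating, or fill with a default.
rated = reviews.dropna(subset=["rating"])
filled = reviews["rating"].fillna(0.0)

# Aggregation: mean rating per restaurant, computed on the rated rows only.
mean_rating = rated.groupby("restaurant")["rating"].mean()

# Time series: resample the daily ratings into 2-day buckets.
two_day_mean = reviews["rating"].resample("2D").mean()

print(mean_rating)
print(two_day_mean)
```

Restaurant "C" disappears from `mean_rating` because its only rating is missing and was dropped before grouping.
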
When Big Data meet Python (Analysis)
Statsmodels: http://statsmodels.sourceforge.net/
• Statistical analysis
  • Statistical models
  • Fit data with model
  • Statistical tests
  • Data exploration
  • Time series analysis

When Big Data meet Python (Analysis)
scikit-learn: http://scikit-learn.org/
• Machine learning algorithms
• Supervised learning
• Unsupervised learning
• Dataset
  • Preprocessing
  • Feature extraction
• Model
  • Selection
  • Pipeline

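As a small illustration of feature extraction, supervised learning, and the pipeline idea from this slide, the sketch below chains a bag-of-words vectorizer with a naive Bayes classifier. The four-sentence dataset and its labels are invented for the example and far too small for real use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny hypothetical dataset: 1 = positive review, 0 = negative review.
texts = [
    "delicious noodles great service",
    "great soup delicious dumplings",
    "terrible noodles rude service",
    "rude staff terrible soup",
]
labels = [1, 1, 0, 0]

# Feature extraction and the model are chained so they are fit together.
model = Pipeline(
    [
        ("vectorizer", CountVectorizer()),  # text -> bag-of-words counts
        ("classifier", MultinomialNB()),    # supervised learner
    ]
)
model.fit(texts, labels)

predictions = model.predict(["delicious dumplings", "terrible rude staff"])
print(predictions)
```

Because the vectorizer lives inside the pipeline, new text passed to `predict` goes through the same vocabulary that was learned during `fit`.
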
When Big Data meet Python (Analysis)
NLTK: Natural Language Toolkit, http://nltk.org/
• Natural language processing
• Annotated corpora and resources
• Information extraction workflow:
  Sentence Segmentation → Tokenization → POS tagging →
  Named Entity Recognition → Relation Recognition

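The first two stages of the information extraction workflow can be sketched with naive regular expressions from the standard library. This is deliberately not NLTK: NLTK ships trained models for these steps, whereas the `split_sentences` and `tokenize` helpers here are hypothetical stand-ins that will mis-split abbreviations such as "Dr." and only show the shape of the data passed between stages.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Naive tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


text = "Jimmy works in Taipei. He likes Python!"
sentences = split_sentences(text)            # stage 1: list of sentences
tokens = [tokenize(s) for s in sentences]    # stage 2: list of token lists
print(sentences)
print(tokens)
```

POS tagging, named entity recognition, and relation recognition would each take the token lists as input and add annotations on top of them.
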
When Big Data meet Python (Visualization)
Matplotlib: http://matplotlib.sourceforge.net/
• Plotting
  – Histograms
  – Power spectra
  – Bar charts
  – Error charts
  – Scatter plots
• Full control over plotting details

When Big Data meet Python (Visualization)
NetworkX: http://networkx.lanl.gov/
• Graph algorithms and visualization
• Draw graph with layout:
  – Circular
  – Random
  – Spectral
  – Spring
  – Shell
  – Graphviz

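A short sketch of the two sides of NetworkX mentioned here, a graph algorithm and layout computation. The four-node graph is hypothetical; the layout functions return node positions, which a plotting call would then draw.

```python
import networkx as nx

# Small hypothetical graph: which of four blogs link to each other.
graph = nx.Graph()
graph.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")])

# A graph algorithm: shortest path between two nodes.
path = nx.shortest_path(graph, "a", "d")

# Layouts: each function returns a dict mapping node -> (x, y) position.
circular_positions = nx.circular_layout(graph)
spring_positions = nx.spring_layout(graph, seed=42)  # seed fixes the random start

print(path)
print(circular_positions)
```

With Matplotlib installed, passing one of these dicts as `pos` to `nx.draw(graph, pos=...)` renders the graph with that layout.
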
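聚寶評 (www.ezpao.com)
• Food search engine
• Searches food reviews from major blogs

聚寶評 (www.ezpao.com)
• Semantic analysis search engine

• Review topic analysis
• Analysis of dishes shared by users
• Positive/negative review analysis
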
Thank you for your attention.
           Q&A
We are hiring!
• Core engine algorithm R&D engineer
• System R&D engineer
• Web application R&D engineer

Oxygen Intelligence Taiwan Limited
引京聚點 知識結構搜索股份有限公司
• About the company: http://www.ezpao.com/about/
• Job openings: http://www.ezpao.com/join/
• Please send your résumé to jimmy.lai@oi-sys.com


Mais conteúdo relacionado

Mais procurados

Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseJihoon Son
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysCAPSiDE
 
An introduction to U1db
An introduction to U1dbAn introduction to U1db
An introduction to U1dbDavid Planella
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
Bubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsBubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsStefan Urbanek
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureLuiz Henrique Zambom Santana
 
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveAkamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveRick Viscomi
 
Tracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveTracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveRick Viscomi
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioWinston Chen
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Roy Russo
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLMongoDB
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo dbMongoDB
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...MongoDB
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshopMathieu Elie
 

Mais procurados (20)

MongoDB and Python
MongoDB and PythonMongoDB and Python
MongoDB and Python
 
Introduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data WarehouseIntroduction to Apache Tajo: Future of Data Warehouse
Introduction to Apache Tajo: Future of Data Warehouse
 
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDaysConexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
Conexión de MongoDB con Hadoop - Luis Alberto Giménez - CAPSiDE #DevOSSAzureDays
 
An introduction to U1db
An introduction to U1dbAn introduction to U1db
An introduction to U1db
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
 
Bubbles – Virtual Data Objects
Bubbles – Virtual Data ObjectsBubbles – Virtual Data Objects
Bubbles – Virtual Data Objects
 
Design of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore ArchitectureDesign of Experiments on Federator Polystore Architecture
Design of Experiments on Federator Polystore Architecture
 
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP ArchiveAkamai Edge: Tracking the Performance of the Web with HTTP Archive
Akamai Edge: Tracking the Performance of the Web with HTTP Archive
 
Tracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP ArchiveTracking the Performance of the Web with HTTP Archive
Tracking the Performance of the Web with HTTP Archive
 
Python and MongoDB
Python and MongoDB Python and MongoDB
Python and MongoDB
 
MongoDB
MongoDBMongoDB
MongoDB
 
Data Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudioData Science Stack with MongoDB and RStudio
Data Science Stack with MongoDB and RStudio
 
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis PipelinesMongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
MongoDB Days UK: Using MongoDB and Python for Data Analysis Pipelines
 
Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015Elasticsearch - DevNexus 2015
Elasticsearch - DevNexus 2015
 
Back to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQLBack to Basics Webinar 1: Introduction to NoSQL
Back to Basics Webinar 1: Introduction to NoSQL
 
R statistics with mongo db
R statistics with mongo dbR statistics with mongo db
R statistics with mongo db
 
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
Big Data Analytics 3: Machine Learning to Engage the Customer, with Apache Sp...
 
MongoDB
MongoDBMongoDB
MongoDB
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
elasticsearch basics workshop
elasticsearch basics workshopelasticsearch basics workshop
elasticsearch basics workshop
 

Destaque

Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profitFederico Feroldi
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source toolsSammy Fung
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Sammy Fung
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...Anton
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr UsageJimmy Lai
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookJimmy Lai
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst NanodegreeJimmy Lai
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast PrototypingJimmy Lai
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012Jimmy Lai
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in pythonJimmy Lai
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHugJimmy Lai
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesJimmy Lai
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge BaseJimmy Lai
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHugJimmy Lai
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugJimmy Lai
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learnJimmy Lai
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Jimmy Lai
 

Destaque (19)

Crawling the web for fun and profit
Crawling the web for fun and profitCrawling the web for fun and profit
Crawling the web for fun and profit
 
Collecting web information with open source tools
Collecting web information with open source toolsCollecting web information with open source tools
Collecting web information with open source tools
 
Scrapy.for.dummies
Scrapy.for.dummiesScrapy.for.dummies
Scrapy.for.dummies
 
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
Web scraping 1 2-3 with python + scrapy (Summer BarCampHK 2012 version)
 
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
How to Scrap Any Website's content using ScrapyTutorial of How to scrape (cra...
 
摘星
摘星摘星
摘星
 
[LDSP] Solr Usage
[LDSP] Solr Usage[LDSP] Solr Usage
[LDSP] Solr Usage
 
Fast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython NotebookFast data mining flow prototyping using IPython Notebook
Fast data mining flow prototyping using IPython Notebook
 
Data Analyst Nanodegree
Data Analyst NanodegreeData Analyst Nanodegree
Data Analyst Nanodegree
 
[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping[LDSP] Search Engine Back End API Solution for Fast Prototyping
[LDSP] Search Engine Back End API Solution for Fast Prototyping
 
Nltk natural language toolkit overview and application @ PyCon.tw 2012
Nltk  natural language toolkit overview and application @ PyCon.tw 2012Nltk  natural language toolkit overview and application @ PyCon.tw 2012
Nltk natural language toolkit overview and application @ PyCon.tw 2012
 
Software development practices in python
Software development practices in pythonSoftware development practices in python
Software development practices in python
 
Documentation with sphinx @ PyHug
Documentation with sphinx @ PyHugDocumentation with sphinx @ PyHug
Documentation with sphinx @ PyHug
 
Apache thrift-RPC service cross languages
Apache thrift-RPC service cross languagesApache thrift-RPC service cross languages
Apache thrift-RPC service cross languages
 
Build a Searchable Knowledge Base
Build a Searchable Knowledge BaseBuild a Searchable Knowledge Base
Build a Searchable Knowledge Base
 
Nltk natural language toolkit overview and application @ PyHug
Nltk  natural language toolkit overview and application @ PyHugNltk  natural language toolkit overview and application @ PyHug
Nltk natural language toolkit overview and application @ PyHug
 
NetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHugNetworkX - python graph analysis and visualization @ PyHug
NetworkX - python graph analysis and visualization @ PyHug
 
Text classification in scikit-learn
Text classification in scikit-learnText classification in scikit-learn
Text classification in scikit-learn
 
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
Text Classification in Python – using Pandas, scikit-learn, IPython Notebook ...
 

Semelhante a When big data meet python @ COSCUP 2012

Hadoop as data refinery
Hadoop as data refineryHadoop as data refinery
Hadoop as data refinerySteve Loughran
 
Hadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranHadoop as Data Refinery - Steve Loughran
Hadoop as Data Refinery - Steve LoughranJAX London
 
From open data to API-driven business
From open data to API-driven businessFrom open data to API-driven business
From open data to API-driven businessOpenDataSoft
 
NoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesNoSQL & Big Data Analytics: History, Hype, Opportunities
NoSQL & Big Data Analytics: History, Hype, OpportunitiesVishy Poosala
 
Apache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingApache hadoop bigdata-in-banking
Apache hadoop bigdata-in-bankingm_hepburn
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...Mihai Criveti
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012Getting Started with MongoDB at Oracle Open World 2012
Getting Started with MongoDB at Oracle Open World 2012MongoDB
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-HadoopNagarjuna D.N
 
Rails with MongoDB
Rails with MongoDBRails with MongoDB
Rails with MongoDBEugene Park
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Using hadoop to expand data warehousing
Using hadoop to expand data warehousingUsing hadoop to expand data warehousing
Using hadoop to expand data warehousingDataWorks Summit
 
Big Data = Big Decisions
Big Data = Big DecisionsBig Data = Big Decisions
Big Data = Big DecisionsInnoTech
 
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...
Big Data Analytics: Applications and Opportunities in On-line Predictive Mode...BigMine
 
Data mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsData mining - GDi Techno Solutions
Data mining - GDi Techno SolutionsGDi Techno Solutions
 
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMETHE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUME
THE 3V’S OF BIG DATA: VARIETY, VELOCITY, and VOLUMEGigaom
 
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...
Scaling big data mining infrastructure thetwitte experience - Jimmy Lin and D...Ohud Saud
 
NoSQL for the SQL Server Pro
NoSQL for the SQL Server ProNoSQL for the SQL Server Pro
NoSQL for the SQL Server ProLynn Langit
 

When big data meet python @ COSCUP 2012

  • 1. When Big Data Meet Python
      Jimmy Lai (賴弘哲), jimmy.lai@oi-sys.com, 2012/08/19
      Slides: http://www.slideshare.net/jimmy_lai/when-big-data-meet-python
      When big data meet python by Jimmy Lai is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
  • 2. About me
      • Jimmy Lai (賴弘哲)
      • Interests: data mining, machine learning, natural language processing, distributed computing, Python
      • LinkedIn profile: http://goo.gl/XTEM5
      • Currently works at Oxygen Intelligence (引京聚點知識結構搜索公司) on semantic analysis of big data
  • 3. Outline
      1. Big Data
         a. Concept
         b. Technical issues
      2. Big Data + Python
         a. Related open source tools
         b. Example
  • 4. Benefits of Big Data
      1. Creating transparency, e.g. http://www.data.gov/
      2. Enabling experimentation to discover needs, expose variability, and improve performance
      3. Segmenting populations to customize actions
      4. Replacing/supporting human decision making with automated algorithms
      5. Innovating new business models, products, and services
      Challenge: a shortage of deep data-analysis talent.
      Source: Big Data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute (May 2011).
  • 5. Initiative from the White House
      • (Mar 2012) Big Data Research and Development Initiative, the White House.
      • The National Science Foundation encourages education on Big Data.
      • The government invests in developing state-of-the-art technologies, harnessing those technologies, and expanding the workforce for Big Data.
  • 6. Big Data Issues
      • Data sources: user-generated content, machine-generated data
      • Stages: collecting, storage, computing, analysis, visualization
  • 7. Big Data Techniques (collecting)
      • Crawler: collects raw data, e.g. Heritrix, Nutch
      • Scraping: parses information from raw data, e.g. Yahoo! Pipes, Scrapy
  • 8. Big Data Techniques (storage)
      • Big Table: distributed key-value storage, e.g. HBase, Cassandra
      • NoSQL: does not use SQL for data manipulation and does not use the relational database model, e.g. MongoDB, Redis, CouchDB
  • 9. Big Data Techniques (computing)
      • Batch: MapReduce, e.g. Hadoop
      • Real-time: stream processing, e.g. S4, Storm
  • 10. Big Data Techniques (analysis)
      • Data mining: Weka
      • Machine learning: scikit-learn
      • Natural language processing: NLTK, Stanford NLP
      • Statistics: R
  • 11. Big Data Techniques (visualization)
      • Abstract
      • Interactive
      • E.g. Processing, Gephi, D3.js
  • 12. Why Python?
      • Good code readability for fast development.
      • Scripting language: the less code, the more productivity.
      • Fast growing among open source communities (commit statistics from ohloh.net).
  • 13. When Big Data meet Python (tools by stage)
      • Collecting: Scrapy (scraping framework)
      • Storage: PyMongo (Python client for MongoDB)
      • Computing: Hadoop streaming (Linux pipe interface); Disco (lightweight MapReduce in Python)
      • Analysis: Pandas (data analysis/manipulation); Statsmodels (statistics); NLTK (natural language processing); Scikit-learn (machine learning)
      • Visualization: Matplotlib (plotting); NetworkX (graph visualization)
      The slide also carries the label "Infrastructure" next to the storage and computing tools.
  • 14. When Big Data meet Python: Scrapy (http://scrapy.org/)
      • Web scraping framework
      • Simple and extensible
      • Components: scheduler, downloader, spider (scraper), item pipeline
  • 15. When Big Data meet Python: MongoDB (http://www.mongodb.org/)
      • NoSQL database
      • PyMongo: client for Python
      • Document (JSON)-oriented
      • No schema
      • Scalable: auto-sharding, replica sets
      • File storage
      • MapReduce aggregation
  • 16. When Big Data meet Python: Disco (http://discoproject.org/)
      • Distributed computing: MapReduce, Disco distributed file system
      • Write code in Python: easy and fast to profile, easy and fast to debug
  • 17. When Big Data meet Python: pandas (http://pandas.pydata.org/)
      • Data analysis library
      • Data structures for fast data manipulation: slicing, indexing, subsetting
      • Handling missing data
      • Aggregation
      • Time series
  • 18. When Big Data meet Python: Statsmodels (http://statsmodels.sourceforge.net/)
      • Statistical analysis
      • Statistical models
      • Fit data with model
      • Statistical tests
      • Data exploration
      • Time series analysis
  • 19. When Big Data meet Python: scikit-learn (http://scikit-learn.org/)
      • Machine learning algorithms: supervised learning, unsupervised learning
      • Dataset: preprocessing, feature extraction
      • Model: selection, pipeline
  • 20. When Big Data meet Python: NLTK, the Natural Language Toolkit (http://nltk.org/)
      • Natural language processing
      • Annotated corpora and resources
      • Information extraction workflow: sentence segmentation -> tokenization -> POS tagging -> named entity recognition -> relation recognition
  • 21. When Big Data meet Python: Matplotlib (http://matplotlib.sourceforge.net/)
      • Plotting: histograms, power spectra, bar charts, error charts, scatter plots
      • Full control over the details of a plot
  • 22. When Big Data meet Python: NetworkX (http://networkx.lanl.gov/)
      • Graph algorithms and visualization
      • Draw graphs with a layout: circular, random, spectral, spring, shell, Graphviz
  • 23. 聚寶評 (www.ezpao.com): a food search engine that searches food reviews across the major blogs
  • 24. 聚寶評 (www.ezpao.com): a semantic-analysis search engine
  • 25. Review topic analysis; analysis of dishes shared by users; positive/negative review analysis
  • 26. Thank you for your attention. Q&A
      We are hiring!
      • Core engine algorithm R&D engineer
      • System R&D engineer
      • Web application R&D engineer
      Oxygen Intelligence Taiwan Limited (引京聚點知識結構搜索股份有限公司)
      • Company profile: http://www.ezpao.com/about/
      • Job openings: http://www.ezpao.com/join/
      • Please send your résumé to jimmy.lai@oi-sys.com
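
The slides name MapReduce as the batch computing technique and describe Hadoop streaming as a "Linux pipe interface". The following is a minimal sketch of that idea, not code from the talk: a word count written as a mapper and a reducer that exchange tab-separated text records, with the sort between them simulated in-process. The function names `mapper`, `reducer`, and `run_local` are hypothetical, and no Hadoop API is used.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map step: emit one 'word<TAB>1' record per word."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Reduce step: sum the counts of consecutive records sharing a key.

    Assumes the records arrive sorted by key, as they would after the
    sort/shuffle phase between map and reduce.
    """
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Simulate `mapper | sort | reducer` in a single process."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["big data meet python", "python meet big data tools"]
    for word, total in sorted(run_local(sample).items()):
        print(f"{word}\t{total}")
```

Because both steps only read and write lines of text, each could be moved into its own script that reads standard input and writes standard output, which is the shape a pipe-based interface expects.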