Python for Business Intelligence


Štefan Urbánek ■ @Stiivi ■ stefan.urbanek@continuum.io ■ PyData NYC, October 2012
python business intelligence




Results

Q/A and articles with Java
  solution references


               (not listed here)
Why?
Overview

■ Traditional Data Warehouse
■ Python and Data
■ Is Python Capable?
■ Conclusion
Business
Intelligence
people

technology processes
Data Sources | Extraction, Transformation, Loading | Analysis and Presentation
Data Governance
Technologies and Utilities
Traditional Data
  Warehouse
■ Extracting data from the original sources
■ Quality assuring and cleaning data
■ Conforming the labels and measures in the data to achieve consistency across the original sources
■ Delivering data in a physical format that can be used by query tools, report writers, and dashboards.

Source: Ralph Kimball, The Data Warehouse ETL Toolkit
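
The four steps above can be sketched in plain Python. This is only an illustration: the sample data, the `REGION_LABELS` mapping and the function names are hypothetical and come neither from the talk nor from Kimball's book.

```python
import csv
import io

# Hypothetical raw extract; in practice this would come from a source system.
RAW = "region,amount\nEU ,100\nus,250\n,75\neu,n/a\n"

# Hypothetical mapping used to conform labels across sources.
REGION_LABELS = {"eu": "Europe", "us": "United States"}


def extract(text):
    """Extract: read rows from the original source."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Quality-assure and clean: drop rows with no region or a non-numeric amount."""
    cleaned = []
    for row in rows:
        region = row["region"].strip().lower()
        if not region or not row["amount"].isdigit():
            continue
        cleaned.append({"region": region, "amount": int(row["amount"])})
    return cleaned


def conform(rows):
    """Conform: map source-specific labels to one shared vocabulary."""
    return [dict(row, region=REGION_LABELS[row["region"]]) for row in rows]


def deliver(rows):
    """Deliver: write rows in a format that query and reporting tools can read."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["region", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


delivered = deliver(conform(clean(extract(RAW))))
```

Keeping each step a separate function makes it possible to test and replace the steps independently.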
Source Systems         Staging Area             Operational Data Store   Datamarts

structured documents   Temporary Staging Area
databases
APIs

                       staging (L0)             relational (L1)          dimensional (L2)
real time = daily
Multi-dimensional
    Modeling
aggregation browsing
     slicing and dicing
business / analyst’s
       point of view

regardless of physical schema implementation
Facts

■ measurable
■ fact data cell
■ most detailed information
dimensions: location, type, time
Dimension

■ provide context for facts
■ used to filter queries or reports
■ control scope of aggregation of facts
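
A small sketch of how facts and dimensions relate, in plain Python. The fact rows, the `amount` measure and the `aggregate` helper are hypothetical; the dimension names follow the slide above.

```python
from collections import defaultdict

# Hypothetical facts: each carries a measure (amount) plus dimension values.
facts = [
    {"location": "Bratislava", "type": "services", "year": 2010, "amount": 120},
    {"location": "Bratislava", "type": "goods", "year": 2011, "amount": 80},
    {"location": "New York", "type": "goods", "year": 2010, "amount": 200},
    {"location": "New York", "type": "services", "year": 2011, "amount": 50},
]


def aggregate(facts, by, where=None):
    """Sum the amount measure grouped by one dimension, optionally filtered by others."""
    where = where or {}
    totals = defaultdict(int)
    for fact in facts:
        # Dimensions used as a filter: skip facts outside the requested context.
        if any(fact[dim] != value for dim, value in where.items()):
            continue
        # The grouping dimension controls the scope of aggregation.
        totals[fact[by]] += fact["amount"]
    return dict(totals)


by_location = aggregate(facts, by="location")
goods_by_year = aggregate(facts, by="year", where={"type": "goods"})
```

The same facts answer different questions depending on which dimension groups them and which dimensions filter them.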
Pentaho
Python and Data
   community perception*




                           *as of Oct 2012
Scientific & Financial
Python
Scientific Data

       T1[s]     T2[s]     T3[s]     T4[s]
P1     112,68    941,67    171,01    660,48
P2      96,15    306,51    725,88    877,82
P3     313,39    189,31     41,81    428,68
P4     760,62    983,48    371,21    281,19
P5     838,56     39,27    389,42    231,12

n-dimensional array of numbers
Assumptions

■ data is mostly numbers
■ data is neatly organized...
■ … in one multi-dimensional array
Business Data

■ multiple snapshots of one source
■ multiple representations of same data
■ categories are changing
❄
Is Python Capable?
     very basic examples
Data Pipes with
   SQLAlchemy

■ connection: create_engine
■ schema reflection: MetaData, Table
■ expressions: select(), insert()
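from sqlalchemy import create_engine, MetaData, Table

# SQLAlchemy API as of 2012: bound MetaData and autoload=True
# were later removed in SQLAlchemy 2.0.

# source: reflect the existing table definition from the database
src_engine = create_engine("sqlite:///data.sqlite")
src_metadata = MetaData(bind=src_engine)
src_table = Table('data', src_metadata, autoload=True)

# target: an empty table definition, filled in from the source below
target_engine = create_engine("postgresql://localhost/sandbox")
target_metadata = MetaData(bind=target_engine)
target_table = Table('data', target_metadata)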
clone schema:

for column in src_table.columns:
    target_table.append_column(column.copy())

target_table.create()




copy data:

insert = target_table.insert()

for row in src_table.select().execute():
    insert.execute(row)
magic used:

metadata reflection
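
What "metadata reflection" amounts to can be sketched with the standard-library `sqlite3` module instead of SQLAlchemy. This is not how SQLAlchemy implements it: `PRAGMA table_info` is specific to SQLite, and `clone_table` and the in-memory databases are hypothetical stand-ins for the source and target above.

```python
import sqlite3


def clone_table(src, dst, name):
    """Reflect a table's columns from src, recreate it in dst, and copy the rows.

    The table name is interpolated into SQL, so it must come from a trusted source.
    """
    # Reflection: ask the database itself which columns the table has.
    # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
    info = src.execute(f'PRAGMA table_info("{name}")').fetchall()
    columns = [(row[1], row[2]) for row in info]

    # Clone schema: names and declared types only (no constraints or indexes).
    column_defs = ", ".join(f'"{col}" {ctype}' for col, ctype in columns)
    dst.execute(f'CREATE TABLE "{name}" ({column_defs})')

    # Copy data through a parameterized INSERT.
    placeholders = ", ".join("?" for _ in columns)
    rows = src.execute(f'SELECT * FROM "{name}"')
    dst.executemany(f'INSERT INTO "{name}" VALUES ({placeholders})', rows)
    dst.commit()
    return [col for col, _ in columns]


src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE data (id INTEGER, label TEXT)")
src.executemany("INSERT INTO data VALUES (?, ?)", [(1, "a"), (2, "b")])

dst = sqlite3.connect(":memory:")
reflected = clone_table(src, dst, "data")
copied = dst.execute("SELECT id, label FROM data ORDER BY id").fetchall()
```

The point is that the target schema is never written by hand; it is read from the source database at run time.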
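text file (CSV) to table:

import csv
from sqlalchemy import Column, String

# file_stream is an open CSV file; table is an empty Table as above
reader = csv.reader(file_stream)

# the first row holds the column names
columns = next(reader)

for column in columns:
    table.append_column(Column(column, String))

table.create()

insert = table.insert()
for row in reader:
    insert.execute(row)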
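
For comparison, the same CSV load can be sketched with only the standard library, using `sqlite3` in place of SQLAlchemy. The `csv_to_table` helper, the table name and the sample data are hypothetical.

```python
import csv
import io
import sqlite3


def csv_to_table(connection, table_name, stream):
    """Create a table from a CSV header row and load the remaining rows."""
    reader = csv.reader(stream)
    columns = next(reader)  # the first row holds the column names

    # Every column is created as TEXT, mirroring Column(column, String) above.
    column_defs = ", ".join(f'"{column}" TEXT' for column in columns)
    connection.execute(f'CREATE TABLE "{table_name}" ({column_defs})')

    # The reader is an iterator of rows, which executemany consumes directly.
    placeholders = ", ".join("?" for _ in columns)
    connection.executemany(
        f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
    )
    connection.commit()
    return columns


file_stream = io.StringIO("supplier,contract_amount\nAcme,1000\nGlobex,2500\n")
connection = sqlite3.connect(":memory:")
columns = csv_to_table(connection, "contracts", file_stream)
loaded = connection.execute("SELECT * FROM contracts ORDER BY rowid").fetchall()
```

All values arrive as strings; converting them to proper types is the job of the transformation step.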
Simple T from ETL

transformation = [
 # target field          source transformation
 ('fiscal_year',         {"function": int,
                          "field": "fiscal_year"}),
 ('region_code',         {"mapping": region_map,
                          "field": "region"}),
 ('borrower_country',    None),
 ('project_name',        None),
 ('procurement_type',    None),
 ('major_sector_code',   {"mapping": sector_code_map,
                          "field": "major_sector"}),
 ('major_sector',        None),
 ('supplier',            None),
 ('contract_amount',     {"function": currency_to_number,
                          "field": 'total_contract_amount'}),
 ]

     target fields        source transformations
Transformation

for row in source:
    result = transform(row, transformation)
    table.insert(result).execute()
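
The slides call `transform(row, transformation)` without showing its body. One possible implementation is sketched below, assuming that each source row is a dict keyed by source field name and that a `None` rule means "copy the same-named field". The `region_map` contents, `currency_to_number` and the sample row are hypothetical.

```python
def currency_to_number(text):
    """Hypothetical helper: turn a string such as '$1,250.50' into a float."""
    return float(text.replace("$", "").replace(",", ""))


# Hypothetical code table for illustration.
region_map = {"Europe and Central Asia": "ECA", "Africa": "AFR"}

transformation = [
    ("fiscal_year", {"function": int, "field": "fiscal_year"}),
    ("region_code", {"mapping": region_map, "field": "region"}),
    ("supplier", None),
    ("contract_amount", {"function": currency_to_number,
                         "field": "total_contract_amount"}),
]


def transform(row, transformation):
    """Build a target record from a source row (a dict) and a transformation list."""
    result = {}
    for target, rule in transformation:
        if rule is None:
            # Assumption: no rule means "copy the same-named source field".
            result[target] = row[target]
            continue
        value = row[rule.get("field", target)]
        if "mapping" in rule:
            value = rule["mapping"][value]   # look the value up in a code table
        if "function" in rule:
            value = rule["function"](value)  # convert the value
        result[target] = value
    return result


source_row = {
    "fiscal_year": "2012",
    "region": "Africa",
    "supplier": "Acme",
    "total_contract_amount": "$1,250.50",
}
record = transform(source_row, transformation)
```

Describing the transformation as data rather than code keeps the target field list readable in one place.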
OLAP with Cubes

Model
           {
               "name": "My Model",
               "description": ....,
               "cubes": [...],
               "dimensions": [...]
           }

cubes: measures
dimensions: levels, attributes, hierarchy
logical
physical ❄
Application, cubes (∑), Aggregation Browser backend

model = load_model("model.json")                           # step 1
workspace = create_workspace("sql",
                             model,
                             url="sqlite:///data.sqlite")  # step 2
cube = model.cube("sales")                                 # step 3
browser = workspace.browser(cube)                          # step 4
drill-down

result = browser.aggregate(cell,
                           drilldown=["sector"])

for row in result.table_rows("sector"):
    row.record["amount_sum"]
    row.label
    row.key
whole cube

cell = Cell(cube)
browser.aggregate(cell)
    Total

browser.aggregate(cell,
                  drilldown=["date"])
    2006 2007 2008 2009 2010

cut = PointCut("date", [2010])
cell = cell.slice(cut)

browser.aggregate(cell,
                  drilldown=["date"])
    Jan Feb Mar Apr May ...
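
The three aggregations above (whole cube, drill-down by year, cut at 2010 then drill-down by month) can be illustrated in plain Python. This is not the Cubes API: the `aggregate` function, the fact rows and the way a cell is modeled as a path into the date hierarchy are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows with a two-level date dimension (year, month).
facts = [
    {"year": 2009, "month": 12, "amount": 10},
    {"year": 2010, "month": 1, "amount": 20},
    {"year": 2010, "month": 1, "amount": 5},
    {"year": 2010, "month": 2, "amount": 30},
]

DATE_LEVELS = ["year", "month"]


def aggregate(facts, cell=None, drilldown=False):
    """Sum amounts inside a cell; optionally drill down to the next date level.

    The cell is modeled as a path into the date hierarchy: [] is the whole
    cube, [2010] is a point cut on the year 2010.
    """
    path = cell or []
    inside = [
        fact for fact in facts
        if all(fact[level] == value for level, value in zip(DATE_LEVELS, path))
    ]
    if not drilldown:
        return sum(fact["amount"] for fact in inside)
    # Drill down: group by the first level below the cut.
    next_level = DATE_LEVELS[len(path)]
    totals = defaultdict(int)
    for fact in inside:
        totals[fact[next_level]] += fact["amount"]
    return dict(totals)


total = aggregate(facts)                                   # whole cube
by_year = aggregate(facts, drilldown=True)                 # drill down by year
by_month_2010 = aggregate(facts, [2010], drilldown=True)   # cut, then drill down
```

Cutting to a point in the hierarchy and drilling down one level is what turns the yearly totals into monthly totals for 2010.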
How can Python
  be Useful
just the Language
 ■ saves maintenance resources
 ■ shortens development time
 ■ saves you from going insane
Source Systems         Staging Area             Operational Data Store   Datamarts

structured documents   Temporary Staging Area
databases
APIs

                       staging (L0)             relational (L1)          dimensional (L2)

faster
faster, advanced

Data Sources | Extraction, Transformation, Loading | Analysis and Presentation
Data Governance
Technologies and Utilities

understandable, maintainable
Conclusion
BI is about…



       people

technology processes
don’t forget
 metadata
Future

who is going to fix your COBOL Java tool
 if you have only Python guys around?
Python is capable, let’s start
Thank You

Twitter: @Stiivi
DataBrewery blog: blog.databrewery.org
Github: github.com/Stiivi
