Using distributed technologies
to analyze Big Data

                    Abhijit Sharma
                    Innovation Lab
                    BMC Software




                                     1
Data Explosion in Data Center
• Performance / Time Series Data
    § Incoming data rates ~ millions of data points / min
    § Data generated/server/year ~ 2 GB
    § 50 K servers ~ 100 TB data / year
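
The yearly volume follows directly from the two figures above. A quick check, assuming decimal units (1 TB = 1,000 GB); the variable names are illustrative only:

```python
# Back-of-the-envelope check of the slide's figures (decimal units assumed).
SERVERS = 50_000             # "50 K servers"
GB_PER_SERVER_PER_YEAR = 2   # "~ 2 GB" generated per server per year

total_gb = SERVERS * GB_PER_SERVER_PER_YEAR
total_tb = total_gb / 1_000  # 1 TB = 1,000 GB (decimal)

print(f"{total_gb:,} GB/year = {total_tb:g} TB/year")
```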




                                              2
Online Warehouse - Time Series
   § Extreme storage requirements – TS data for a data center e.g. last year
   § Online TS data availability i.e. no separate ETL
   § Support for common analytics operations
           § Roll-up data e.g. CPU/min to CPU/hour, CPU/day etc.
           § Slice and Dice – CPU util. for UNIX servers in SFO data center last week
           § Statistical operations: sum, count, avg., var., std., moving avg., frequency distributions, forecasting etc.
   § Ease of use – SQL interface, design schema for TS data
   § Horizontal scaling - lower cost commodity hardware
   § High R/W volume
   [Figure: Data Cube - CPU, with dimensions OS, Time and Data Center]
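
A roll-up such as CPU/min to CPU/hour amounts to averaging samples per fixed-size time bucket. The sketch below is one possible illustration in plain Python, not the deck's implementation; the sample data and the `roll_up` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-minute CPU samples: (epoch_seconds, cpu_percent).
samples = [
    (0, 10.0), (60, 20.0), (120, 30.0),   # fall into hour 0
    (3600, 40.0), (3660, 60.0),           # fall into hour 1
]

def roll_up(points, bucket_seconds):
    """Average values per fixed-size time bucket (e.g. 3600 s = 1 hour)."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds  # floor to the bucket boundary
        buckets[bucket_start].append(value)
    return {start: mean(vals) for start, vals in sorted(buckets.items())}

hourly = roll_up(samples, 3600)    # CPU/min -> CPU/hour
daily = roll_up(samples, 86400)    # CPU/min -> CPU/day
print(hourly, daily)
```

The same bucketing idea is what a SQL `GROUP BY` on a truncated timestamp expresses.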




                                                                                      3
Why not use RDBMS based Data Warehousing?
Page 4 | 6/5/11
Star schema – dimensions & facts
      § Offline data availability – ETL required – not online
      § Expensive to scale vertically – high-end hardware & software
      § Limits to vertical scaling – big data may not fit
      § Features like transactions etc. are unnecessary and an overhead for certain applications
      § Large-scale distribution/partitioning is painful – suboptimal at high W/R ratios
      § A flexible schema that can be changed on the fly is not possible

                                                                        4
High Level Architecture


  Real time continuous load of Metric & Dimension Data        Schema & Query


                         Hive – Distributed SQL


            NoSQL Column Store - HBase


            Hadoop HDFS & Map Reduce Framework




                          Map Reduce & HDFS Nodes
                                                       5
Map Reduce - Recap
Map Function
    § Apply to input data, emits reduction key and value
    § Output of Map is sorted and partitioned for use by Reducers
Reduce Function
    § Apply to data grouped by reduction key
    § Often ‘reduces’ data (for example – sum(values))
Mappers and Reducers can be chained together
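
The recap above can be mimicked in a single process. This is a minimal sketch of the map, sort/group and reduce phases in plain Python, not Hadoop's API; `map_fn`, `reduce_fn`, `run_map_reduce` and the sample records are hypothetical names chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(record):
    """Map: emit (reduction key, value) pairs for one input record."""
    host, cpu = record
    yield host, cpu

def reduce_fn(key, values):
    """Reduce: collapse all values that share a key (here: sum)."""
    return key, sum(values)

def run_map_reduce(records):
    # Map phase: apply map_fn to every input record.
    emitted = [pair for record in records for pair in map_fn(record)]
    # Shuffle/sort phase: sort by key so equal keys become adjacent.
    emitted.sort(key=itemgetter(0))
    # Reduce phase: one reduce_fn call per distinct key.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(emitted, key=itemgetter(0))
    ]

records = [("host1", 20), ("host2", 10), ("host1", 5)]
result = run_map_reduce(records)
print(result)
```

Chaining corresponds to feeding `result` into another `run_map_reduce` call with different map and reduce functions.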




                                                                                                6
HDFS Sweet spot
      § Big Data Storage : Optimized for large files (ETL)
      § Writes are create, append, and large
      § Reads are mostly big and streaming
      § Throughput is more important than latency
      § Distributed, HA, Transparent Replication




                                                             7
When is raw HDFS unsuitable?
• Mutable data – Create, Update, Delete
• Small writes
• Random reads, % of small reads
• Structured data
• Online access to data – HDFS Loading is
   offline / batch process


                                            8
NoSQL Data stores - Column
         § Excellent W/R concurrent performance – fast writes and fast reads (random and sequential) – this is required for near real time update of data to TS Data
         § Distributed architecture, horizontal scaling, transparent replication of data
         § Highly Available (HA) and Fault Tolerant (FT) for no SPOF – shared nothing architecture
         § Reasonably rich data model
         § Flexible in terms of schema – amenable to ad-hoc changes even at runtime



                                                                  9
HBase
         § (Table, Row, Column Family:Column, Timestamp) tuple maps to a stored value
         § Table is split into multiple equal sized regions each of which is a range of sorted keys (partitioned automatically by the key)
         § Ordered Rows by key, Ordered columns in a Column Family
         § Table schema defines Column Families
         § Rows can have different number of columns
         § Columns have value and versions (any number)
         § Column range and key range queries

          Row Key        Column Family (dimensions)       Column Family (metric)
          112334-7782    server:host1  dc:PUNE            value:20
          112334-7783    server:host2                     value:10
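
The data model above can be pictured as a sorted, nested map. The toy class below is only an in-memory illustration of that idea and does not use the HBase client API; `TinyTable`, the family names `dimensions` and `metric` (taken from the table headers), and the extra version with value 15 are assumptions made for the example.

```python
from bisect import bisect_left

class TinyTable:
    """Toy model: row key -> {"family:column": {timestamp: value}}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, ts):
        # Rows may hold different columns; each cell keeps its versions by timestamp.
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_key, stop_key):
        """Key range query: row keys with start_key <= key < stop_key, in order."""
        keys = sorted(self.rows)
        return keys[bisect_left(keys, start_key):bisect_left(keys, stop_key)]

t = TinyTable()
t.put("112334-7782", "dimensions:server", "host1", ts=1)
t.put("112334-7782", "dimensions:dc", "PUNE", ts=1)
t.put("112334-7782", "metric:value", 20, ts=1)
t.put("112334-7783", "dimensions:server", "host2", ts=1)
t.put("112334-7783", "metric:value", 10, ts=1)
t.put("112334-7783", "metric:value", 15, ts=2)  # hypothetical newer version

print(t.get("112334-7783", "metric:value"))
print(t.scan("112334-7782", "112334-7783"))
```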

                                                                                      10
Hive – Distributed SQL > MR
       § MR is not easy to code for analytics tasks (e.g. group, aggregate etc.); chaining several Mappers & Reducers is required
       § Hive provides familiar SQL queries which automatically get translated to a flow of appropriate Mappers and Reducers that execute the query leveraging MR
       § Leverages Hadoop ecosystem - MR, HDFS, HBase
       § Hive keeps its own meta-tables, which store the table schemas its SQL queries run against along with other metadata
       § Storage Handlers for HDFS, HBase
       § Hive SQL supports common SQL clauses: select, filter, grouping, aggregation, insert etc.
       § Hive stores data divided into partitions (you can specify the partitioning key while loading Hive tables) and buckets (useful for statistical operations like sampling)
       § Hive queries can also include custom map/reduce tasks as scripts
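
To make the partition and bucket idea concrete, here is a rough sketch of how a row could be placed: the partition column value picks a directory and a hash of the bucket column picks a bucket. It is an approximation only. `zlib.crc32` is a deterministic stand-in for Hive's own hash function, and `placement`, `NUM_BUCKETS` and the rows are hypothetical.

```python
import zlib

NUM_BUCKETS = 4  # hypothetical bucket count

def placement(row, partition_col, bucket_col, num_buckets=NUM_BUCKETS):
    """Return (partition directory, bucket number) for one row."""
    partition_dir = f"{partition_col}={row[partition_col]}"
    # crc32 is a stand-in here; Hive applies its own hash to the bucket column.
    bucket = zlib.crc32(str(row[bucket_col]).encode()) % num_buckets
    return partition_dir, bucket

rows = [
    {"dc": "PUNE", "server": "host1", "value": 20},
    {"dc": "PUNE", "server": "host2", "value": 10},
    {"dc": "SFO", "server": "host3", "value": 30},
]
placements = [placement(r, "dc", "server") for r in rows]

# Sampling a single bucket reads roughly 1/NUM_BUCKETS of the rows.
sample = [r for r, (_, b) in zip(rows, placements) if b == 0]
print(placements)
```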

                                                                                              11
Hive Queries - CREATE

TABLE
  CREATE TABLE wordfreq (word STRING, freq INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE;

  LOAD DATA LOCAL INPATH 'freq.txt' OVERWRITE INTO TABLE wordfreq;

EXTERNAL TABLE
  CREATE EXTERNAL TABLE iops (key string, os string, deploymentsize string, ts int, value int)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,data:os,data:deploymentSize,data:ts,data:value");
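
The entries in "hbase.columns.mapping" line up with the Hive columns by position. The short sketch below only illustrates that pairing for the iops table; `pair_columns` is a hypothetical helper and not part of Hive or HBase.

```python
hive_columns = ["key", "os", "deploymentsize", "ts", "value"]
mapping = ":key,data:os,data:deploymentSize,data:ts,data:value"

def pair_columns(columns, columns_mapping):
    """Pair each Hive column with its HBase target, position by position."""
    targets = [entry.strip() for entry in columns_mapping.split(",")]
    if len(targets) != len(columns):
        raise ValueError("mapping must have one entry per Hive column")
    paired = {}
    for column, target in zip(columns, targets):
        if target == ":key":
            paired[column] = ("<row key>", None)    # the HBase row key
        else:
            family, qualifier = target.split(":", 1)
            paired[column] = (family, qualifier)    # column family, column
    return paired

pairs = pair_columns(hive_columns, mapping)
for hive_col, target in pairs.items():
    print(hive_col, "->", target)
```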




                                                                    12
Hive Queries - SELECT

TABLE
  select * from wordfreq where freq > 100 sort by freq desc limit 3;
  explain select * from wordfreq where freq > 100 sort by freq desc limit 3;
  select freq, count(*) AS f2 from wordfreq group by freq sort by f2 desc limit 3;

EXTERNAL TABLE
  select ts, avg(value) as cpu from cpu_util_5min group by ts;
  select architecture, avg(value) as cpu from cpu_util_5min group by architecture;




                                                                                13
Hive – SQL -> Map Reduce
     CPU utilization / 5 min with dimensions server, server-type, cluster, data-center; filtered by server-type = Unix and grouped by timestamp

     SELECT timestamp, AVG(value)
     FROM timeseries WHERE server_type = 'Unix'
     GROUP BY timestamp

     [Figure: timeseries -> Map -> Shuffle / Sort -> Reduce]
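
A simplified picture of how the query above could map onto the three phases: the WHERE clause becomes a filter in the mapper, the GROUP BY column becomes the reduction key, and AVG is computed per key in the reducer. This is a plain Python sketch and not what Hive actually generates; the sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows of the timeseries table.
timeseries = [
    {"timestamp": 300, "server_type": "Unix", "value": 20.0},
    {"timestamp": 300, "server_type": "Unix", "value": 40.0},
    {"timestamp": 300, "server_type": "Windows", "value": 90.0},
    {"timestamp": 600, "server_type": "Unix", "value": 10.0},
]

def mapper(row):
    """WHERE is applied here; the GROUP BY column is emitted as the key."""
    if row["server_type"] == "Unix":
        yield row["timestamp"], row["value"]

def shuffle(pairs):
    """Group emitted values by key and return the groups in key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    """AVG(value) is computed once per group."""
    return key, sum(values) / len(values)

pairs = [kv for row in timeseries for kv in mapper(row)]
result = [reducer(key, values) for key, values in shuffle(pairs)]
print(result)
```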




                                                                                                                               14
Thanks



         15

Using the cloud and distributed technologies to analyze big data in the enterprise - Indicthreads cloud computing conference 2011
