SQL Server Live! Orlando 2012




                          Hadoop and its Ecosystem Components in Action
                          Andrew Brust
                          CEO and Founder, Blue Badge Insights
                          Level: Intermediate




                      Meet Andrew
                          •   CEO and Founder, Blue Badge Insights
                          •   Big Data blogger for ZDNet
                          •   Microsoft Regional Director, MVP
                          •   Co-chair of VSLive! and speaker for 17 years
                          •   Founder, Microsoft BI User Group of NYC
                              – http://www.msbinyc.com
                          •   Co-moderator, NYC .NET Developers Group
                              – http://www.nycdotnetdev.com
                          •   “Redmond Review” columnist for
                              Visual Studio Magazine and Redmond Developer
                              News
                          •   brustblog.com, Twitter: @andrewbrust








                          My New Blog (bit.ly/bigondata)




                          MapReduce, in a Diagram


                                  [Diagram: input splits feed parallel mappers; mapper outputs are
                                  partitioned by key (K1, K2, K3); each reducer consumes one key's
                                  partition, and the reducer outputs merge into the final output]








                          A MapReduce Example


                                              • Count by suite, on each floor

                                              • Send per-suite, per-platform totals to lobby

                                              • Sort totals by platform

                                              • Send two platform packets to 10th, 20th, 30th floor

                                              • Tally up each platform

                                              • Collect the tallies

                                              • Merge tallies into one spreadsheet
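
                          The same flow can be sketched as a Hadoop Streaming job. A minimal,
                          hypothetical example, assuming a tab-delimited devices.txt in HDFS whose
                          second field is the platform (the file name, field position, and streaming
                          jar path are illustrative and vary by Hadoop version):

                              # Mapper emits the platform as the key; Hadoop sorts and groups by key
                              # between map and reduce, so 'uniq -c' in the reducer tallies each platform
                              hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
                                -input /data/devices.txt \
                                -output /data/platform-counts \
                                -mapper 'cut -f2' \
                                -reducer 'uniq -c'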




                          What’s a Distributed File System?
                          •   One where data gets distributed over
                              commodity drives on commodity servers
                          •   Data is replicated
                          •   If one box goes down, no data lost
                              – Except the name node, a single point of failure (SPOF)!
                          •   BUT: HDFS is immutable
                              – Files can only be written to once
                              – So updates require drop + re-write (slow)








                          Hadoop = MapReduce + HDFS
                          •    Modeled after Google MapReduce + GFS
                          •    Have more data? Just add more nodes to
                               the cluster.
                                – Mappers execute in parallel
                                – Hardware is commodity
                                – “Scaling out”
                          •    Use of HDFS means data may well be local
                               to mapper processing
                          •    So, not just parallel, but minimal data
                               movement, which avoids network
                               bottlenecks




                          The Hadoop Stack
                              Log file integration: Flume
                              Machine Learning/Data Mining: Mahout
                              RDBMS Import/Export: Sqoop
                              Query: HiveQL (Hive) and Pig Latin (Pig)
                              Database: HBase
                              MapReduce, HDFS (Hadoop core)








                          Ways to work
                          •   Amazon Web Services Elastic MapReduce
                              – Create AWS account
                              – Select Elastic MapReduce in Dashboard
                          •   Microsoft Hadoop on Azure
                              – Visit www.hadooponazure.com
                              – Request invite
                          •   Cloudera CDH VM image
                              – Download
                              – Run via VMware Player




                          Amazon Elastic MapReduce
                          •   Lots of steps!
                          •   At a high level:
                              – Set up AWS account and S3 “buckets”
                              – Generate Key Pair and PEM file
                              – Install Ruby and EMR Command Line Interface
                              – Provision the cluster using CLI
                              – Set up and run SSH/PuTTY
                              – Work interactively at command line








                          Amazon EMR – Prep Steps
                          •   Create an AWS account
                          •   Create an S3 bucket for log storage
                              – with list permissions for authenticated users
                          •   Create a Key Pair and save PEM file
                          •   Install Ruby
                          •   Install Amazon Web Services Elastic
                              MapReduce Command Line Interface
                              – aka AWS EMR CLI
                          •   Create credentials.json in EMR CLI folder
                              – Associate it with the same region where the key pair was created
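
                          A hedged sketch of credentials.json, with placeholder values (the field
                          names follow the Ruby EMR CLI's documented format of the era; verify
                          against your CLI version):

                              {
                                "access_id": "<AWS access key>",
                                "private_key": "<AWS secret key>",
                                "keypair": "<key pair name>",
                                "key-pair-file": "<path to PEM file>",
                                "log_uri": "s3n://<bucket>/logs/",
                                "region": "us-east-1"
                              }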




                          Amazon – Security and Startup
                          •   Security
                              –   Download PuTTYgen and run it
                              –   Click Load and browse to PEM file
                              –   Save it in PPK format
                              –   Exit PuTTYgen
                          •   In a command window, navigate to EMR CLI
                              folder and enter command:
                              – ruby elastic-mapreduce --create --alive [--num-instances xx]
                                [--pig-interactive] [--hive-interactive] [--hbase --instance-type
                                m1.large]
                          •   In AWS Console, go to EC2 Dashboard and
                              click Instances on left nav bar
                          •   Wait until instance is running and get its
                              Public DNS name
                              – Use Compatibility View in IE, or copying may not work
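
                          For example, a hedged invocation that provisions a live four-node cluster
                          with interactive Hive (the instance count is illustrative):

                              ruby elastic-mapreduce --create --alive --num-instances 4 --hive-interactive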








                          Connect!
                          •   Download and run PuTTY
                          •   Paste DNS name of EC2 instance into hostname
                              field
                          •   In Treeview, drill down and navigate to
                              Connection > SSH > Auth, browse to PPK file
                          •   Once EC2 instance(s) running, click Open
                          •   Click Yes to “The server’s host key is not cached
                              in the registry…” PuTTY Security Alert
                          •   When prompted for user name, type “hadoop” and
                              hit Enter
                          •   cd bin, then run hive, pig, or hbase shell
                          •   Right-click to paste from clipboard; option to go
                              full-screen
                          •   (Kill EC2 instance(s) from Dashboard when done)




                           Amazon Elastic MapReduce








                          Microsoft Hadoop on Azure
                          •   Much simpler
                          •   Browser-based portal
                              – Provisioning cluster, managing ports, MapReduce jobs
                              – External data from Azure BLOB storage
                          •   Interactive JavaScript console
                              – HDFS, Pig, light data visualization
                          •   Interactive Hive console
                              – Hive commands and metadata discovery
                          •   From Portal page you can RDP directly to
                              Hadoop head node
                              – Double-click desktop shortcut for CLI access
                              – Certain environment variables may need to be set




                           Microsoft HDInsight








                          Hadoop commands
                          •   HDFS
                              – hadoop fs -filecommand
                              – Create and remove directories:
                                 -mkdir, -rm, -rmr
                              – Upload and download files to/from HDFS:
                                 -get, -put
                              – View directory contents:
                                 -ls, -lsr
                              – Copy, move, view files:
                                 -cp, -mv, -cat
                          •   MapReduce
                              – Run a Java jar-file based job:
                                 hadoop jar jarname params
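
                          A few concrete invocations (the paths, file, jar, and class names are
                          hypothetical):

                              hadoop fs -mkdir /user/hadoop/input
                              hadoop fs -put sales.log /user/hadoop/input
                              hadoop fs -ls /user/hadoop/input
                              hadoop fs -cat /user/hadoop/input/sales.log
                              hadoop jar wordcount.jar WordCount /user/hadoop/input /user/hadoop/output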




                           Hadoop (directly)








                          HBase
                          •   Concepts:
                              – Tables, column families
                              – Columns, rows
                              – Keys, values
                          •   Commands:
                              – Definition: create, alter, drop, truncate
                              – Manipulation: get, put, delete, deleteall, scan
                              – Discovery: list, exists, describe, count
                              – Enablement: disable, enable
                              – Utilities: version, status, shutdown, exit
                              – Reference: http://wiki.apache.org/hadoop/Hbase/Shell




                          HBase Examples
                          •   create 't1', 'f1', 'f2', 'f3'
                          •   describe 't1'
                          •   alter 't1', {NAME => 'f1',
                              VERSIONS => 5}
                          •   put 't1', 'r1', 'c1', 'value', ts1
                          •   get 't1', 'r1'
                          •   count 't1'
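
                          A few more shell commands against the same table t1 ('f1:c1' denotes
                          column c1 in family f1; syntax per the shell reference above):

                              scan 't1', {COLUMNS => ['f1'], LIMIT => 10}
                              delete 't1', 'r1', 'f1:c1'
                              disable 't1'    # a table must be disabled before it can be dropped
                              drop 't1'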








                           HBase




                          Hive
                          •   Used by most BI products that connect
                              to Hadoop
                          •   Provides a SQL-like abstraction over
                              Hadoop
                              – Officially HiveQL, or HQL
                          •   Works on own tables, but also on HBase
                          •   Query generates MapReduce job, output of
                              which becomes result set
                          •   Microsoft has Hive ODBC driver
                              – Connects Excel, Reporting Services, PowerPivot,
                                Analysis Services Tabular Mode (only)








                          Hive, Continued
                          •   Load data from flat HDFS files
                              – LOAD DATA LOCAL INPATH
                                './examples/files/kv1.txt'
                                OVERWRITE INTO TABLE pokes;
                          •   SQL Queries
                              – CREATE, ALTER, DROP
                              – INSERT OVERWRITE (creates whole tables)
                              – SELECT, JOIN, WHERE, GROUP BY
                              – SORT BY, but ordering data is tricky!
                              – USING allows for streaming on map, reduce steps
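
                          As an end-to-end sketch, the pokes table above can be created and queried
                          from the command line (the column names follow the classic Hive
                          getting-started example; adjust to your data):

                              hive -e "CREATE TABLE pokes (foo INT, bar STRING);"
                              hive -e "SELECT bar, COUNT(*) FROM pokes GROUP BY bar;"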




                           Hive








                          Pig
                          •   Instead of SQL, employs a language (“Pig
                              Latin”) that accommodates data flow
                              expressions
                              – Do a combo of Query and ETL
                          •   “10 lines of Pig Latin ≈ 200 lines of Java.”
                          •   Works with structured or unstructured data
                          •   Operations
                              – As with Hive, a MapReduce job is generated
                              – Unlike Hive, output is only flat file to HDFS
                              – With MS Hadoop, can easily convert to JavaScript array,
                                then manipulate
                          •   Use command line (“Grunt”) or build scripts




                          Example
                          •   A = LOAD 'myfile'
                                AS (x, y, z);
                              B = FILTER A BY x > 0;
                              C = GROUP B BY x;
                              D = FOREACH C GENERATE
                                group, COUNT(B);
                              STORE D INTO 'output';
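
                          To run this in batch mode rather than line-by-line in Grunt, save it to a
                          file (myscript.pig is a hypothetical name) and invoke Pig from the shell:

                              pig myscript.pig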








                          Pig Latin Examples
                          •   Imperative, file system commands
                              – LOAD, STORE
                                  Schema specified on LOAD
                          •   Declarative, query commands (SQL-like)
                              – xxx = table/file
                              – FOREACH xxx GENERATE (SELECT…FROM xxx)
                              – JOIN (WHERE/INNER JOIN)
                              – FILTER xxx BY (WHERE)
                              – ORDER xxx BY (ORDER BY)
                              – GROUP xxx BY / GENERATE COUNT(xxx)
                                (SELECT COUNT(*) GROUP BY)
                              – DISTINCT (SELECT DISTINCT)
                          •   Syntax is assignment statement-based:
                              – MyCusts = FILTER Custs BY SalesPerson == 15
                          •   COGROUP, UDFs
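
                          A brief COGROUP sketch, reusing the hypothetical Custs relation above plus
                          an equally hypothetical Orders relation (Region is an assumed shared
                          field); COGROUP groups two relations by a shared key in one statement:

                              Grouped = COGROUP Custs BY Region, Orders BY Region;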




                           Pig








                          Sqoop
                          sqoop import
                           --connect
                            "jdbc:sqlserver://<servername>.
                             database.windows.net:1433;
                             database=<dbname>;
                             user=<username>@<servername>;
                             password=<password>"
                           --table <from_table>
                           --target-dir <to_hdfs_folder>
                           --split-by <from_table_column>
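
                          A filled-in import, with hypothetical server, database, credentials,
                          table, and column names (the password placeholder is left as-is):

                          sqoop import
                           --connect
                            "jdbc:sqlserver://myserver.
                             database.windows.net:1433;
                             database=SalesDB;
                             user=sqluser@myserver;
                             password=<password>"
                           --table Customers
                           --target-dir /data/customers
                           --split-by CustomerID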




                          Sqoop
                          sqoop export
                           --connect
                            "jdbc:sqlserver://<servername>.
                             database.windows.net:1433;
                             database=<dbname>;
                             user=<username>@<servername>;
                             password=<password>"
                           --table <to_table>
                           --export-dir <from_hdfs_folder>
                           --input-fields-terminated-by
                            "<delimiter>"








                          Flume NG
                          •     Sources
                                 – Avro (data serialization system – can read JSON-
                                   encoded data files, and can work over RPC)
                                 – Exec (reads from stdout of long-running process)
                          •     Sinks
                                 – HDFS, HBase, Avro
                          •     Channels
                                 – Memory, JDBC, file




                          Flume NG
                          •     Set up conf/flume.conf
                          # Define a memory channel called ch1 on agent1
                          agent1.channels.ch1.type = memory

                          # Define an Avro source called avro-source1 on agent1 and tell it
                          # to bind to 0.0.0.0:41414. Connect it to channel ch1.
                          agent1.sources.avro-source1.channels = ch1
                          agent1.sources.avro-source1.type = avro
                          agent1.sources.avro-source1.bind = 0.0.0.0
                          agent1.sources.avro-source1.port = 41414

                          # Define a logger sink that simply logs all events it receives
                          # and connect it to the other end of the same channel.
                          agent1.sinks.log-sink1.channel = ch1
                          agent1.sinks.log-sink1.type = logger

                          # Finally, now that we've defined all of our components, tell
                          # agent1 which ones we want to activate.
                          agent1.channels = ch1
                          agent1.sources = avro-source1
                          agent1.sinks = log-sink1



                          •     From the command line:
                          flume-ng agent --conf ./conf/ -f conf/flume.conf -n agent1








                          Mahout Algorithms
                          •   Recommendation
                              – Your info + community info
                              – Give users/items/ratings; get user-user/item-item
                              – itemsimilarity
                          •   Classification/Categorization
                              – Drop into buckets
                              – Naïve Bayes, Complementary Naïve Bayes, Decision
                                Forests
                          •   Clustering
                              – Like classification, but with categories unknown
                              – K-Means, Fuzzy K-Means, Canopy, Dirichlet, Mean-Shift




                          Workflow, Syntax
                          •   Workflow
                              – Run the job
                              – Dump the output
                              – Visualize, predict
                          •   mahout algorithm
                                --input folderspec
                                --output folderspec
                                --param1 value1
                                --param2 value2
                                …
                          •   Example:
                              – mahout itemsimilarity
                                  --input <input-hdfs-path>
                                  --output <output-hdfs-path>
                                  --tempDir <tmp-hdfs-path>
                                  -s SIMILARITY_LOGLIKELIHOOD
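
                          To inspect the results (the “dump the output” step of the workflow), read
                          the job's part files back from HDFS; the part-file name shown is typical
                          but version-dependent:

                              hadoop fs -cat <output-hdfs-path>/part-r-00000 | head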








                          Resources
                          •   Big On Data blog
                              – http://www.zdnet.com/blog/big-data
                          •   Apache Hadoop home page
                              – http://hadoop.apache.org/
                          •   Hive & Pig home pages
                              – http://hive.apache.org/
                              – http://pig.apache.org/
                          •   Hadoop on Azure home page
                              – https://www.hadooponazure.com/
                          •   SQL Server 2012 Big Data
                              – http://bit.ly/sql2012bigdata




                          Thank you



                          •   andrew.brust@bluebadgeinsights.com
                          •   @andrewbrust on Twitter
                          •   Get Blue Badge’s free briefings
                              – Text “bluebadge” to 22828



