Hadoop in a Windows Shop


            Abuna Demoz – Abuna@AdGooroo.com
            Brad Vah – Bvah@AdGooroo.com
            Mike Schiro – Mschiro@AdGooroo.com
            Twitter: @AdGooroo @abuna
Who Is AdGooroo?
• Founded in 2004
• We are the largest provider of Search Intelligence in the world
• Our customers include:
   –   Agencies
   –   CMOs
   –   Marketing Managers
   –   Digital Ad Sales
   –   Over 4,000 users
• Global Scale
   – 50 Countries
   – 14 Search Engines
   – 14 Ad Networks
AdGooroo Insight Suite™




©2012 AdGooroo, LLC. All Rights Reserved.
Paid Search




Natural Search
Why we deployed Hadoop
Hadoop Administration
Learning Curve

• Where is Hadoop going to fit?
• How do we leverage existing tools?
• Linux can be less forgiving
  – rm -rf /*
• Who names these things?
Integration Points

• Active Directory != LDAP
• Create a seamless user experience
• Domjoin in 30 simple steps
  – Tip: It’s usually safe to blame Kerberos
Integration Points – Data Transfer

• SMB works…mostly
  – Flaky connectivity
  – Relatively slow transfer for GigE
• NFS
  – Client Services for NFS
  – Much faster transfer speeds
Integration Points – Data Transfer

• MountableHDFS/HDFS_Fuse
  – Fuse -> NFS -> Windows
    • We tried it. You should not.
  – SCP (Windows) -> NFS -> Fuse
    • Messy, but it works.
    • Don’t often need to use it
Monitoring and Management

• Operations Manager (MOM/SCOM)
  – Native Linux monitoring
  – Custom Management packs for Hadoop
• Opalis
  – Workflow automation
• Configuration Manager (SCCM)
  – Quest Management Xtensions for *nix
Final Thoughts

• Hadoop and Windows can live together.
• Microsoft is starting to figure out this
  whole “open-source” thing.
  – MSSQL connectors for Hadoop
  – ODBC driver for Hive
  – Interop initiatives
• When in doubt, blame Kerberos.
• Roll your own repo.
Hadoop Development
Environments

• Windows
  – Visual Studio, SQL Server, etc.
  – Physical workstations
• Linux
  – Getting reacquainted with an old friend
  – New suite of tools
  – Cloudera VM
     • RAMRAMRAMRAMRAMRAMRAMRAMRAM
Languages

• Java
  – Straightforward transition from the .NET world
  – Hmm…How do I create that JAR again?
• Python/Bash
  – Utilized a lot more than expected
• HiveQL
  – Simple transition from SQL
  – Custom UDFs
Unexpected Roadblocks – Avro

• Assumption:
  – Works with .NET
     • Can serialize files to be read by Java Map/Reduce

• Reality:
  – .NET compatibility not fully baked
     • Any files written in .NET could not be read in Java.
        – The C# side neither reads nor writes the header
        – JIRA: AVRO-823
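A quick way to spot files affected by this kind of header problem is to look at their first bytes. The sketch below is a minimal illustration, not the check the team actually used: it relies only on the fact that the Avro specification starts every object container file with the four bytes "Obj" followed by 0x01. The helper names and sample file names are hypothetical, and the check does not validate the rest of the header (metadata and sync marker).

```python
import os
import tempfile

# The Avro specification defines the object container file header as
# starting with these four bytes: ASCII "Obj" followed by the byte 0x01.
AVRO_MAGIC = b"Obj\x01"


def has_avro_header(path):
    """Return True if the file at `path` begins with the Avro magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(len(AVRO_MAGIC)) == AVRO_MAGIC


def write_bytes(directory, name, payload):
    """Hypothetical helper: write `payload` to a file and return its path."""
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(payload)
    return path


with tempfile.TemporaryDirectory() as workdir:
    # Stand-ins for real files: one with the magic bytes, one without.
    with_header = write_bytes(workdir, "with_header.avro", AVRO_MAGIC + b"rest")
    headerless = write_bytes(workdir, "headerless.avro", b"raw records only")
    print(has_avro_header(with_header))   # True
    print(has_avro_header(headerless))    # False
```

Running a check like this on the writer's output before shipping files to the cluster can catch a missing header earlier than a failed Map/Reduce job would.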
Unexpected Roadblocks – Flume
• Assumption:
  – We’ll use Flume for Windows

• Reality:
  – Overkill for our needs
  – Implementation woes

• Solution:
  – Custom log collector service
  – Converts data to Avro files
Unexpected Roadblocks – Thrift
• Assumption:
  – We’ll use Thrift to talk to HBase from .NET

• Reality:
  – HBase.thrift does not support C# yet

• Solution:
  – Convert Thrift Java code-gen to .NET
     • Some community work already done here
       (https://bitbucket.org/vadim/hbase-sharp)
As Advertised – Sqoop
• Simple
• Fast route to POC
  – Imports
  – Exports
• Minor “gotchas”
  – Delimiters
  – Large exports to SQL Server
     • Use “--batch” mode
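To make the two gotchas concrete, here is a rough sketch of how a `sqoop export` command line with an explicit input delimiter and `--batch` might be assembled from a script. The host, database, table, and HDFS path are made-up placeholders, credential options are left out on purpose, and the command is only printed, never executed.

```python
import shlex


def sqoop_export_args(jdbc_url, table, export_dir, field_delimiter="\\t"):
    """Build an argument list for `sqoop export` with batch mode enabled.

    `field_delimiter` defaults to the two characters backslash-t, which
    Sqoop interprets as a tab.
    """
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        "--export-dir", export_dir,
        # State the delimiter explicitly instead of relying on defaults.
        "--input-fields-terminated-by", field_delimiter,
        # Batch mode for the underlying statement execution.
        "--batch",
    ]


# Placeholder values, assumed for illustration only.
args = sqoop_export_args(
    "jdbc:sqlserver://sqlhost.example.com:1433;databaseName=Reports",
    "KeywordStats",
    "/user/etl/keyword_stats",
)
print(" ".join(shlex.quote(arg) for arg in args))
```

Keeping the arguments as a list until the last moment avoids shell quoting problems with the semicolon in the SQL Server JDBC URL.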
As Advertised – Hive

• Very similar to SQL
• “Quick” data analysis
  – Results without crippling your existing RDBMS
• HBase storage handler
  – Provides an easy point of entry for reading and
    manipulating HBase data
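As an illustration of the HBase storage handler, the sketch below generates the kind of Hive DDL that maps a Hive table onto an existing HBase table. The storage handler class name and the `hbase.columns.mapping` and `hbase.table.name` properties come from Hive's HBase integration; the function, table names, and columns are hypothetical, and the row key is assumed to be a string.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, key_column, mapped_columns):
    """Return Hive DDL for an external table stored in HBase.

    `mapped_columns` is a list of (hive_column, hive_type, "family:qualifier")
    tuples. The HBase row key is mapped to `key_column` as a STRING.
    """
    column_defs = [f"{key_column} STRING"]
    mapping = [":key"]  # ":key" binds the first Hive column to the row key
    for name, hive_type, hbase_column in mapped_columns:
        column_defs.append(f"{name} {hive_type}")
        mapping.append(hbase_column)
    columns_sql = ", ".join(column_defs)
    mapping_sql = ",".join(mapping)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({columns_sql})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping_sql}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )


# Hypothetical table and column names, for illustration only.
print(hbase_backed_table_ddl(
    "keyword_stats",
    "keyword_stats_hbase",
    "rowkey",
    [("clicks", "BIGINT", "stats:clicks"), ("engine", "STRING", "meta:engine")],
))
```

Once a table like this exists, ordinary HiveQL queries can read the HBase data without a separate export step.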
Final Thoughts
• Don’t overthink it!
  – Just because you can doesn’t mean you should

• Modularity
  – Easy to be overwhelmed by all the moving parts
  – Flatten the learning curve by taking it one piece at
    a time
We’re Hiring


jobs@adgooroo.com

abuna@adgooroo.com
bvah@adgooroo.com
mschiro@adgooroo.com

Hadoop in a Windows Shop - CHUG - 20120416


Editor’s Notes

  1. Insight Suite all