Hadoop in a Windows Shop - CHUG - 20120416
1. Hadoop in a Windows Shop
Abuna Demoz – Abuna@AdGooroo.com
Brad Vah – Bvah@AdGooroo.com
Mike Schiro – Mschiro@AdGooroo.com
Twitter: @AdGooroo @abuna
2. Who Is AdGooroo?
• Founded in 2004
• We are the largest provider of Search Intelligence in the world
• Our customers include:
– Agencies
– CMOs
– Marketing Managers
– Digital Ad Sales
– Over 4,000 users
• Global Scale
– 50 Countries
– 14 Search Engines
– 14 Ad Networks
7. Learning Curve
• Where is Hadoop going to fit?
• How do we leverage existing tools?
• Linux can be less forgiving
– rm -rf /*
• Who names these things?
8. Integration Points
• Active Directory != LDAP
• Create a seamless user experience
• Domjoin in 30 simple steps
– Tip: It’s usually safe to blame Kerberos
9. Integration Points – Data Transfer
• SMB works…mostly
– Flaky connectivity
– Relatively slow transfer for GigE
• NFS
– Client Services for NFS
– Much faster transfer speeds
10. Integration Points – Data Transfer
• MountableHDFS/HDFS_Fuse
– Fuse -> NFS -> Windows
• We tried it. You should not.
– SCP (Windows) -> NFS -> Fuse
• Messy, but it works.
• Don’t often need to use it
11. Monitoring and Management
• Operations Manager (MOM/SCOM)
– Native Linux monitoring
– Custom Management packs for Hadoop
• Opalis
– Workflow automation
• Configuration Manager (SCCM)
– Quest Management Xtensions for *nix
12. Final Thoughts
• Hadoop and Windows can live together.
• Microsoft is starting to figure out this whole “open-source” thing.
– MSSQL connectors for Hadoop
– ODBC driver for Hive
– Interop initiatives
• When in doubt, blame Kerberos.
• Roll your own repo.
14. Environments
• Windows
– Visual Studio, SQL Server, etc
– Physical workstations
• Linux
– Getting reacquainted with an old friend
– New suite of tools
– Cloudera VM
• RAMRAMRAMRAMRAMRAMRAMRAMRAM
15. Languages
• Java
– Straightforward transition from the .NET world
– Hmm…How do I create that JAR again?
• Python/Bash
– Utilized a lot more than expected
• HiveQL
– Simple transition from SQL
– Custom UDFs
16. Unexpected Roadblocks - AVRO
• Assumption:
– Works with .NET
• Can serialize files to be read by Java Map/Reduce
• Reality:
– .NET compatibility not fully baked
• Any files written in .NET could not be read in Java.
– C# side neither reads nor writes the header
– JIRA: AVRO-823
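A quick way to confirm this kind of header problem is to check a file for the Avro container magic. Per the Avro specification, an object container file starts with the four bytes 'O', 'b', 'j', 0x01. The sketch below is a diagnostic only, not the fix tracked in the JIRA; the function names are made up for illustration.

```python
import io

# First four bytes of every Avro object container file, per the Avro spec.
AVRO_MAGIC = b"Obj\x01"


def has_avro_header(stream):
    """Return True if a binary stream starts with the Avro container magic."""
    return stream.read(len(AVRO_MAGIC)) == AVRO_MAGIC


def file_has_avro_header(path):
    """Same check for a file on disk (hypothetical helper name)."""
    with open(path, "rb") as f:
        return has_avro_header(f)


if __name__ == "__main__":
    # In-memory stand-ins: one stream with the header, one with bare bytes.
    with_header = io.BytesIO(AVRO_MAGIC + b"remaining-bytes")
    without_header = io.BytesIO(b"\x02\x06foo")
    print(has_avro_header(with_header), has_avro_header(without_header))
```

Running the check on a file produced by the .NET writer before handing it to a Java Map/Reduce job would show whether the header is the missing piece.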
17. Unexpected Roadblocks – Flume
• Assumption:
– We’ll use Flume for Windows
• Reality:
– Overkill for our needs
– Implementation woes
• Solution:
– Custom log collector service
– Converts data to AVRO file
18. Unexpected Roadblocks – Thrift
• Assumption:
– We’ll use Thrift to talk to HBase from .NET
• Reality:
– HBase.thrift does not support C# yet
• Solution:
– Convert Thrift Java code-gen to .NET
• Some community work already done here (https://bitbucket.org/vadim/hbase-sharp)
19. As Advertised - Sqoop
• Simple
• Fast route to POC
– Imports
– Exports
• Minor “gotchas”
– Delimiters
– Large exports to SQL Server
• Use “--batch” mode
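To make the two gotchas concrete, here is a minimal sketch that assembles a Sqoop export command with an explicit input delimiter and the --batch flag. The host, database, table, and path names are placeholders, not values from the talk, and the helper itself is hypothetical. It only builds the argument list; it does not run Sqoop.

```python
import shlex


def build_sqoop_export(jdbc_url, table, export_dir, username,
                       field_delim=r"\t", batch=True):
    """Build the argument list for a `sqoop export` run (hypothetical helper)."""
    cmd = [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--username", username,
        "--table", table,
        "--export-dir", export_dir,
        # State the delimiter explicitly so it matches how the HDFS files were written.
        "--input-fields-terminated-by", field_delim,
    ]
    if batch:
        # Batch mode groups the underlying statements instead of sending them one by one.
        cmd.append("--batch")
    return cmd


if __name__ == "__main__":
    cmd = build_sqoop_export(
        "jdbc:sqlserver://dbhost:1433;databaseName=reports",  # placeholder server and database
        "daily_stats",                                        # placeholder table
        "/user/hive/warehouse/daily_stats",                   # placeholder HDFS directory
        "etl_user",                                           # placeholder login
    )
    print(shlex.join(cmd))
```

Passing the list to subprocess.run would execute it on a machine with Sqoop and the SQL Server JDBC driver installed. Credentials are left out on purpose.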
20. As Advertised - Hive
• Very similar to SQL
• “Quick” data analysis
– Results without crippling your existing RDBMS
• HBase storage handler
– Provides an easy point of entry to data and data manipulation
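As an illustration of the storage handler, the sketch below renders the HiveQL that maps a Hive table onto an existing HBase table. The handler class and the two property names come from Hive's HBase integration; the table, column, and column-family names are invented for the example, and the helper is hypothetical rather than anything shown in the talk.

```python
def hbase_table_ddl(hive_table, hbase_table, key_col, columns):
    """Render HiveQL for a Hive table backed by an HBase table.

    columns is a list of (hive_column, hive_type, "family:qualifier") tuples.
    """
    col_defs = [f"{key_col} STRING"] + [f"{name} {typ}" for name, typ, _ in columns]
    # ":key" maps the first Hive column to the HBase row key.
    mapping = ",".join([":key"] + [hbase_col for _, _, hbase_col in columns])
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({', '.join(col_defs)})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}');"
    )


if __name__ == "__main__":
    # Invented names: a Hive view of an HBase table with one column family "d".
    print(hbase_table_ddl(
        "ad_stats", "ad_stats_hbase", "ad_id",
        [("keyword", "STRING", "d:keyword"), ("impressions", "BIGINT", "d:impressions")],
    ))
```

Once such a table is defined, ordinary HiveQL SELECT statements read the HBase data directly.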
21. Final Thoughts
• Don’t overthink it!
– Just because you can doesn’t mean you should
• Modularity
– Easy to be overwhelmed by all the moving parts
– Flatten the learning curve by taking it one piece at a time