This document summarizes Monsanto's experiences using Hadoop for big data analytics in agriculture. Hadoop allows Monsanto to store and analyze large volumes of genomic, yield, and sensor data to increase crop yields. Lessons learned include starting small with a focus on business problems, using Hadoop for ETL pipelines and long-term storage, and addressing security, backup, and data management challenges as Hadoop and its ecosystem continue to evolve.
Harvesting Big Data in Agriculture with Hadoop
1. Harvesting Big Data in Agriculture
Experiences with Hadoop
Erich Hochmuth
R&D IT Big Data & Analytics Lead
erich.hochmuth@monsanto.com
2. Monsanto Serves Farmers Around the World
Working With Growers Large and Small, Row Crops and Vegetables
3. Our Approach to Driving Yield
A System of Agriculture Working Together to Boost Productivity
BREEDING: the art and science of combining genetic material to produce a new seed
BIOTECHNOLOGY: the science of improving plants by inserting genes into their DNA
AGRONOMICS: the farm management practices involved in growing plants
4. Increasing Yield through Big Data
At the Cornerstone of Yield Increases is Information & Analytics
Increased Yield
Variety
• Raw sequence data
• Unstructured sensor data
• Relational yield data
• Poly-structured genomic data
• Spatial data
• Satellite imagery
Volume
• PBs of NGS data
• 10s of TBs of genomic data
• TBs of yield data
• Billions of genotyping data points
Velocity
• 10s of millions of yield data points/day
• 100s of millions of genotyping data points/day
• TBs of NGS data/week
5. Why Hadoop?
• Focus on solving the business problem & not building IT solutions
• Commodity solution for the easy (data parallel) stuff
• Remove the hand-off between developers & strategic scientists
• Cost to generate & store data continues to decrease
• Eliminate the constant churn of scaling existing solutions
• Cost effective incremental platform expansion
6. Hadoop as an ETL Platform
Pipeline: scientific instrumentation → data processing → summarized results
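The ETL pattern above (raw instrument output in, summarized results out) maps naturally onto a Hadoop Streaming job. A minimal sketch, not Monsanto's actual pipeline: the tab-separated record format (`plot_id<TAB>yield`) and the per-plot averaging are assumptions chosen for illustration. With Hadoop Streaming, the mapper and reducer would run as separate processes reading stdin and writing stdout; here they are chained locally.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map raw yield records ("plot_id<TAB>yield") to (plot_id, yield) pairs."""
    for line in lines:
        plot_id, value = line.rstrip("\n").split("\t")
        yield plot_id, float(value)

def reducer(pairs):
    """Reduce sorted (plot_id, yield) pairs to a per-plot average yield."""
    for plot_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        yield plot_id, sum(values) / len(values)

if __name__ == "__main__":
    # Locally, pipe raw records through mapper and reducer in one process.
    for plot_id, avg in reducer(mapper(sys.stdin)):
        print(f"{plot_id}\t{avg:.2f}")
```

In a real deployment the framework performs the sort/shuffle between the two phases; the summarized output would then land in HDFS or be exported to a relational store for downstream analysis.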
7. Hadoop as a Queryable Archive
Historic data moves into long-term storage, where it remains queryable for discovery
12. Hadoop Implementation/Deployment
• It Takes a Team
• Practice makes perfect
• Fit into existing processes or
standards when possible
– Deviate when necessary
• Know your use case!
• Capacity Planning
• Start small & build on success
13. Hadoop Security
• Research data is intellectual property (IP)
• Hadoop is system of record for some data
• Spent 6 weeks configuring Hadoop security
– Sought outside help
– Successful installation not consistently reproducible
– Support inconsistent across ecosystem
• Adopted a more traditional Hadoop security approach
• HTTP edge services augmented with corporate single sign-on
• Integrated into corporate LDAP
• Revisit when Hadoop security becomes stable
14. Backup & Restore
• Doesn’t Hadoop have built-in replication? (Replication protects against hardware failure, not accidental deletion or corruption)
• Requirements
– Backup HBase & HDFS
– Weekly full backups
– Daily incremental
– Offsite data & retain for 60 days
• Rolled our own
– Dedicated backup cluster
– DistCp data to backup cluster
– Copy data via Fuse-DFS to tape
– Manual restore & merge
• Considering replicating to offsite DR cluster
– No more tape backups!
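The weekly-full/daily-incremental scheme above reduces to deciding which files to copy on each run. A hypothetical sketch under the assumption that backup state is tracked as a mapping of file path to modification time; on a real cluster the listing would come from HDFS rather than a local dict.

```python
def select_for_backup(current, last_backup, full=False):
    """Return the sorted list of file paths to copy on this backup run.

    current: dict mapping file path -> modification timestamp from the
             latest filesystem listing.
    last_backup: the same dict captured on the previous run.
    A full backup copies everything; an incremental copies only files
    that are new or modified since the last run.
    """
    if full or not last_backup:
        return sorted(current)
    return sorted(
        path for path, mtime in current.items()
        if path not in last_backup or mtime > last_backup[path]
    )
```

DistCp's `-update` option applies roughly this comparison between source and target clusters (using file size and checksum rather than timestamps), which is one reason a dedicated backup cluster pairs well with it.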
15. Data Management… or Lack Thereof!
• Current Approach
– Data grouped into subject areas
– Utilize HDFS Quotas
– Access controlled through AD groups
– Supplement with governance & process
• Needs
– Publish & share known schemas
– Common schema across tool set
– Fine grained authorization
– Monitoring/alerting of data access
– Track data lineage
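The quota-per-subject-area approach above implies a monitoring step: flag areas approaching their limit before writes start failing. A hypothetical sketch of that check; actual quota enforcement happens in the NameNode (set via `hdfs dfsadmin -setSpaceQuota`), and the subject-area names and threshold here are illustrative.

```python
def quota_report(usage, quotas):
    """Compare per-subject-area usage (bytes) against configured quotas.

    Returns a dict mapping subject area -> fraction of quota consumed,
    so areas nearing their limit can be surfaced to the data owners.
    """
    return {area: usage.get(area, 0) / quotas[area] for area in quotas}

def over_threshold(report, threshold=0.9):
    """Subject areas consuming more than `threshold` of their quota."""
    return sorted(area for area, frac in report.items() if frac > threshold)
```

A nightly job feeding `over_threshold` into the alerting channel is one lightweight way to supply the monitoring capability listed under "Needs" without waiting for finer-grained platform support.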
16. Conclusion
• Enterprise ready?
• Support?
– Open Source Community
• Documentation
– Missouri is “The Show Me State”
• Evolving third party support
• Hadoop resources in the Midwest?
• Know your use case!
17. Thank you!
We are hiring!
erich.hochmuth@monsanto.com