This document summarizes Monsanto's experiences using Hadoop for big data analytics in agriculture. Hadoop allows Monsanto to store and analyze large volumes of genomic, yield, and sensor data to increase crop yields. Lessons learned include starting small with a focus on business problems, using Hadoop for ETL pipelines and long-term storage, and addressing security, backup, and data management challenges as Hadoop and its ecosystem continue to evolve.
Harvesting Big Data in Agriculture with Hadoop
1. Harvesting Big Data in Agriculture
Experiences with Hadoop
Erich Hochmuth
R&D IT Big Data & Analytics Lead
erich.hochmuth@monsanto.com
2. Monsanto Serves Farmers Around the World
Working With Growers Large and Small, Row Crops and Vegetables
3. Our Approach to Driving Yield
A System of Agriculture Working Together to Boost Productivity
BREEDING: the art and science of combining genetic material to produce a new seed
BIOTECHNOLOGY: the science of improving plants by inserting genes into their DNA
AGRONOMICS: the farm management practices involved in growing plants
4. Increasing Yield through Big Data
At the Cornerstone of Yield Increases is Information & Analytics
Increased Yield
Variety
• Raw sequence data
• Unstructured sensor data
• Relational yield data
• Poly-structured genomic data
• Spatial data
• Satellite imagery
Volume
• PBs of NGS data
• 10s of TBs of genomic data
• TBs of yield data
• Billions of genotyping data points
Velocity
• 10s of millions of yield data points/day
• 100s of millions of genotyping data points/day
• TBs of NGS data/week
5. Why Hadoop?
• Focus on solving the business problem & not building IT solutions
• Commodity solution for the easy (data parallel) stuff
• Remove the hand-off between developers & strategic scientists
• Cost to generate & store data continues to decrease
• Eliminate the constant churn of scaling existing solutions
• Cost effective incremental platform expansion
6. Hadoop as an ETL Platform
Pipeline: scientific instrumentation → data processing → summarized results
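The ETL pattern above (raw instrument output in, summarized results out) maps naturally onto a Hadoop Streaming job. A minimal sketch, not Monsanto's actual pipeline: the tab-separated record format (`plot_id<TAB>yield`) and the per-plot averaging are assumptions chosen for illustration. With Hadoop Streaming, the mapper and reducer would run as separate processes reading stdin and writing stdout; here they are chained locally.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map raw yield records ("plot_id<TAB>yield") to (plot_id, yield) pairs."""
    for line in lines:
        plot_id, value = line.rstrip("\n").split("\t")
        yield plot_id, float(value)

def reducer(pairs):
    """Reduce sorted (plot_id, yield) pairs to a per-plot average yield."""
    for plot_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        yield plot_id, sum(values) / len(values)

if __name__ == "__main__":
    # Locally, pipe raw records through mapper and reducer in one process.
    for plot_id, avg in reducer(mapper(sys.stdin)):
        print(f"{plot_id}\t{avg:.2f}")
```

In a real deployment the framework performs the sort/shuffle between the two phases; the summarized output would then land in HDFS or be exported to a relational store for downstream analysis.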
7. Hadoop as a Queryable Archive
Historic data moves into long-term storage, where it remains queryable for discovery
12. Hadoop Implementation/Deployment
• It Takes a Team
• Practice makes perfect
• Fit into existing processes or
standards when possible
– Deviate when necessary
• Know your use case!
• Capacity Planning
• Start small & build on success
13. Hadoop Security
• Research data is intellectual property (IP)
• Hadoop is system of record for some data
• Spent 6 weeks configuring Hadoop security
– Sought outside help
– Successful installation not consistently reproducible
– Support inconsistent across ecosystem
• Adopted a more traditional Hadoop security approach
• HTTP edge services augmented with corporate single sign-on
• Integrated into corporate LDAP
• Revisit when Hadoop security becomes stable
14. Backup & Restore
• Doesn’t Hadoop have built-in replication? (Replication protects against hardware failure, not accidental deletion or corruption)
• Requirements
– Backup HBase & HDFS
– Weekly full backups
– Daily incremental
– Offsite data & retain for 60 days
• Rolled our own
– Dedicated backup cluster
– DistCp data to backup cluster
– Copy data via Fuse-DFS to tape
– Manual restore & merge
• Considering replicating to offsite DR cluster
– No more tape backups!
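The weekly-full/daily-incremental scheme above reduces to deciding which files to copy on each run. A hypothetical sketch under the assumption that backup state is tracked as a mapping of file path to modification time; on a real cluster the listing would come from HDFS rather than a local dict.

```python
def select_for_backup(current, last_backup, full=False):
    """Return the sorted list of file paths to copy on this backup run.

    current: dict mapping file path -> modification timestamp from the
             latest filesystem listing.
    last_backup: the same dict captured on the previous run.
    A full backup copies everything; an incremental copies only files
    that are new or modified since the last run.
    """
    if full or not last_backup:
        return sorted(current)
    return sorted(
        path for path, mtime in current.items()
        if path not in last_backup or mtime > last_backup[path]
    )
```

DistCp's `-update` option applies roughly this comparison between source and target clusters (using file size and checksum rather than timestamps), which is one reason a dedicated backup cluster pairs well with it.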
15. Data Management… or Lack Thereof!
• Current Approach
– Data grouped into subject areas
– Utilize HDFS Quotas
– Access controlled through AD groups
– Supplement with governance & process
• Needs
– Publish & share known schemas
– Common schema across tool set
– Fine grained authorization
– Monitoring/alerting of data access
– Track data lineage
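The quota-per-subject-area approach above implies a monitoring step: flag areas approaching their limit before writes start failing. A hypothetical sketch of that check; actual quota enforcement happens in the NameNode (set via `hdfs dfsadmin -setSpaceQuota`), and the subject-area names and threshold here are illustrative.

```python
def quota_report(usage, quotas):
    """Compare per-subject-area usage (bytes) against configured quotas.

    Returns a dict mapping subject area -> fraction of quota consumed,
    so areas nearing their limit can be surfaced to the data owners.
    """
    return {area: usage.get(area, 0) / quotas[area] for area in quotas}

def over_threshold(report, threshold=0.9):
    """Subject areas consuming more than `threshold` of their quota."""
    return sorted(area for area, frac in report.items() if frac > threshold)
```

A nightly job feeding `over_threshold` into the alerting channel is one lightweight way to supply the monitoring capability listed under "Needs" without waiting for finer-grained platform support.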
16. Conclusion
• Enterprise ready?
• Support?
– Open Source Community
• Documentation
– Missouri is “The Show Me State”
• Evolving third party support
• Hadoop resources in the Midwest?
• Know your use case!
17. Thank you!
We are hiring!
erich.hochmuth@monsanto.com