This document outlines Canburak Tümer's education and work experience, and provides an agenda for a presentation on big data, NoSQL, Hadoop, HDFS, MapReduce, management tools, data access tools, and data processing/mining tools. It introduces Hadoop as an open source big data platform started by Yahoo developers, and discusses its main components including HDFS for storage, MapReduce for processing, and how Spark is replacing MapReduce.
- Open source big data platform
- Started by developers from Yahoo!
- Two main distributors now: Cloudera and Hortonworks
- Both storage and processing
- HDFS for storage
- MapReduce for processing
- The Spark engine is steadily replacing MapReduce
3 Vs
Volume: Big data is BIG, usually ranging from several GB to several TB.
Velocity: Data flows fast, like the web server logs of Google and Facebook or the CDR logs of telecom firms.
Variety: Data is not only text or numbers; it also includes pictures, videos, weblogs, etc.
Additional 2 Vs
Veracity: This one is about data quality and reliability.
Value: This is the aim of processing big data: finding the diamond in the coal mine. (Coal: 344 TL/tonne; diamond: 29.330 TL/carat (0.2 g), color D, clarity VS1.)
CDRDM: more than 250M records per table per day, used only for operational reports.
MINING,
NoSQL does not mean "no to SQL". It stands for «not only SQL», meaning a new way to store and access data.
Most NoSQL DBs are key-value stores that scale horizontally, and their records do not have to share an identical structure as they do in RDBs.
There are also graph DBs, mostly used for social networks.
The best known are HBase, Cassandra and MongoDB.
MongoDB is fast but not reliable. It is used by Twitter, etc.
Cassandra is used by GitHub, eBay, Instagram, Netflix, etc.
HBase is an open source implementation of Google Bigtable.
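A minimal sketch of the key-value idea from the notes above, assuming a hypothetical in-memory store split over a fixed number of shards. The class name `TinyKeyValueStore` and the hash-based routing are illustrative only and are not the API of HBase, Cassandra or MongoDB. It shows that records under different keys need not share the same fields, and that keys can be spread over several nodes.

```python
import hashlib


class TinyKeyValueStore:
    """Hypothetical in-memory key-value store spread over several shards."""

    def __init__(self, shard_count=3):
        # Each shard stands in for one machine in a horizontally scaled cluster.
        self.shards = [{} for _ in range(shard_count)]

    def _shard_for(self, key):
        # A stable hash decides which shard owns a key.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.shards)

    def put(self, key, record):
        self.shards[self._shard_for(key)][key] = record

    def get(self, key):
        return self.shards[self._shard_for(key)].get(key)


store = TinyKeyValueStore()
# Records do not have to share a schema, unlike rows in a relational table.
store.put("user:1", {"name": "Ada", "email": "ada@example.com"})
store.put("user:2", {"name": "Linus", "followers": 42, "tags": ["kernel"]})

print(store.get("user:1"))
print(store.get("user:2"))
```

Adding capacity in this toy model would mean adding shards; real systems also have to rebalance existing keys, which this sketch leaves out.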
Hadoop is named after the plush toy elephant of its first engineer's son.
It was started after Google's GFS and MapReduce papers.
It was first implemented at Yahoo. A team from Yahoo later spun off to found Hortonworks, while Cloudera was founded by engineers from Google, Yahoo, Facebook and Oracle.
The Secondary NameNode keeps a checkpoint copy of the NameNode's metadata but cannot serve as the NameNode on failure. (Newer Hadoop versions add a standby NameNode that can take over.)
The client only knows about the NameNode; the rest of the structure is invisible to the user.
DataNodes hold the file chunks (blocks).
It is the NameNode's duty to decide how files are split into blocks and which DataNodes store them.
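The split-and-place step described above can be sketched as a toy planner. This assumes a 128 MB block size and a replication factor of 3, and uses plain round-robin placement, which is a simplification of what a real NameNode does. The helper name `plan_blocks` and the node names are hypothetical.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size: 128 MB
REPLICATION = 3                 # assumed replication factor


def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, [datanodes]) tuples.

    Hypothetical helper: placement is plain round-robin, not the
    placement policy of a real NameNode.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    plan = []
    offset = 0
    index = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        # Pick `replication` distinct nodes, rotating the start node per block.
        holders = [datanodes[(index + r) % len(datanodes)] for r in range(replication)]
        plan.append((index, length, holders))
        offset += length
        index += 1
    return plan


nodes = ["dn1", "dn2", "dn3", "dn4"]
# A 300 MB file becomes two full 128 MB blocks plus one 44 MB block.
for block in plan_blocks(300 * 1024 * 1024, nodes):
    print(block)
```

Because every block lives on several DataNodes, losing one machine in this model still leaves two copies of each block it held.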
Distributed processing library
The map function runs on every input record, and reduce tasks running across the cluster's machines consolidate the map results.
Between the steps there are intermediate files, which are cleaned up after the whole process completes.
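The map, intermediate and reduce steps above can be sketched in plain Python as a single-process word count. This is only an illustration of the programming model, not the Hadoop API; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and the in-memory grouping stands in for the intermediate files.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group intermediate values by key; stands in for the intermediate files."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Consolidate all values emitted for one key."""
    return key, sum(values)


records = ["big data is big", "data flows fast"]

intermediate = []
for record in records:
    intermediate.extend(map_phase(record))

word_counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

In a real cluster the map calls run in parallel on the machines that hold the data, and each reducer receives only the keys assigned to it.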
Ambari is another management and monitoring tool for Hadoop clusters.
ZooKeeper: configuration and management tool for distributed systems, also known as a coordination service.
Oozie: scheduling manager for Hadoop, also known as a flow design tool.
YARN (Yet Another Resource Negotiator), also known as MapReduce 2.0.
Pig: scripting language for querying data. Scripts start map and reduce jobs.
Hive Query Language is a SQL-like query language that supports most ANSI SQL functions. Queries start map and reduce jobs.
Faster alternative to Hive. Works directly on HDFS.
Sqoop: bulk data transfer tool between Hadoop and RDBMSs.
Flume: a tool for gathering streaming data such as log files.
Nutch is a web crawler for Hadoop. It started even before Google published its papers and later moved onto Hadoop. It is based on Apache Lucene and runs either standalone or distributed.
Mahout: machine learning library for Hadoop. It started as a library for MapReduce and now also supports Spark.
Spark: a new distributed processing platform that can be used together with HDFS. It runs in memory; in-memory performance is claimed to be up to 100x better than MapReduce, and still about 10x better when data spills to disk. Development started in 2009 at UC Berkeley, and it became an Apache project in 2013.
Storm: real-time computation engine. Hadoop processes data in batches; Storm instead processes it in real time.
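The batch versus real-time distinction can be sketched in plain Python. This is not Storm's API (Storm topologies are built from spouts and bolts); the function names `batch_count` and `streaming_count` and the sample events are hypothetical. The batch version returns one result after seeing all the data, while the streaming version emits an updated result as each event arrives.

```python
def batch_count(events):
    """Batch style: wait for the whole data set, then compute one result."""
    counts = {}
    for event in events:
        counts[event] = counts.get(event, 0) + 1
    return counts


def streaming_count(event_stream):
    """Streaming style: yield an updated result as each event arrives."""
    counts = {}
    for event in event_stream:
        counts[event] = counts.get(event, 0) + 1
        yield event, counts[event]


events = ["login", "click", "click", "logout"]

print(batch_count(events))
for event, running_total in streaming_count(iter(events)):
    print(event, running_total)
```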