Evolution of Big Data at Intel - Crawl, Walk and Run Approach
Gomathy Bala | Director
Chandhu Yalla | Manager & Architect
Key Contributors: Sonja Sandeen, Seshu Edala, Nghia Ngo and Darin Watson
IT BI Big Data Team
Stream Processing or Complex Event Processing -- where small chunks of data arrive at rapid intervals [smaller quantum, requiring transformation]. E.g. sensor data from manufacturing floors.
Batch Processing -- aggregated chunks of data, perhaps collected over a long span, waiting to be analyzed in one run. OLAP processing. E.g. Gold path analysis on intel.com
In-memory processing -- running interactive analytics over large batches of summary/factual data by leveraging the memory as the pre-emptive transient store. E.g. SQL aggregates/operational metrics from OLAP process
Machine Learning -- class of unsupervised and supervised learning techniques destined for a decision support or an expert system
Unsupervised Learning (no "response" variable, just observations) -- tools: Mahout
Clustering -- E.g. customer segmentation; clustering users by age, ethnicity, gender, income standards, geo, profession, and buying propensity to new form factors.
Frequent pattern mining -- E.g. co-branding strategies. People who buy RealSense cameras also download Intel XDK kits within 7 days of purchase.
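As a rough illustration of frequent pattern mining, the sketch below counts how often pairs of items occur together and keeps the pairs above a support threshold. It is plain Python rather than Mahout, and the basket data, item names, and the `frequent_pairs` helper are hypothetical, made up for this example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical baskets: the items one customer acquired within a time window.
baskets = [
    {"realsense_camera", "xdk_kit"},
    {"realsense_camera", "xdk_kit", "nuc"},
    {"realsense_camera"},
    {"nuc", "xdk_kit"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs whose support (share of baskets holding both) >= min_support."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    n = len(baskets)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

pairs = frequent_pairs(baskets, min_support=0.5)
```

A production miner (Apriori or FP-Growth, for example) would extend the same idea to larger item sets and prune candidates early.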
Supervised Learning [predicting a "response" variable when encountering a new "condition"; the response patterns are learned from prior training sets, of course…] -- tools: H2O
Regression -- E.g. YoY growth for DCG Xeon co-processor shipments was 16% between 2011 and 2014. This year, we will ship 36 million units; current inventory levels are at 23 million.
Classification -- E.g. Customer (Widgets Inc) responses to email automation and phone calls were favorable in the last 3 months. Last upgrade was 2 years ago. The likelihood of an enterprise upgrade is "high".
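To make the regression idea concrete, here is a minimal sketch of an ordinary least-squares line fit used to extrapolate one year ahead. It uses plain Python rather than H2O, and the yearly figures are hypothetical placeholders, not the shipment numbers quoted above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training set: year -> units shipped (millions).
years = [2011, 2012, 2013, 2014]
units = [10.0, 12.0, 14.0, 16.0]

slope, intercept = fit_line(years, units)
forecast_2015 = slope * 2015 + intercept  # predicted "response" for a new "condition"
```

Growth that compounds at a fixed percentage is geometric rather than linear, so in practice one might fit the line to the logarithm of the units instead.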
Textual -- class of algorithms that "derive" meaning from what is otherwise flat left-to-right, top-to-bottom "text". Shred sentence structure into nouns, verbs, adjectives and adverbs; count entities and turn "text" into "terms" [features]. Encode the features into a term-document or a "graph" representation so that traditional analytics and machine learning (supervised and unsupervised techniques) may be applied. Lucene and Solr are useful for indexing/tokenizing text; NLTK or the Stanford parsers are useful to "tag" terms with a class of linguistic tokens such as nouns and verbs. E.g. identify service management tickets that entail Windows 8.1 issues.
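The step of turning "text" into "terms" can be sketched as a tiny term-document matrix. This is a simplified stand-in for what Lucene/Solr or NLTK would do, written in plain Python; the ticket texts, the regular expression, and the helper names are assumptions for illustration only.

```python
import re
from collections import Counter

# Hypothetical service tickets.
docs = {
    "ticket-1": "Windows 8.1 update fails on laptop",
    "ticket-2": "VPN drops after Windows 8.1 update",
    "ticket-3": "Printer driver missing",
}

def tokenize(text):
    """Lower-case the text and split it into terms, keeping version numbers like 8.1."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def term_document_matrix(docs):
    """Return (vocabulary, {doc_id: [term counts aligned with the vocabulary]})."""
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    matrix = {doc_id: [c[term] for term in vocab] for doc_id, c in counts.items()}
    return vocab, matrix

vocab, matrix = term_document_matrix(docs)

# Tickets whose feature rows contain both "windows" and "8.1".
win, ver = vocab.index("windows"), vocab.index("8.1")
matching = [doc_id for doc_id, row in matrix.items() if row[win] and row[ver]]
```

Each row of the matrix is a feature vector, so the clustering or classification techniques listed earlier can be applied to it directly.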
Log -- Logs are textual in syntax but do not possess linguistic rigor. Such content is useful when simply indexed as-is and searched. The machines do not "decode" meaning. Humans synthesize and add logical rules when the content is surfaced back via a search interface. E.g. Logstash used to monitor errors in log4j logs of Hive jobs.
Spatial -- class of problems that deal with the spatial layout of entities. E.g. every die is sacred: rationing and allocating sub-systems on a die via simulation techniques to minimize wastage and maximize "premium" quality; or optimizing lithographic etches to minimize orthogonal cuts by employing space-filling heuristics.
Statistical -- class of problems that infer patterns from data that exhibits stochastic characteristics -- e.g. computing aggregations like stddev, min, max, avg of the yields of a graphics die; and performing outlier analysis.
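A minimal sketch of these aggregates and a simple z-score outlier check, using Python's standard `statistics` module. The yield values and the threshold of 2 standard deviations are hypothetical choices for illustration.

```python
import statistics

# Hypothetical per-lot yields of a die (fraction of good units).
yields = [0.91, 0.93, 0.92, 0.94, 0.90, 0.92, 0.93, 0.55]

def summarize(values):
    """Basic aggregates: min, max, average, and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "avg": statistics.mean(values),
        "stddev": statistics.stdev(values),
    }

def outliers(values, z_threshold=2.0):
    """Values lying more than z_threshold sample standard deviations from the mean."""
    mean = statistics.mean(values)
    stddev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / stddev > z_threshold]

summary = summarize(yields)
flagged = outliers(yields)
```

With very small samples a z-score test is weak, because a single extreme value inflates the standard deviation it is measured against; median-based rules are a common alternative.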
Numerical -- class of problems that deal with data that exhibits deterministic characteristics -- e.g. Taguchi methods or iterative Monte Carlo methods that search for global minima/maxima; genetic algorithms, deep learning methods/neural networks, etc.
Time-series -- class of problems that deal with data that exhibits stochasticity, but also exhibits temporal/seasonal resonance patterns. E.g. noise-cancellation filters that employ feedback loops; or predicting stock-price movement etc
Graph -- class of problems that compute statistics about entities connected to other entities. E.g. computing pagerank/link-popularity of a web page, congestion patterns of a traffic flow, sewage system planning etc
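The PageRank example can be sketched as a short power iteration over a small link graph. This is a simplified, single-machine version; the page names, damping factor of 0.85, and iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [pages it links to]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-page site.
links = {"home": ["products", "support"], "products": ["home"], "support": ["home"]}
ranks = pagerank(links)
```

Here `home` ends up with the highest rank because both other pages link to it, while `products` and `support` share the remainder equally.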
Storage Models
Textual/Binary -- No DDL. All data is stored row-first, column-next where there is only one BLOB column per row. E.g. Zip files, mainframes
Relational -- well specified DDL, but data is stored row-first [co-located fields of a row]; locking semantics at row level. Yields faster entity retrievals but poorer compression ratios when heterogeneous fields co-exist in data. The index is built for row-offsets. E.g. Oracle, MySQL
Columnar -- well specified DDL; but data is stored column-first [all first names are co-stored in one file, last names co-stored in another, etc.]; locking semantics at cell level. Yields faster aggregates [min, max on a single field] and better compression ratios [because all fields of a columnar file are of a homogeneous type]. But lacks atomic consistency because a record change translates into mutations in multiple "columnar/co-location" files. E.g. HBase, Cassandra
Hierarchy -- well specified structural definition. Mostly follows a denormalized parent-child taxonomy. All fields relevant to a record are stored as a "hierarchic document" a la XML or JSON document. Yields a great consistency model because the grain of the data is a "document". Any mutation will always mean a complete denormalized update of the full document -- json or xml. E.g. MongoDB, CouchDB
GraphDB -- native adjacency property graph that stores entities as "vertices" of a graph, relations as "edges", and attributes as "properties". Since indices are combinatorially developed on all -- entities, relations, and attributes -- adjacency mining, filtering, mutations are performant and atomic. E.g. Neo4J, TitanDB
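The row-first versus column-first distinction described under Storage Models can be sketched with in-memory Python structures. This only illustrates the layout trade-off and is not how any of the named databases store data; the records and field names are hypothetical.

```python
# Row-first layout: all fields of one record sit together.
rows = [
    {"first": "Ada", "last": "Lovelace", "age": 36},
    {"first": "Alan", "last": "Turing", "age": 41},
    {"first": "Grace", "last": "Hopper", "age": 85},
]

def to_columns(rows):
    """Column-first layout: one homogeneous list per field."""
    return {field: [row[field] for row in rows] for field in rows[0]}

columns = to_columns(rows)

# An aggregate on a single field scans one list in the columnar layout.
max_age = max(columns["age"])

# Retrieving a whole entity is one lookup in the row layout...
record = rows[1]

# ...but has to touch every column list in the columnar layout.
rebuilt = {field: values[1] for field, values in columns.items()}
```

The last step also hints at the consistency caveat above: changing one record in the columnar layout means mutating several separate lists.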
SLIDE PURPOSE: Who Are We … we are the IT organization at Intel (IT@Intel). Core background information on Intel IT and our mission/goals/capabilities.
Key Messages:
We are the IT organization Inside Intel’s Business.
Our organization is a large, diverse, multinational enterprise with a wide variety of operational requirements and needs.
Our Vision is to accelerate Intel’s quest to connect and enrich the lives of every person on Earth by the end of the decade.
Our Mission is to Grow Intel’s Business through Information Technology by facilitating IT Consumerization, delivering IT efficiency and continuity through Cloud Computing, increasing employee productivity through seamless connectivity and Security, providing significant business value through Business Intelligence initiatives, and driving increased collaboration through Social Computing.
Review some of the Information/Key Stats shown here.
Size and Location: 6,334 IT employees … Supporting over 98,000 employees.
Note: Intel IT only reflects the number of employees we support directly (we exclude Intel employees who support wholly owned subsidiaries). Remote Support is Vital.
Data Centers and Facilities: 59 Data Centers worldwide (down from 142 in 2007)
[Need to confirm this data: ~55,000 servers (down from 100,000 in 2007), consuming a large electrical and power/cooling load (roughly 55 MW total power). Our Data Centers also support 300M email messages per month and >2,183 terabytes of WAN traffic per month.]
The Data Centers provide 45 petabytes of raw storage capacity.
Employee / Client Technology: Support over 147K devices (note the >1 device-per-employee ratio; this ratio is growing with support of BYO and custom technology delivery to meet business needs).
We have had 80%+ mobile PCs (laptops) as our core employee technology standard since 1997. We have been actively evaluating, enabling and supporting many companion devices for improved productivity and flexibility.
>43,200 handhelds (a variety of form factors (phones/tablets), vendors, software and solutions); the majority of these devices are now EMPLOYEE OWNED.
Intel IT continues to embrace consumerization of IT and mobile applications are a major component of our strategy. We have delivered 57 mobile apps and counting to support new form factors. Our goal is to deliver a seamless, secure experience for our employees across a wide spectrum of devices by putting user experience first.
Enabled Leadership Business Capabilities:
Enable a top 25 supply chain (recognized by Gartner, previously AMR Research): #25 in 2009, #18 in 2010, #16 in 2011, #7 in 2012 and #5 in 2013. A key focus for IT innovation … delivered solid business results and competitive differentiation for Intel.
Additional fun facts …
100% of Intel laptops support SSDs and 100% are deployed with disk encryption.