Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims to provide easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including their temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a large number of small (low-consumption) nodes. This talk covers the decisions made during the design and development of the platform, from both a technical and a functional perspective. It introduces the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism for analytics based on a declarative filter/extraction specification. The design choices are illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory Research
1. A Big Data Refinery Built on HBase
Stanislav Barton
Internet Memory Research
2. A Content-oriented platform
• Mignify = a content-oriented platform which
– Continuously (or almost continuously) ingests Web documents
– Stores and preserves these documents as such AND
– Produces structured content, extracted (from single documents) and aggregated (from groups of documents)
– Organizes raw documents and the extracted and aggregated information in a consistent information space.
=> Original documents and extracted information are uniformly stored in HBase
3. A Service-oriented platform
• Mignify = a service-oriented platform
– Physical storage layer on “custom hardware”
– Crawl on-demand, with sophisticated navigation options
– Able to host third-party extractors, aggregators and classifiers
– Run new algorithms on existing collections as they arrive
– Supports search, navigation, on-demand query and extraction
=> A “Web Observatory” built around Hadoop/HBase.
4. Customers/Users
• Web Archivists – store and organize, live search
• Search engineers – organize and refine, big throughput
• Data miners – refine, secondary indices
• Researchers – store, organize and/or refine
5. Talk Outline
• Architecture Overview
• Use Cases/Scenarios
• Data model
• Queries / Query language / Query processing
• Usage Examples
• Alternative HW platform
7. Typical scenarios
• Full text indexers
– Collect documents, with metadata and graph information
• Wrapper extraction
– Get structured information from web sites
• Entity annotation
– Annotate documents with entity references
• Classification
– Aggregate subsets (e.g., domains); assign them to topics
8. Data in HBase
• HBase as a first choice data store:
– Inherent versioning (timestamped values)
– Real time access (Cache, index, key ordering)
– Column oriented storage
– Seamless integration with Hadoop
– Big community
– Production ready/mature implementation
9. Data model
• Data stored in one big table along with metadata and extraction results
• Though separated into column families
– CFs as secondary indices
• Raw data stored in HBase (< 10 MB; HDFS otherwise)
• Data stored as rows (versioned; see the sketch below)
• Values are typed
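To make the row layout concrete, here is a minimal sketch (HBase client API of that era, 0.92/0.94) of writing one crawled resource as a versioned row; the table name, the row-key choice and the "content" column family are assumptions, while "meta"/"detectedMime" match the filter example later in the talk.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreResource {
  // Writes one crawled resource as a single row; the crawl time is used as the
  // cell timestamp, so HBase versioning directly captures the temporal aspect.
  public static void store(String url, long crawlTime, byte[] payload, String mime)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "web_collection");                  // assumed table name
    Put put = new Put(Bytes.toBytes(url), crawlTime);                   // row key = URL (assumption)
    put.add(Bytes.toBytes("content"), Bytes.toBytes("payload"), payload);               // raw document (< 10 MB)
    put.add(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"), Bytes.toBytes(mime)); // metadata CF
    table.put(put);
    table.close();
  }
}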
10. Types and schema, where and why
• Initially, our collections consist of purely unstructured data
• The whole point of Mignify is to produce a backbone of structured information
• This is done through an iterative process which progressively builds a rich set of interrelated annotations
• Typing is important to ensure the safety of the whole process and to enable automation
11. Data model II
(Mapping of the Mignify data model onto HBase storage:)
• Mignify level:
– Attribute A = <CF, Qualifier>:Type; schema types are Writable or byte[]
– Resource: Version1:(A1 .. Ak), Version2:(Ak+1, … Am), …
• HBase level:
– Versions: <t’, {V1, … Vk}>, <t’’, {Vk+1, … Vm}>, … <t’’’, {Vm+1, … Vl}>
– HTable cells: <(CF1, Qa, t’), v1>, <(CF1, Qb, t’’), v2>, … <(CFn, Qz, t’’’), vm>, with values v1 … vm stored as byte[]
– HFiles: one store per column family (CF1, CF2, … CFn)
12. Extraction Platform
Main Principles
A framework for specifying data extraction from very large datasets
Easy integration and application of new extractors
A high level of genericity in terms of (i) data sources, (ii) extractors, and (iii) data sinks
An extraction process specification combines these elements
[currently] A single extractor engine
Based on the specification, data extraction is processed by a single, generic MapReduce job.
13. Extraction Platform
Main Concepts
Important: typing (we care about types and schemas!)
Input and Output Pipes
Declare the data source (e.g., an HBase collection) and the data sinks (e.g., HBase, HDFS, CSV, …)
Filters (Boolean operators that apply to input data)
Extractors
Take an input Resource, produce Features
Views
Combination of input and output pipes, filters and extractors (see the sketch below)
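A hypothetical sketch of how these concepts could look as Java types; the names (Resource, Feature, ResourceFilter, Extractor, View) and signatures are assumptions used for illustration, not the actual Mignify API.

import java.util.List;

class Resource { /* one row: raw document plus its metadata/annotation cells */ }
class Feature  { /* one typed annotation, i.e. a <CF, Qualifier>:Type value   */ }

// A Filter is a Boolean predicate deciding whether a Resource enters a View.
interface ResourceFilter {
  boolean accept(Resource r);
}

// An Extractor takes an input Resource and produces Features.
interface Extractor {
  List<Feature> extract(Resource r);
}

// A View combines input/output pipes, filters and extractors.
class View {
  String inputPipe;            // e.g. an HBase collection
  String outputPipe;           // e.g. HBase, HDFS, CSV
  ResourceFilter filter;
  List<Extractor> extractors;
}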
14. Data Queries
• Various data sources (HTable, data files,…)
• Projections using column families and qualifiers
• Selections by HBase filters
// OR-combination: a row passes if at least one filter in the list accepts it
FilterList ret = new FilterList(FilterList.Operator.MUST_PASS_ONE);
// keep rows whose meta:detectedMime cell equals "text/html"
Filter f1 = new SingleColumnValueFilter(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime"),
    CompareFilter.CompareOp.EQUAL, Bytes.toBytes("text/html"));
ret.addFilter(f1);
• Query results are either flushed to files or back to HBase -> materialized views
• Views are defined per collection as a set of pairs of Extractors (user-defined functions) and Filters (see the scan sketch below)
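For completeness, a short sketch of how the FilterList built above could be applied when scanning the collection; the table name and the projected family are assumptions, and each selected row would then be handed to the View's Extractors.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithFilter {
  public static void scan(Filter ret) throws Exception {        // ret = the FilterList from above
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "web_collection");           // assumed table name
    Scan s = new Scan();
    s.addFamily(Bytes.toBytes("meta"));                          // projection using a column family
    s.setFilter(ret);                                            // selection
    ResultScanner scanner = table.getScanner(s);
    for (Result row : scanner) {
      // apply the View's extractors to each selected row here
    }
    scanner.close();
    table.close();
  }
}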
15. Query Language (DML)
• For each employee with a salary higher than 2,000, compute the total costs
– SELECT f(hire_date, salary) FROM employees WHERE salary >= 2000
– f(hire_date, salary) = mon(today - hire_date) * salary
• For each web page, detect the mime type and language
– For each RSS feed, get a summary
– For each HTML page, extract the plain text
• Currently a wizard producing a JSON document
16. User functions
• List<byte[]> f(Row r)
• May calculate new attribute values, stored with the Row and reused by other functions
• Execution plan: order matters!
• Each function carries a description of its input and output fields
– Field dependencies give the order (see the sketch below)
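A hedged sketch of a user function with the List<byte[]> f(Row r) signature above; the Row wrapper, its accessor, and the field-declaration methods are assumptions made for illustration.

import java.util.Arrays;
import java.util.List;

// Assumed row wrapper giving access to already-stored / already-computed cells.
interface Row {
  byte[] get(String family, String qualifier);
}

class DetectMime {
  // Declared input/output fields: their dependencies give the execution order.
  public String[] inputFields()  { return new String[] { "content:payload" }; }
  public String[] outputFields() { return new String[] { "meta:detectedMime" }; }

  // Computes a new attribute value; it is stored with the Row so that later
  // functions (e.g. plain-text extraction) can reuse it.
  public List<byte[]> f(Row r) {
    byte[] payload = r.get("content", "payload");
    String head = payload == null ? ""
        : new String(payload, 0, Math.min(payload.length, 256));
    String mime = head.toLowerCase().contains("<html") ? "text/html"
                                                       : "application/octet-stream";
    return Arrays.asList(mime.getBytes());
  }
}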
17. Input Pipes
• Defines how to get the data
– Archive files, text files, HBase table
– Format
– The Mappers always receive a Resource on input; several custom InputFormats and RecordReaders
18. Output pipes
• Defines what to do with (where to store) the query result
– File, Table
– Format
– What columns to materialize
• Most of the time the PutSortReducer is used, so the OP defines the OutputFormat and what to emit
19. Query Processing
• With typed data, the Resource wrapper, and IPs and OPs – one universal MapReduce job to execute/process queries!
• Most efficient (for data insertion): a MapReduce job with a custom Mapper and the PutSortReducer
• Job init: build the combined filter; the IP and OP define the input and output formats
• Mapper set-up: initialize the plan of user-function applications and the functions themselves
• Mapper map: apply the functions on a row according to the plan; use the OP to emit values
• Not all combinations can leverage the PutSortReducer (writing to one table at a time, …) – a sketch of the bulk-load wiring follows below
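A minimal sketch of how the bulk-load path of this universal job might be wired, assuming the standard HBase bulk-load utilities of that era; ExtractionMapper and WarcInputFormat are placeholders for whatever classes the Input Pipe configures, and the table name is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ExtractionJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "mignify-extraction");             // one generic job for all views
    job.setInputFormatClass(WarcInputFormat.class);            // placeholder: chosen by the Input Pipe
    job.setMapperClass(ExtractionMapper.class);                // placeholder: applies filters + user functions per Resource
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);                     // map emits Puts ...
    HTable target = new HTable(conf, "web_collection");        // Output Pipe: the target table
    // ... so configureIncrementalLoad installs PutSortReducer, the total-order
    // partitioner and HFileOutputFormat for bulk loading.
    HFileOutputFormat.configureIncrementalLoad(job, target);
    FileOutputFormat.setOutputPath(job, new Path(args[0]));    // staging dir for the HFiles
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}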
21. Data Subscriptions / Views
• Data-in-a-View satisfaction can be checked at ingestion time, before the data is inserted
• Mimicking client-side co-processors – allowing the use of bulk loading (no coprocessors for bulk load at the moment)
• When new data arrives, user functions/actions are triggered
– On-demand crawls, focused crawls
22. Second Level Triggers
• User code run in the reduce phase (when ingesting):
– Put f(Put p, Row r)
– Previous versions are on the input of the code; it can alter the processed Put object before the final flush to the HFile
• Co-scanner: a user-defined scanner traversing the processed region, aligned with the keys in the created HFile
• Example: change detection on a re-crawled resource (sketched below)
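A hedged sketch of such a trigger for the change-detection example, following the Put f(Put p, Row r) signature; the Row wrapper and the CF/qualifier names are assumptions, and the 0.92-era Put/KeyValue API is assumed.

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

class ChangeDetectionTrigger {
  // Assumed wrapper exposing the previous version(s) of the row.
  interface Row {
    byte[] get(String family, String qualifier);
  }

  // Compares the newly crawled payload with the previous version and annotates
  // the Put before it is flushed to the HFile.
  public Put f(Put p, Row previous) {
    byte[] newPayload = payloadOf(p);
    byte[] oldPayload = previous == null ? null : previous.get("content", "payload");
    boolean changed = oldPayload == null || !Arrays.equals(newPayload, oldPayload);
    p.add(Bytes.toBytes("analysis"), Bytes.toBytes("changed"), Bytes.toBytes(changed));
    return p;
  }

  // Pulls the raw payload cell out of the Put (CF/qualifier names assumed).
  private static byte[] payloadOf(Put p) {
    List<KeyValue> kvs = p.getFamilyMap().get(Bytes.toBytes("content"));
    if (kvs == null) return null;
    for (KeyValue kv : kvs) {
      if (Bytes.equals(kv.getQualifier(), Bytes.toBytes("payload"))) return kv.getValue();
    }
    return null;
  }
}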
23. Aggregation Queries
• Compute the frequency of mime types in the collection
• For a web domain, compute spammicity using word distribution and out-links
• Based on a two-step aggregation process
– Data is extracted in the Map(), emitted with the View signature
– Data is collected in the Reduce(), grouped on the combination (view sig., aggr. key) and aggregated.
24. Aggregation Queries Processing
• Processed using MapReduce; multiple (compatible) aggregations at once (reading is the most expensive part)
• Aggregation map phase: List<Pair> map(Row r), with Pair = <Agg_key, value> and Agg_key = <agg_id, agg_on_value, …>
• Aggregation reduce phase: reduce(Agg_key, values, context) (sketched below)
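As an illustration, a hedged sketch of the two phases for the mime-type-per-PLD aggregation of the next slide, written as a plain TableMapper/Reducer pair; the aggregation-key encoding, the CF/qualifier names, and the naive PLD extraction are simplifying assumptions.

import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: extract the values and emit them under Agg_key = <agg_id, agg_on_value, ...>.
class MimePerPldMapper extends TableMapper<Text, LongWritable> {
  private static final LongWritable ONE = new LongWritable(1);

  protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
      throws IOException, InterruptedException {
    String url  = Bytes.toString(rowKey.get());
    String mime = Bytes.toString(row.getValue(Bytes.toBytes("meta"), Bytes.toBytes("detectedMime")));
    if (mime == null) return;                                       // skip rows without a detected mime type
    String pld  = url.contains("/") ? url.substring(0, url.indexOf('/')) : url;  // naive PLD (assumption)
    ctx.write(new Text("mime_per_pld|" + pld + "|" + mime), ONE);
  }
}

// Reduce phase: group on the (view signature, aggregation key) combination and aggregate.
class AggregationReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  protected void reduce(Text aggKey, Iterable<LongWritable> values, Context ctx)
      throws IOException, InterruptedException {
    long count = 0;
    for (LongWritable v : values) count += v.get();
    ctx.write(aggKey, new LongWritable(count));
  }
}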
25. Aggregation Processing Numbers
Compute the mime type distribution of web pages per PLD (pay-level domain):
SELECT pld, mime, count(*) FROM web_collection
GROUP BY extract_pld(url), mime
26. Data Ingestion
Our crawler asynchronously writes to HBase:
Input: archive files (ARC, WARC) in HDFS
Output: HTable
SELECT *, f1(*), …, fn(*) FROM hdfs://path/*.warc.gz
1. Pre-compute the region split boundaries on a data sample (sketched below)
– MapReduce on a sample of the input data
2. Process a batch (~0.5 TB) with a MapReduce ingestion job
3. Manually split regions that have grown too big (or are candidates to)
4. If there is still input, go to 2.
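A sketch of steps 1 and 3 under the assumption that the split keys were already sampled from the input batch by the preparatory MapReduce job; the table and CF names are placeholders, and the 0.92-era admin API is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PreSplitTable {
  // Step 1: splitKeys = one sampled row key per target region boundary.
  public static void create(byte[][] splitKeys) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("web_collection");   // assumed table name
    desc.addFamily(new HColumnDescriptor("content"));
    desc.addFamily(new HColumnDescriptor("meta"));
    desc.addFamily(new HColumnDescriptor("analysis"));
    admin.createTable(desc, splitKeys);                               // regions pre-split before the batch load
  }

  // Step 3: manually split a region that has grown too big after a batch.
  public static void splitRegion(HBaseAdmin admin, String regionName) throws Exception {
    admin.split(regionName);
  }
}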
27. Data Ingestion Numbers
• Store indexable web resources from WARC files into HBase; detect the mime type and language, extract the plain text, and analyze RSS feeds
• Reaching a steady 40 MB/s including extraction
• Upper bound 170 MB/s (distributed reading of archive files in HDFS)
• HBase is idle most of the time!
– Allows compacting store files in the meantime
28. Web Archive Collection
• Column families (basic set):
1. Raw content (payload)
2. Meta data (mime, IP, …)
3. Baseline analytics (plain text, detected mime, …)
• Usually one additional CF per analytics result
• CFs as secondary indices:
– All analyzed feeds in one place (no need for a filter if I am interested in all such rows; see the sketch below)
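A short sketch of the "CF as secondary index" idea: scanning with only an analytics family projected returns exactly the rows that have at least one cell in that family, with no filter required. The table name and the "rss" family are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanFeeds {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "web_collection");   // assumed table name
    Scan s = new Scan();
    s.addFamily(Bytes.toBytes("rss"));                   // assumed CF holding the RSS analysis results
    ResultScanner feeds = table.getScanner(s);
    for (Result r : feeds) {
      System.out.println(Bytes.toString(r.getRow()));    // only rows with RSS analytics come back
    }
    feeds.close();
    table.close();
  }
}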
29. Web Archive Collection II
• More than 3,000 regions (in one collection)
• 12 TB of compressed indexable data (and counting)
• Crawl-to-store/process machine ratio is 1:1.2
• Storage scales out
30. HW Architecture
• Tens of small, low-consumption nodes with a lot of disk space:
– 15 TB per node, 8 GB RAM, dual-core CPU
– No enclosure -> no active cooling -> no expensive datacenter-like environment needed
• Low per-PB storage price (70 nodes/PB), car batteries as UPS, commodity (really low-cost) HW (esp. disks)
• Still some reasonable computational power
31. Conclusions
• Conclusions
– Data refinery platform
– Customizable, extensible
– Large scale
• Future work
– Incorporating external secondary indices to filter
HBase rows/cells
• Full text index filtering
• Temporal filtering
– Larger-scale (100s of TBs) deployment
32. Acknowledgments
• European Union projects:
– LAWA: Longitudinal Analytics of Web Archive data
– SCAPE: SCAlable Preservation Environments