5. Tackling growth in the volume, velocity and variety of data
Big Data, Social, Mobility, Cloud
80% growth of unstructured data is predicted over the next five years.
1.8 zettabytes of digital data were in use worldwide in 2011, up 30% from 2010.
70% of U.S. smartphone owners regularly shop online via their devices.
44% of users (350M people) access Facebook via mobile devices.
50% of millennials use mobile devices to research products.
60% of U.S. mobile data will be audio and video streaming by 2014.
2/3 of the world's mobile data traffic will be video by 2016.
33% of BI will be consumed via handheld devices by 2013.
Gaming consoles are now used an average of 1.5 hrs/wk to connect to the Internet.
1 in 4 Facebook users add their location to posts (2B/month).
500M Tweets are hosted on Twitter each day.
38% of people recommend a brand they “like” or follow on a social network.
Brands get 100M Facebook “likes” per day.
6. Feature             RDB                              Big Data
   Data type           Structured data                  Unstructured data
   Schema              Static: required at write time   Dynamic: applied at read time
   Read/write pattern  Repeated reads and writes        Write once, read many times
   Storage volume      Gigabytes to terabytes           Terabytes, petabytes, and beyond
   Scalability         Scale up                         Scale out
   Economics           Expensive hardware and software  Commodity hardware and open source
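The schema row is the least intuitive part of this comparison, so here is a minimal sketch of what "schema on read" can look like. The function name `readWithSchema` and the tab-delimited layout are assumptions made for illustration; they do not come from the slides or from any specific product.

```javascript
// Hypothetical helper: raw lines are stored untouched, and the column
// names are supplied only when the data is read.
function readWithSchema(rawLines, columns) {
  return rawLines.map(function (line) {
    var fields = line.split("\t");
    var row = {};
    columns.forEach(function (name, i) {
      // Columns missing from the raw line become null instead of
      // causing the read to fail.
      row[name] = i < fields.length ? fields[i] : null;
    });
    return row;
  });
}
```

The same stored lines can be read as `["date", "tag", "count"]` today and with an extra column tomorrow, with no rewrite of the data. A relational table would instead reject or migrate rows at write time.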
48. // Map function - runs on all nodes
var map = function (key, value, context) {
    // Split the input line on every character that is not a letter, digit, or "#"
    var hashtags = value.split(/[^0-9a-zA-Z#]/);
    // Loop through the tokens, writing a count of 1 for each one beginning with "#"
    for (var i = 0; i < hashtags.length; i++) {
        if (hashtags[i].charAt(0) === "#") {
            context.write(hashtags[i].toLowerCase(), 1);
        }
    }
};

// Reduce function - runs on reducer node(s)
var reduce = function (key, values, context) {
    var sum = 0;
    // Sum the counts of each tag found in the map function
    while (values.hasNext()) {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
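To try the map and reduce functions above without a cluster, a small local harness can stand in for the framework. This is only a sketch: `runLocally` is a hypothetical name, and the `context.write` and `hasNext()`/`next()` shapes are inferred from how the slide's code uses them, not taken from the platform's documentation.

```javascript
// Hypothetical local harness; not part of Hadoop or HDInsight.
function runLocally(map, reduce, lines) {
  // Shuffle stage: collect every value that map writes under its key.
  var groups = Object.create(null);
  var mapContext = {
    write: function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    }
  };
  lines.forEach(function (line, offset) {
    map(offset, line, mapContext);
  });

  // Reduce stage: give each key an iterator exposing hasNext()/next(),
  // which is the interface the reduce function expects.
  var results = Object.create(null);
  var reduceContext = {
    write: function (key, value) {
      results[key] = value;
    }
  };
  Object.keys(groups).sort().forEach(function (key) {
    var values = groups[key];
    var position = 0;
    var iterator = {
      hasNext: function () { return position < values.length; },
      next: function () { return values[position++]; }
    };
    reduce(key, iterator, reduceContext);
  });
  return results;
}
```

Calling `runLocally(map, reduce, ["Loving #BigData and #hadoop"])` returns an object keyed by lowercase hashtag. The real framework distributes the map calls across nodes and sorts the keys between the two stages, but the contract each function sees is the same.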
49. -- load tweets
Tweets = LOAD 'asv://uploads/data' AS (date, id, author, tweet);
-- split tweet into words
TweetWords = FOREACH Tweets GENERATE date, FLATTEN(TOKENIZE(tweet)) AS tag, id;
-- filter words to find hashtags
Tags = FILTER TweetWords BY tag matches '#.*';
-- clean tags by removing periods and converting to lowercase
CleanTags = FOREACH Tags GENERATE date, LOWER(REPLACE(tag, '\\.', '')) AS tag, id;
-- group tweets by date and tag
GroupedTweets = GROUP CleanTags BY (date, tag);
-- count tag mentions per group
CountedTagMentions = FOREACH GroupedTweets GENERATE group, COUNT(CleanTags.id) AS mentions;
-- flatten the group to generate columns
TagMentions = FOREACH CountedTagMentions GENERATE FLATTEN(group) AS (date, tag), mentions;
-- load the top tags found by map/reduce previously
TopTags = LOAD 'asv://results/countedtags/part-r-00000' AS (toptag, totalcount:long);
-- join tweets and top tags based on matching tag
TagMentionsAndTopTags = JOIN TagMentions BY tag, TopTags BY toptag;
-- get the date, tag, totalcount, and mentions columns
TagMentionsAndTotals = FOREACH TagMentionsAndTopTags GENERATE date, tag, totalcount, mentions;
-- sort by date and mentions
SortedTagMentionsAndTotals = ORDER TagMentionsAndTotals BY date, mentions;
-- store the results as a file
STORE SortedTagMentionsAndTotals INTO 'asv://results/dailytagcounts';
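The first half of the Pig script (tokenize, filter, clean, group, count) can be easier to follow as ordinary code. The sketch below approximates those steps in JavaScript on an in-memory array. `dailyTagMentions` is a hypothetical name, and the delimiter set is an approximation of Pig's TOKENIZE (whitespace, double quote, comma, parentheses, asterisk) rather than an exact reimplementation.

```javascript
// Hypothetical helper approximating the TweetWords-to-TagMentions steps.
// Each input item is assumed to look like { date, id, author, tweet }.
function dailyTagMentions(tweets) {
  var counts = new Map(); // key: date + "\t" + tag, value: mentions
  tweets.forEach(function (t) {
    // TOKENIZE + FLATTEN: one word per output row.
    t.tweet.split(/[\s",()*]+/).forEach(function (word) {
      // FILTER ... matches '#.*': keep only words starting with "#".
      if (word.charAt(0) !== "#") { return; }
      // LOWER(REPLACE(tag, '\\.', '')): drop periods, lowercase the tag.
      var tag = word.replace(/\./g, "").toLowerCase();
      // GROUP BY (date, tag) + COUNT: one increment per occurrence.
      var key = t.date + "\t" + tag;
      counts.set(key, (counts.get(key) || 0) + 1);
    });
  });
  // FLATTEN(group): turn each (date, tag) key back into columns.
  return Array.from(counts, function (entry) {
    var parts = entry[0].split("\t");
    return { date: parts[0], tag: parts[1], mentions: entry[1] };
  });
}
```

A tag that appears twice in one tweet is counted twice, which matches the script: FLATTEN produces one row per word before the grouping. The join against the top tags and the final ordering are left out of this sketch.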
50. CREATE EXTERNAL TABLE dailytwittertags
    (tweetdate STRING,
     tag STRING,
     totalcount INT,
     daycount INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE LOCATION 'asv://tables/dailytagcount';