Hadoop World 2011: Large Scale Log Data Analysis for Marketing in NTT Communications

1. Large-Scale Log Analysis for Marketing
Kenji Hara / Yukio Uematsu, Innovative IP Architecture Center, NTT Communications Corporation
Copyright © 2011 NTT Communications Co., Ltd. All Rights Reserved.

3. NTT Group, NTT Communications Corporate Structure
[Group chart, labels in slide order: 100%; 100%; US$12.9B revenue (global data, Internet access, voice, IT); US$24.4B revenue (local telecom); Nippon Telegraph and Telephone; 100%, US$21.9B revenue (local telecom); 66.4%, US$52.8B revenue (mobile); 54.2%, US$14.5B revenue (system integration).]
[Organization chart: Second Sales Division; First Sales Division; Global Sales Division; ...; Video & Voice Division; Network Services Division; Cloud Services Division; Applications and Content Division; Solutions Division; Customer Services Division; Service Infrastructure Division; Systems Division; Corporate Planning Division; Finance Division; ...; Innovative IP Architecture Center; Staff; Operation; Product; R&D.]

4. NTT Group, NTT Communications Corporate Structure
[Same group chart as slide 3.] Technical Support, SI Partnership.

5. BizCITY: Cloud Services provided by NTT Communications
[Service diagram, labels in slide order: high-speed backbone between datacenters; global NW; secure connectivity; Internet/IP phone; VPN service; ICT outsourcing; firewall; guaranteed, burst, best effort; domestic, international; BizHosting (virtual server hosting); BizMail (WebMail, scheduler); SaaS (CRM/SFA); Internet; BizStorage, online storage, multi-layer analysis; BizMarketing, big data (user log); mobile access, mobile thin client, ubiquitous office, remote access, IP phone.]
Highlighted as Big Data Analysis: BizStorage, online storage, multi-layer analysis; BizMarketing, big data (user log).

6. Big Data in BizCITY
BizStorage: secure, high-capacity storage service. Private data analysis using natural language processing and statistics.
BizMarketing: mining data for marketing. Use Hadoop for "enormous" user log analysis.
[Diagram labels: user log; private data; online storage; multi-layer analysis; access log; CGM log; query log; application data; feature; next target.]

7. We provide a "cloud" service for marketing!!! Hadoop in cloud!!!!

8. Hadoop in BizMarketing
Web Access Analysis and CGM Analysis both run on Hadoop!!
Many join operations. Increasing data!! Requirement for scalability.
[Chart: tweets per day, Jan 2009 to July 2011.]

9. CGM Analysis in Biz Marketing
"BuzzFinder" supports marketing activity using customers' feedback in social media.
[Diagram labels: crawl (tweets, blogs); search; collect; BuzzFinder. Users: marketer, advertiser, promoter, R&D. Outputs: branding, ads' results, company reputations, difference from other companies.]

10. Data Flow in BuzzFinder
[Diagram: PostgreSQL; Hadoop cluster (NLP and statistics by Map/Reduce); PostgreSQL.]

11. Map/Reduce in BuzzFinder
CGM data: large size per record, small number of records (x mil/day). Map is costly (mainly because of NLP).
[Diagram stages: Map (NLP); Map (data extract); Reduce (statistics). Features: keyword, sentiment, location, topic, search index. Extracted data: keywords, sentiment, locations, topics, index data. Statistics: keyword count, topic count, sentiment count, location count. Other labels: linguistic and user data; customer keywords.]

12. Output of BuzzFinder: Keyword Trend
Trends of specified keywords in Twitter: "Nuclear Power Plant" and "Earthquake".
[Chart: series "Earthquake" and "Nuclear Power Plant"; axis ticks 50,000 and 100,000; callouts 18,565 tweets/day and 65,642 tweets/day.]
Many tweets about "Earthquake" on the 11th of each month.
Callout: "Heavy white smoke from Fukushima No.1 nuclear power plant." 95,271 tweets.

13. Output of BuzzFinder: Topic Analysis
Popular topics about specified keywords in Twitter.
Topics about "Nuclear Power Plant" in September: Tokyo Electric Power, Japan, Nuclear Accident, Fukushima, Noda.

14. Output of BuzzFinder: Location Analysis
Location analysis of "Nuclear Power Plant".
[Map: tweet density from few to many; marked regions: disaster area, Tokyo area.]
Many tweets come from big cities and the disaster area.

15. Output of BuzzFinder: Sentiment Analysis
Sentiment analysis of "Nuclear Power Plant".
[Chart: APR 2011 and AUG 2011; values 48.4%, 51.6%, 47.5%, 52.5%; legend: positive, negative.]
The sentiment toward "Nuclear Power Plant" became more negative from April (1 month after the earthquake) to August. It is also more negative than the average sentiment (70% positive).

16. Hadoop in Biz Marketing
Web Access Analysis and CGM Data Analysis both run on Hadoop!!
[Chart axis: Jan 2009 to July 2011.] Increasing data!!
Tweets per day. Many join operations. Requirement for scalability.

17. Web Access Analysis
Click-stream-based analysis to find out Internet users' behavior inside the site.
Example: why did users leave without converting?

19. Hadoop for PaaS Services
Want to reduce the cost!
Map/Reduce speeding-up techniques: 1. Summation. 2. OLAP (multi-join processing).
[Diagram: normal Hadoop cluster vs. high-speed Hadoop cluster; server reduction at the same speed.]

20. [Image labels: our cluster; normal cluster.] Elephant in Cloud runs FAST!!

21. Strategies for Cost Reduction
Statistics (summation): Map Multi-Reduce *
  Record reduce: HashMap-based pre-combining before the combiner. Advantages: 1) efficient combining by HashMap; 2) fewer spill operations.
  Local reduce: combining mapper outputs on the same server. Advantage: less data to shuffle.
OLAP (join): Pjoin **
  Join with pre-partitioning and semi-join. Advantage: efficient for multi-table joins.
*, ** "Map Multi-Reduce" and "Pjoin" were developed in NTT labs; the source code is currently closed.

22. Map Multi-Reduce/Record Reduce
Record reduce is a pre-combining function applied before the combiner. Pre-combining in the map function reduces the number of spill operations.
[Diagram, normal map/reduce: input; Map; MapOutputBuffer; sort & spill; spill files; mergeParts; output.]
[Diagram, map/reduce with record reduce: the same pipeline with an added record reduce step and a smaller output buffer.]
[Legend: map task; reduce task; server; process; file.]

23. Map Multi-Reduce/Local Reduce
Local reduce pre-reduces data on the same server before the combiner function.
[Diagram: the user program forks a master and workers; the master assigns map, local reduce, and reduce tasks; map workers read input splits 0 to 4 and write locally; reduce workers read remotely, sort, and write output files 0 and 1; local reduce tasks run on each server. Legend: server; process; file.]
Twice as fast as the normal cluster.

26. Pjoin/Join using Semi-Join View
Pre-processing prepares redundant data for multiple joins.
The join runs on the map side using pre-partitioning, and only the rest of the join runs on the reduce side.
Tables: pageinfo (site description data) and click_strm. Keys: pageinfo primary key and foreign key (click_strm primary key).
[Diagram, pre-processing: pageinfo and click_strm are combined into pageinfo_click_strm 1 ... pageinfo_click_strm n.]
[Diagram, query execution: mappers run click_strm processing + semi-join over click_strm 1 ... click_strm n and pageinfo_click_strm 1 ... pageinfo_click_strm n; reducers join with pageinfo a ... pageinfo z; labels hash(x), hash(y), DFS read, shuffle.]

27. Experimental Evaluation (Pjoin)
Join processing on a 1 TB access log using Pjoin, to verify its effectiveness.
[Chart: Pjoin vs. Hive (reduce-side join); x-axis: no. of servers; y-axis: processing time (min); series: Pjoin (50 servers), Hive (50 servers), Pjoin (20 servers).]
50 servers (normal Hadoop cluster) = 23 servers (Pjoin-applied cluster): same speed!!
HiveQL:
    insert overwrite table q1_result
    select count(distinct s_sessionseqid)
    from clckstrm c
    join page p
      on c.c_pageseqid = p.p_pageseqid and p.p_url like '%blog.goo.ne.jp%'
    join session_info s
      on s.s_clckstrmseqid = c.c_clckstrmseqid and s.s_referer like '%QUERY%';
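
The Pjoin slides describe a join built on pre-partitioning and a semi-join, but the Pjoin source is closed, so the following is only a minimal single-process Python sketch of those two general ideas, not NTT's implementation. It assumes integer join keys and in-memory lists of dicts; the function names (`pre_partition`, `semi_join_keys`, `map_side_join`, `pjoin_like`) and the partition count are hypothetical. The column names are borrowed from the HiveQL query in slide 27.

```python
NUM_PARTITIONS = 4  # hypothetical partition count, chosen for illustration


def partition_of(key, n=NUM_PARTITIONS):
    """Assign an integer join key to a partition (stand-in for hash(key) % n)."""
    return key % n


def pre_partition(rows, key_field, n=NUM_PARTITIONS):
    """Split rows into n buckets by join key, so matching keys share a bucket."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[partition_of(row[key_field], n)].append(row)
    return parts


def semi_join_keys(pages, url_fragment):
    """Semi-join step: keep only the keys of pages that pass the URL filter."""
    return {p["p_pageseqid"] for p in pages if url_fragment in p["p_url"]}


def map_side_join(click_part, page_part, wanted_keys):
    """Join one click partition with its co-located page partition.

    Clicks whose page key is not in wanted_keys are dropped before joining,
    so they would never need to be shuffled to a reducer.
    """
    pages_by_id = {p["p_pageseqid"]: p for p in page_part}
    for click in click_part:
        page_id = click["c_pageseqid"]
        if page_id in wanted_keys and page_id in pages_by_id:
            yield {**click, **pages_by_id[page_id]}


def pjoin_like(clicks, pages, url_fragment, n=NUM_PARTITIONS):
    """Run the semi-join filter, then join partition by partition."""
    wanted = semi_join_keys(pages, url_fragment)
    click_parts = pre_partition(clicks, "c_pageseqid", n)
    page_parts = pre_partition(pages, "p_pageseqid", n)
    joined = []
    for click_part, page_part in zip(click_parts, page_parts):
        joined.extend(map_side_join(click_part, page_part, wanted))
    return joined
```

Because both tables are bucketed with the same function on the same key, each bucket pair can be joined independently, which is what lets this part of the work stay on the map side. In a real cluster the partitioning would be done once during pre-processing and reused across queries; here it is recomputed on each call only to keep the sketch self-contained.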