News and Blog Analysis
      with Lydia
 Charles Ward – Stony Brook University
 Karthik Balaji, Levon Lloyd – General Sentiment




October 2nd, 2009
Outline

   Lydia System Overview
   News Analysis Examples
   Data and Workflow Organization
   Data Access Interface
   Conclusion
Large-Scale News/Blog Analysis
  The Lydia news/blog analysis system performs a daily
  analysis of more than 1,000 English and foreign-language
  online newspapers, plus blogs and other text sources.
  We currently track tens of millions of named entities in
  the news and blogs, providing spatial, temporal,
  relational and sentiment analysis.
  Customers track entities of interest using reports
  generated in our user interface.
Lydia Text Analysis Phases
  Lydia performs named entity recognition and analysis over large
  text corpora.
      Spidering: Lydia spiders and parses thousands of online news
      sources. We also handle the feed of social media provided by
      Spinn3r.
      Named Entity Recognition: Lydia identifies and classifies
      occurrences of named entities (people, places, companies,
      etc.)
      Sentiment Analysis: Lydia assigns sentiment scores to
      identified entities using shallow NLP techniques.
      Entity Statistics Aggregation: Lydia digests marked-up text
      and produces usable entity statistics.
      Data Exploration: Aggregated entity statistics are made
      available through user interfaces and programming APIs for
      detailed exploration of the data.
Lydia Architecture
Frequency Time Series
  Michael Vick references (2004-2009)
  Mel Gibson references (2004-2009)
Sentiment Analysis
  Michael Phelps sentiment score (June 2008-Feb 2009)
  David Paterson sentiment score (Jan 2008-Jul 2009)
Comparative Analysis
  Peyton Manning vs. Eli Manning
Heatmaps
  Arnold Schwarzenegger
  Alabama
Ethnic Biases in News Coverage
  Frequency of coverage of entities with Hispanic names in the
  U.S. news, 2004-2008
  Percentage of population self-reporting as Hispanic in the 2000
  census. Courtesy of Wikipedia.
Ethnic Biases in News Coverage
  (a) African
  (b) Hispanic
  (c) East Asian
  (d) Indian
  (e) Eastern European
  (f) Muslim
Juxtaposition Analysis
   Top Juxtapositions for Barack Obama
   Juxtapositions between Barack Obama and John McCain
Hadoop in Lydia
  The legacy Perl NLP pipeline runs in parallel on
  Hadoop Streaming, generating articles with marked-up
  entities which are stored as compressed XML in
  HDFS.
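
  A minimal sketch of how a Hadoop Streaming job of this kind could be
  structured, written in Python rather than the Perl the pipeline actually
  uses. Hadoop Streaming passes input lines to the mapper on stdin and
  expects tab-separated key/value lines on stdout. The `<entity>` markup
  and the function names are hypothetical; the slides do not show Lydia's
  XML schema.

```python
import re
from itertools import groupby

# Hypothetical markup (not Lydia's real schema):
#   <entity type="PERSON">Barack Obama</entity>
ENTITY_RE = re.compile(r'<entity type="([^"]+)">([^<]+)</entity>')


def map_lines(lines):
    """Emit one 'name<TAB>1' record per entity mention."""
    for line in lines:
        for _etype, name in ENTITY_RE.findall(line):
            yield f"{name}\t1"


def reduce_lines(sorted_lines):
    """Sum the counts per key; input must already be sorted by key,
    which is what Hadoop provides to a streaming reducer."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for name, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{name}\t{sum(int(count) for _, count in group)}"
```

  In a real streaming job the two functions would live in separate scripts
  wired to stdin/stdout and passed to Hadoop Streaming via its `-mapper`
  and `-reducer` options.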

  To build or update Lydia entity statistics and indexes
  for a single text corpus, over 80 map-reduce jobs are
  necessary.

  We have developed a custom workflow management
  framework in Amazon EC2 to manage our data and
  processing.
Lydia Workflow Framework
 High-level concepts:
   A depository is a statistics dataset derived from a
   text corpus. It consists of artifacts.
     Stored as a directory structure in HDFS
   An artifact is a homogeneous dataset of a specific
   type.
     Examples:
        Key-value artifacts, e.g. entity name -> frequency time
        series
        Lucene index artifacts (entity and article indexes)
     Stored as a directory in HDFS containing several map-reduce
     job output subdirectories named after date ranges
     (we do updates at daily granularity).
Artifact Dependencies
 Most Lydia artifacts are derived from other artifacts:
Artifact Storage

 Lydia artifacts are stored in HDFS inside the
   depository directory:
   /dailies (depository name)
     /EXACT_DUP_ARTICLES (artifact name)
        /2004_11_01-2009_03_31 (date range-named MR output)
           /part-00000
           ...
           /part-00017
        /2009_04_01-2009_04_02
           ...
        /2009_04_03. . .
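
 The date-range directory names above can be turned into comparable
 values with a few lines of code. This is a sketch only: `DateRange` and
 `parse_range` are hypothetical names, and it assumes every output
 directory uses the two-date `YYYY_MM_DD-YYYY_MM_DD` form shown in the
 complete entries of the listing.

```python
from datetime import date, datetime
from typing import NamedTuple


class DateRange(NamedTuple):
    start: date
    end: date  # inclusive


def parse_range(dirname: str) -> DateRange:
    """Parse a name such as '/2004_11_01-2009_03_31' into a DateRange."""
    start_s, end_s = dirname.strip("/").split("-")
    start = datetime.strptime(start_s, "%Y_%m_%d").date()
    end = datetime.strptime(end_s, "%Y_%m_%d").date()
    if start > end:
        raise ValueError(f"range ends before it starts: {dirname!r}")
    return DateRange(start, end)
```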
Job Input Selection
   Artifact updates are incrementally propagated through
   the dependency graph:




   Multiple date ranges (sometimes overlapping) typically
   exist for each artifact.
   Some small artifacts get fully rebuilt on every update.
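
   One plausible reading of incremental input selection, sketched under
   assumptions: the slides do not give the actual rule, so the functions
   below (hypothetical names) simply find the days that an upstream
   artifact covers and a downstream artifact does not. Overlapping ranges
   are handled by expanding each range into a set of days.

```python
from datetime import date, timedelta


def days_in(ranges):
    """Expand inclusive (start, end) date ranges into a set of days."""
    days = set()
    for start, end in ranges:
        day = start
        while day <= end:
            days.add(day)
            day += timedelta(days=1)
    return days


def pending_days(upstream_ranges, downstream_ranges):
    """Days present upstream but not yet reflected downstream, sorted."""
    return sorted(days_in(upstream_ranges) - days_in(downstream_ranges))
```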
Depository Build Scheduling
   The same tool is used for the initial depository build
   and for updating it with new data.

   Any set of target artifacts to build can be specified,
   similarly to a makefile. Prerequisites of the targets are
   automatically identified.

   Artifacts are built in the correct order according to
   dependencies.

   The build process runs as a sequence of Hadoop
   map-reduce jobs and occasional serial jobs.
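
   The makefile-like behaviour described above can be sketched with a
   topological sort. Only EXACT_DUP_ARTICLES is an artifact name taken
   from these slides; the other names and every dependency edge in `DEPS`
   are hypothetical, and `build_order` is an illustration rather than the
   actual scheduler. It relies on `graphlib`, available in Python 3.9 and
   later.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: artifact -> artifacts it is derived from.
DEPS = {
    "ARTICLES": set(),
    "EXACT_DUP_ARTICLES": {"ARTICLES"},
    "ENTITY_FREQUENCY": {"EXACT_DUP_ARTICLES"},
    "ENTITY_INDEX": {"ENTITY_FREQUENCY"},
    "ARTICLE_INDEX": {"EXACT_DUP_ARTICLES"},
}


def build_order(targets, deps=DEPS):
    """Return the targets plus all their prerequisites, prerequisites first."""
    needed, stack = set(), list(targets)
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps[name])  # walk back through prerequisites
    subgraph = {name: deps[name] for name in needed}
    return list(TopologicalSorter(subgraph).static_order())
```

   Artifacts that are not prerequisites of the requested targets are left
   out of the plan, as with a makefile.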
Amazon EC2
  We run Hadoop on Amazon EC2.
  – Quickly scale capacity as requirements change.
  10 extra-large nodes for weekly data processing.
  Amazon S3 is our persistent data store.
  All our web services are hosted on dedicated Amazon
  nodes.
  S3 is not meeting our required level of service
   – Moving to EBS
Depository Server
  Random access to the Lydia depository, e.g.:
    Monthly frequency time series of Barack Obama in all
    U.S. sources
    Top juxtapositions for Continental Airlines in February
    2009
    Sentiment time series for Michael Phelps in all U.S.
    sources
  Uses the mapfiles generated by map-reduce jobs.
  It is not currently distributed (but we can put
  different depositories on different machines).
  Provides a caching subsystem to reduce the
  number of HDFS accesses.
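
  The slides do not say which caching policy the server uses. As an
  illustration only, a small least-recently-used cache placed in front of
  a slow lookup could look like the sketch below; `LookupCache` and the
  `fetch` callback are hypothetical names, with `fetch` standing in for a
  read from a mapfile in HDFS.

```python
from collections import OrderedDict


class LookupCache:
    """Tiny LRU cache in front of a slow lookup function."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch            # stand-in for a mapfile read
        self._capacity = capacity
        self._entries = OrderedDict()  # key -> value, oldest first
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```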
Artifact Date Range Merging

   The depository server combines results from
   multiple groups of mapfiles on the fly.
   (MR output = date range = mapfile group)
     This may result in performance problems and
     memory shortage (direct memory buffers).
     Solution: limit the number of covering date ranges
     to be O(log N) after N daily updates.
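
   The slides state the O(log N) bound but not the merging scheme. One
   well-known way to obtain it, shown here as an assumption rather than
   as Lydia's method, is a binary-counter merge: after each daily update,
   adjacent ranges of equal length are merged, so the number of ranges
   after N updates equals the number of 1 bits in N. The function name is
   hypothetical, and ranges are represented only by their length in days.

```python
def add_daily_update(ranges):
    """Append a one-day range, then merge equal-sized neighbours.

    `ranges` is a list of range lengths in days, oldest first.
    """
    ranges.append(1)
    while len(ranges) >= 2 and ranges[-1] == ranges[-2]:
        merged = ranges.pop() + ranges.pop()
        ranges.append(merged)  # stands in for one merging map-reduce job
    return ranges
```

   After 100 daily updates this leaves three ranges covering 64, 32 and
   4 days, so a lookup touches three mapfile groups instead of 100.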
Conclusion

  Great improvement (up to 20x) in the
  Lydia system performance and
  scalability from using Hadoop.
  Lydia w/ Hadoop makes new types of
  automated analysis of web-scale content
  possible.

MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...Zeshan Sattar- Assessing the skill requirements and industry expectations for...
Zeshan Sattar- Assessing the skill requirements and industry expectations for...
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
Microsoft 365 Copilot: How to boost your productivity with AI – Part one: Ado...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security ObservabilityGlenn Lazarus- Why Your Observability Strategy Needs Security Observability
Glenn Lazarus- Why Your Observability Strategy Needs Security Observability
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 

Hw09 Understanding Natural Language

  • 1. News and Blog Analysis with Lydia Charles Ward – Stony Brook University Karthik Balaji, Levon Lloyd – General Sentiment October 2nd, 2009
  • 2. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 3. Large-Scale News/Blog Analysis The Lydia news/blog analysis system performs a daily analysis of more than 1,000 English and foreign-language online newspapers, plus blogs and other text sources. We currently track tens of millions of named entities in the news and blogs, providing spatial, temporal, relational and sentiment analysis. Customers track entities of interest using reports generated in our user interface.
  • 4. Lydia Text Analysis Phases Lydia performs named entity recognition and analysis over large text corpora. Spidering: Lydia spiders and parses thousands of online news sources. We also handle the feed of social media provided by Spinn3r. Named Entity Recognition: Lydia identifies and classifies occurrences of named entities (people, places, companies, etc.) Sentiment Analysis: Lydia assigns sentiment scores to identified entities using shallow NLP techniques. Entity Statistics Aggregation: Lydia digests marked-up text and produces usable entity statistics. Data Exploration: Aggregated entity statistics are made available through user interfaces and programming APIs for detailed exploration of the data.
  • 6. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 7. Frequency Time Series Michael Vick references (2004-2009) Mel Gibson references (2004-2009)
  • 8. Sentiment Analysis Michael Phelps sentiment score (June 2008-Feb 2009) David Paterson sentiment score (Jan 2008-Jul 2009)
  • 9. Comparative Analysis Peyton Manning vs. Eli Manning
  • 11. Ethnic Biases in News Coverage Left: frequency of coverage of entities with Hispanic names in the U.S. news, 2004-2008. Right: percentage of population self-reporting as Hispanic in the 2000 census (courtesy of Wikipedia).
  • 12. Ethnic Biases in News Coverage (a) African (b) Hispanic (c) East Asian (d) Indian (e) Eastern European (f) Muslim
  • 13. Juxtaposition Analysis Top Juxtapositions for Barack Obama Juxtapositions between Barack Obama and John McCain
  • 14. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 15. Hadoop in Lydia The legacy Perl NLP pipeline runs in parallel on Hadoop Streaming, generating articles with marked-up entities which are stored as compressed XML in HDFS. To build or update Lydia entity statistics and indexes for a single text corpus, over 80 map-reduce jobs are necessary. We have developed a custom workflow management framework in Amazon EC2 to manage our data and processing.
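As a rough illustration of the Hadoop Streaming model mentioned on this slide, the following sketch shows a mapper and reducer that turn entity mentions into per-entity daily frequency series. It is written in Python rather than the Perl used by Lydia, and the input format (one `day<TAB>entity` line per mention), the output format and the function names `map_lines` and `reduce_lines` are all assumptions made for the example, not Lydia's actual record layout.

```python
from collections import Counter
from itertools import groupby


def map_lines(lines):
    """Mapper: turn 'day<TAB>entity' lines into 'entity<TAB>day' records.

    The input layout is hypothetical; Lydia's real input is marked-up XML.
    """
    for line in lines:
        day, _, entity = line.rstrip("\n").partition("\t")
        if entity:
            yield f"{entity}\t{day}"


def reduce_lines(sorted_lines):
    """Reducer: build a daily frequency series for each entity.

    Hadoop Streaming delivers mapper output sorted by key (the text before
    the first tab), so all records for one entity arrive together.
    """
    records = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for entity, group in groupby(records, key=lambda r: r[0]):
        counts = Counter(day for _, day in group)
        series = ",".join(f"{d}:{n}" for d, n in sorted(counts.items()))
        yield f"{entity}\t{series}"
```

In a streaming job each function would read from standard input and print its records; locally, sorting the mapper output before calling `reduce_lines` imitates the shuffle step.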
  • 16. Lydia Workflow Framework High-level concepts: A depository is a statistics dataset derived from a text corpus. It consists of artifacts and is stored as a directory structure in HDFS. An artifact is a homogeneous dataset of a specific type. Examples: key-value artifacts (e.g. entity name -> frequency time series) and Lucene index artifacts (entity and article indexes). An artifact is stored as a directory in HDFS containing several map-reduce job output subdirectories named as date ranges (we do updates at a daily granularity).
  • 17. Artifact Dependencies Most Lydia artifacts are derived from other artifacts:
  • 18. Artifact Storage Lydia artifacts are stored in HDFS inside the depository directory:
      /dailies (depository name)
        /EXACT_DUP_ARTICLES (artifact name)
          /2004_11_01-2009_03_31 (date range-named MR output)
            /part-00000
            ...
            /part-00017
          /2009_04_01-2009_04_02 ...
          /2009_04_03. . .
  • 19. Job Input Selection Artifact updates are incrementally propagated through the dependency graph: Multiple date ranges (sometimes overlapping) typically exist for each artifact. Some small artifacts get fully rebuilt on every update.
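One simple way to pick the inputs for an incremental update can be sketched as follows, using the date-range directory names shown on the Artifact Storage slide. This is only an assumption about how such selection might work: the helper names `parse_range` and `select_inputs` are hypothetical, the single-day name form is a guess, and the sketch assumes the downstream artifact has no gaps in its coverage.

```python
from datetime import date


def parse_range(name):
    """Parse a directory name such as '2004_11_01-2009_03_31'.

    A name holding a single date is treated as a one-day range; that form
    is an assumption made for this sketch.
    """
    days = [date(*map(int, part.split("_"))) for part in name.split("-")]
    return days[0], days[-1]


def select_inputs(upstream_dirs, downstream_dirs):
    """Return upstream date-range directories that contain days the
    downstream artifact does not cover yet."""
    covered_until = max(
        (parse_range(d)[1] for d in downstream_dirs), default=date.min
    )
    return sorted(d for d in upstream_dirs if parse_range(d)[1] > covered_until)
```

For example, if the downstream artifact only has the `2004_11_01-2009_03_31` output, an update would read just the newer upstream range `2009_04_01-2009_04_02` rather than the full history.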
  • 20. Depository Build Scheduling The same tool is used for the initial depository build and for updating it with new data. Any set of target artifacts to build can be specified, similarly to a makefile. Prerequisites of the targets are automatically identified. Artifacts are built in the correct order according to dependencies. The build process runs as a sequence of Hadoop map-reduce jobs and occasional serial jobs.
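The makefile-like behaviour described on this slide can be sketched with a topological sort. The dependency map below is hypothetical (only `EXACT_DUP_ARTICLES` appears in the slides; the other artifact names are invented for illustration), and the sketch uses Python's standard `graphlib` module rather than Lydia's own scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: artifact -> artifacts it is derived from.
# Only EXACT_DUP_ARTICLES is named in the slides; the rest are illustrative.
DEPENDENCIES = {
    "MARKED_UP_ARTICLES": set(),
    "EXACT_DUP_ARTICLES": {"MARKED_UP_ARTICLES"},
    "ENTITY_FREQUENCY": {"MARKED_UP_ARTICLES", "EXACT_DUP_ARTICLES"},
    "ENTITY_SENTIMENT": {"MARKED_UP_ARTICLES", "EXACT_DUP_ARTICLES"},
    "ENTITY_INDEX": {"ENTITY_FREQUENCY"},
}


def prerequisites(targets, deps):
    """Return the targets plus everything they transitively depend on."""
    needed, stack = set(), list(targets)
    while stack:
        artifact = stack.pop()
        if artifact not in needed:
            needed.add(artifact)
            stack.extend(deps[artifact])
    return needed


def build_order(targets, deps=DEPENDENCIES):
    """Order the required artifacts so each is built after its inputs."""
    needed = prerequisites(targets, deps)
    graph = {artifact: deps[artifact] for artifact in needed}
    return list(TopologicalSorter(graph).static_order())
```

Calling `build_order(["ENTITY_INDEX"])` yields only the artifacts that target needs, with each one listed after its prerequisites; each entry would then correspond to one or more map-reduce jobs.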
  • 21. Amazon EC2 We run Hadoop on Amazon EC2, which lets us quickly scale capacity as requirements change. We use 10 extra large nodes for weekly data processing. Amazon S3 is our persistent data store. All our web services are hosted on dedicated Amazon nodes. S3 is not meeting our required level of service, so we are moving to EBS.
  • 22. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 23. Depository Server Provides random access to the Lydia depository, e.g.: monthly frequency time series of Barack Obama in all U.S. sources; top juxtapositions for Continental Airlines in February 2009; sentiment time series for Michael Phelps in all U.S. sources. It uses the mapfiles generated by map-reduce jobs. It is currently not distributed (but we can put different depositories on different machines). It provides a caching subsystem to reduce the number of HDFS accesses.
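The slides do not say how the caching subsystem works. As one possible shape, the sketch below puts a least-recently-used cache in front of a slow lookup function; the class name `LRUCache`, the `fetch` callback (standing in for a mapfile read on HDFS) and the eviction policy are all assumptions for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Small least-recently-used cache in front of a slow lookup."""

    def __init__(self, fetch, capacity=1024):
        self._fetch = fetch            # slow lookup, e.g. a mapfile read
        self._capacity = capacity
        self._entries = OrderedDict()  # oldest entry first
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self._entries.move_to_end(key)     # mark as most recently used
            return self._entries[key]
        self.misses += 1
        value = self._fetch(key)
        self._entries[key] = value
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return value
```

Repeated queries for the same entity are then served from memory, and only cache misses reach the underlying store.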
  • 24. Artifact Date Range Merging The depository server combines results from multiple groups of mapfiles on the fly. (MR output = date range = mapfile group) This may result in performance problems and memory shortage (direct memory buffers). Solution: limit the number of covering date ranges to be O(log N) after N daily updates.
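The slide states the O(log N) goal but not the merging rule. One standard way to reach that bound, shown here purely as an assumption, is to merge the two newest ranges whenever they span the same number of days, much like carrying in a binary counter. The function name `add_daily_update` is hypothetical, and the sketch assumes contiguous, non-overlapping ranges.

```python
from datetime import date, timedelta


def add_daily_update(ranges, day):
    """Append a one-day range, then merge equal-length neighbours.

    `ranges` is a list of (start, end) date pairs, oldest first, covering
    a contiguous span. Merging behaves like incrementing a binary counter,
    so after N updates at most floor(log2(N)) + 1 ranges remain. In a real
    system each merge would be a job that rewrites two mapfile groups as one.
    """
    ranges = ranges + [(day, day)]
    while len(ranges) >= 2:
        (s1, e1), (s2, e2) = ranges[-2], ranges[-1]
        if (e1 - s1).days != (e2 - s2).days:
            break
        ranges[-2:] = [(s1, e2)]  # replace the pair with one merged range
    return ranges
```

With this rule the depository server never has to combine more than a logarithmic number of mapfile groups per query, at the cost of occasionally rewriting older data.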
  • 25. Outline Lydia System Overview News Analysis Examples Data and Workflow Organization Data Access Interface Conclusion
  • 26. Conclusion Great improvement (up to 20x) in the Lydia system performance and scalability from using Hadoop. Lydia w/ Hadoop makes new types of automated analysis of web-scale content possible.