Information Extraction: Distilling Structured Data from Unstructured Text. Presenter: Shanshan Lu 03/04/2010
Referenced papers:
Andrew McCallum: Information Extraction: Distilling Structured Data from Unstructured Text. ACM Queue 3(9), November 2005.
Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea: Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Eng. Bull. 23(4): 33-41 (2000).
Example (Information Extraction: Distilling Structured Data from Unstructured Text). Task: build a website that helps people find continuing education opportunities at colleges, universities, and organizations across the country, with support for fielded searches over locations, dates, times, etc. Problem: much of the data was not available in structured form. The only universally available public interfaces were web pages designed for human browsing.
Information extraction: Information extraction is the process of filling the fields and records of a database from unstructured or loosely formatted text.
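To make "filling the fields and records of a database" concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table name, the field names, and the two course records are hypothetical; they stand in for whatever an extraction step would actually produce.

```python
import sqlite3

# Hypothetical records, as an extraction step might produce them.
extracted_courses = [
    {"title": "Intro to Statistics", "location": "Boston", "start_date": "2010-03-04"},
    {"title": "Technical Writing", "location": "Chicago", "start_date": "2010-04-12"},
]

def load_records(records):
    """Store extracted records in an in-memory SQLite table and return the connection."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE courses (title TEXT, location TEXT, start_date TEXT)")
    conn.executemany(
        "INSERT INTO courses VALUES (:title, :location, :start_date)", records
    )
    conn.commit()
    return conn

def courses_in(conn, location):
    """Fielded search: titles of the courses offered at a given location."""
    rows = conn.execute(
        "SELECT title FROM courses WHERE location = ? ORDER BY title", (location,)
    )
    return [title for (title,) in rows]

conn = load_records(extracted_courses)
print(courses_in(conn, "Boston"))
```

Once the text has been turned into rows like these, the fielded searches from the example slide become ordinary SQL queries.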
Information extraction: Information extraction involves five major subtasks: segmentation, classification, association, normalization, and deduplication.
Techniques in information extraction: Some simple extraction tasks can be solved by writing regular expressions. Because web pages change frequently, hand-written regular expressions alone are not sufficient for the information extraction task. Over the past decade there has been a revolution in the use of statistical and machine-learning methods for information extraction.
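As an illustration of the regular-expression approach, the sketch below pulls dates and times out of a course listing. The two formats it recognizes (`MM/DD/YYYY` and `H:MM AM/PM`) and the sample listing are assumptions chosen for the example, not formats taken from the papers.

```python
import re

# Assumed formats: dates like 03/04/2010 and times like 6:30 PM.
DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([AP]M)\b", re.IGNORECASE)

def extract_dates_and_times(text):
    """Return every date and time in `text` that matches the assumed formats."""
    dates = [m.group(0) for m in DATE_RE.finditer(text)]
    times = [m.group(0) for m in TIME_RE.finditer(text)]
    return {"dates": dates, "times": times}

listing = "Evening course starts 03/04/2010 at 6:30 PM; second session 3/11/2010, 7:00 pm."
print(extract_dates_and_times(listing))

# The same patterns silently miss a page that writes the date differently.
print(extract_dates_and_times("Starts March 4, 2010"))
```

The second call returns no dates, which is the brittleness the slide refers to: every new page layout or date style needs another hand-written pattern.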
A machine learning approach (Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach). A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
Contributions: the ability to learn highly accurate extraction rules; verification of the wrapper to ensure that the correct data continues to be extracted; automatic adaptation to changes in the sites from which the data is being extracted.
Learning extraction rules: One of the critical problems in building a wrapper is defining a set of extraction rules that precisely specify how to locate the information on the page. For any given item to be extracted from a page, one needs an extraction rule to locate both the beginning and the end of that item. A key idea underlying the authors' work is that the extraction rules are based on “landmarks” (i.e., groups of consecutive tokens) that enable a wrapper to locate the start and end of the item within the page.
Samples: [figure with the example pages not preserved]
Rules: Start rules: [rule listing not preserved]. End rules are similar to start rules. Disjunctive rules: [rule listing not preserved].
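The landmark idea can be sketched in a few lines of Python. This is a simplified illustration, not the wrapper language from the paper: the tokenizer, the two sample pages, and the rule encoding (a rule is a list of landmarks, a landmark is a list of consecutive tokens) are assumptions. Treating a disjunctive rule as an ordered list of alternatives tried in turn, and searching for the end landmark forward from the start of the item, are also assumptions of this sketch.

```python
import re

def tokenize(page):
    """Split a page into HTML tags, words, and single punctuation marks."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", page)

def find_landmark(tokens, landmark, start):
    """Index of the first occurrence of `landmark` at or after `start`, else None."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None

def apply_start_rule(tokens, landmarks):
    """Skip to each landmark in turn; return the index just past the last one."""
    pos = 0
    for landmark in landmarks:
        i = find_landmark(tokens, landmark, pos)
        if i is None:
            return None
        pos = i + len(landmark)
    return pos

def extract(page, start_alternatives, end_alternatives):
    """Apply the first start rule that matches, then the first end landmark found."""
    tokens = tokenize(page)
    begin = None
    for landmarks in start_alternatives:  # disjunctive start rule (assumed semantics)
        begin = apply_start_rule(tokens, landmarks)
        if begin is not None:
            break
    if begin is None:
        return None
    for landmark in end_alternatives:  # simplified end rule: search forward
        end = find_landmark(tokens, landmark, begin)
        if end is not None:
            return " ".join(tokens[begin:end])
    return None

# Hypothetical pages: the address is in italics on one and in bold on the other.
PAGE_A = "<p>Name: <b>Yala</b></p><p>Address: <i>12 Pico Blvd</i></p>"
PAGE_B = "<p>Name: <b>Roma</b></p><p>Address: <b>7 Main St</b></p>"

START = [[["Address", ":"], ["<i>"]], [["Address", ":", "<b>"]]]
END = [["</i>"], ["</b>"]]

print(extract(PAGE_A, START, END))
print(extract(PAGE_B, START, END))
```

The first alternative handles pages that put the address in italics; when it fails on the second page, the second alternative takes over, which is the role a disjunctive rule plays when a source uses more than one format.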
STALKER to learn rules: STALKER is a hierarchical wrapper induction algorithm that learns extraction rules from examples labeled by the user. STALKER requires no more than 10 examples because of the fixed web page format and the hierarchical structure. STALKER exploits the hierarchical structure of the source to constrain the learning problem.
STALKER to learn rules: For instance, instead of using one complex rule that extracts all restaurant names, addresses, and phone numbers from a page, they take a hierarchical approach:
First, apply a rule that extracts the whole list of restaurants from the page;
then use another rule to break the list into tuples that correspond to individual restaurants;
finally, from each such tuple, extract the name, address, and phone number of the corresponding restaurant.
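The three-level decomposition above can be sketched as three small functions, one per level. The page markup and the hand-written regular expressions are assumptions made for illustration; in STALKER the rule at each level would be learned from labeled examples rather than written by hand.

```python
import re

# Hypothetical page: a list of restaurants surrounded by unrelated content.
PAGE = """<html><h1>Restaurants</h1><ul>
<li><b>Yala</b> <i>12 Pico Blvd</i> (310) 555-0100</li>
<li><b>Cafe Roma</b> <i>7 Main St</i> (213) 555-0199</li>
</ul><p>Contact us</p></html>"""

def extract_list(page):
    """Level 1: isolate the whole list of restaurants."""
    m = re.search(r"<ul>(.*?)</ul>", page, re.DOTALL)
    return m.group(1) if m else ""

def split_tuples(list_html):
    """Level 2: break the list into one chunk per restaurant."""
    return re.findall(r"<li>(.*?)</li>", list_html, re.DOTALL)

def extract_fields(tuple_html):
    """Level 3: pull the individual fields out of a single restaurant chunk."""
    name = re.search(r"<b>(.*?)</b>", tuple_html)
    address = re.search(r"<i>(.*?)</i>", tuple_html)
    phone = re.search(r"\(\d{3}\) \d{3}-\d{4}", tuple_html)
    return {
        "name": name.group(1) if name else None,
        "address": address.group(1) if address else None,
        "phone": phone.group(0) if phone else None,
    }

restaurants = [extract_fields(t) for t in split_tuples(extract_list(PAGE))]
for r in restaurants:
    print(r)
```

Each rule only has to be correct within the region handed to it by the level above, which is why the hierarchy makes the individual rules simpler.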
STALKER to learn rules: Learning a start rule for address: First, STALKER selects an example, say E4, to guide the search. Second, it generates a set of initial candidates, which are rules that consist of a single 1-token landmark; these landmarks are chosen so that they match the token that immediately precedes the beginning of the address in the guiding example.
STALKER to learn rules: Learning a start rule for address: Because R6 has a better generalization potential, STALKER selects R6 for further refinements. While refining R6, STALKER creates, among others, the new candidates R7, R8, R9, and R10.
STALKER to learn rules: Learning a start rule for address: As R10 works correctly on all four examples, STALKER stops the learning process and returns R10. Result of STALKER: In an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules. It learned 182 perfect rules (100% accurate) and another 18 rules that had an accuracy of at least 90%. In other words, only 3% of the learned rules were less than 90% accurate.
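The generate-and-refine loop can be illustrated with a heavily simplified sketch. It is not the STALKER algorithm: it assumes a rule is a single landmark, it refines a candidate only by extending the landmark one token to the left, and it uses no wildcards or disjunctions. The two token sequences and their labeled start positions are hypothetical.

```python
def find_landmark(tokens, landmark):
    """Index of the first occurrence of `landmark` in `tokens`, else None."""
    n = len(landmark)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i
    return None

def apply_start_rule(tokens, landmark):
    """Predicted start of the item: the index just past the first landmark match."""
    i = find_landmark(tokens, landmark)
    return None if i is None else i + len(landmark)

def learn_start_rule(examples, max_len=5):
    """Grow a landmark leftwards from the guiding example until it fits every example."""
    guide_tokens, guide_start = examples[0]  # first example guides the search
    for k in range(1, min(max_len, guide_start) + 1):
        candidate = guide_tokens[guide_start - k:guide_start]
        if all(apply_start_rule(toks, candidate) == start for toks, start in examples):
            return candidate
    return None

# Hypothetical labeled examples: (tokens, index where the address starts).
EXAMPLES = [
    (["Name", ":", "<b>", "Yala", "</b>",
      "Address", ":", "<i>", "12", "Pico", "Blvd", "</i>"], 8),
    (["Name", ":", "<b>", "Roma", "</b>", "Cuisine", ":", "<i>", "Thai", "</i>",
      "Address", ":", "<i>", "7", "Main", "St", "</i>"], 13),
]

print(learn_start_rule(EXAMPLES))
```

The 1-token candidate `<i>` is correct on the first example but stops too early on the second, where the cuisine is also in italics, so the search keeps refining until the landmark `Address : <i>` covers both examples.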
Identifying highly informative examples: The most informative examples illustrate exceptional cases. The authors developed an active learning approach called co-testing that analyzes the set of unlabeled examples to automatically select examples for the user to label.
Identifying highly informative examples: Basic idea: after the user labels one or two examples, the system learns both a forward and a backward rule. Then it runs both rules on a given set of unlabeled pages. Whenever the rules disagree on an example, the system selects that example for the user to label next. Co-testing makes it possible to generate accurate extraction rules with a very small number of labeled examples.
Identifying highly informative examples: Assume that the initial training set consists of E1 and E2, while E3 and E4 are not labeled. Based on these examples, the system learns the rules: [rule listing not preserved]
Identifying highly informative examples: The authors applied co-testing to the 24 tasks on which STALKER failed to learn perfect rules. The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2%. Furthermore, 10 of the learned rules were 100% accurate, while another 11 rules were at least 90% accurate.
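The selection step of co-testing can be sketched as follows. The two rules here are deliberately simple stand-ins: each uses a single-token landmark, one scanning from the start of the page and one from the end. That representation and the two unlabeled pages are assumptions of this example; only the idea of querying the user where the two rules disagree comes from the slides.

```python
def forward_start(tokens, landmark):
    """Scan from the beginning; return the index just after the first landmark token."""
    for i, tok in enumerate(tokens):
        if tok == landmark:
            return i + 1
    return None

def backward_start(tokens, landmark):
    """Scan from the end; return the index just after the last landmark token."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == landmark:
            return i + 1
    return None

def select_queries(unlabeled_pages, fwd_landmark, bwd_landmark):
    """Names of the pages on which the forward and backward rules disagree."""
    return [
        name
        for name, tokens in unlabeled_pages.items()
        if forward_start(tokens, fwd_landmark) != backward_start(tokens, bwd_landmark)
    ]

# Hypothetical unlabeled pages; on E4 the cuisine is also in italics.
UNLABELED = {
    "E3": ["Name", ":", "Roma", "Address", ":", "<i>", "7", "Main", "St", "</i>"],
    "E4": ["Cuisine", ":", "<i>", "Thai", "</i>",
           "Address", ":", "<i>", "9", "Oak", "Ave", "</i>"],
}

print(select_queries(UNLABELED, "<i>", "<i>"))
```

On E3 both rules point at the same token, so labeling it would teach the system nothing new. On E4 they disagree, so at least one rule must be wrong there, and that page is the one worth showing to the user.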
Verifying the extracted data: Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing the patterns of the data returned to the learned statistical distribution. When a significant difference is found, an operator can be notified or the wrapper repair process can be launched automatically.
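One way to picture this check is the sketch below. It describes each extracted value by the sequence of its token types, learns how often each sequence occurs in known-good data, and flags the wrapper when newly extracted values follow a very different distribution. The token types, the use of total variation distance, and the 0.5 threshold are all assumptions of this example rather than the method from the paper.

```python
import re
from collections import Counter

def token_type(token):
    """Coarse token class (assumed for this sketch)."""
    if token.isdigit():
        return "NUM"
    if token[0].isupper():
        return "CAP"
    return "OTHER"

def pattern(value):
    """Sequence of token types for one extracted field value."""
    return tuple(token_type(t) for t in re.findall(r"\w+|[^\w\s]", value))

def pattern_distribution(values):
    """Relative frequency of each pattern among the given values."""
    counts = Counter(pattern(v) for v in values)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def wrapper_looks_broken(training_values, new_values, threshold=0.5):
    """Flag the wrapper when new data drifts too far from the learned patterns."""
    learned = pattern_distribution(training_values)
    observed = pattern_distribution(new_values)
    return total_variation(learned, observed) > threshold

# Hypothetical address values extracted while the wrapper was known to work.
TRAINING = ["12 Pico Blvd", "7 Main St", "400 Oak Ave"]

print(wrapper_looks_broken(TRAINING, ["9 Elm Rd", "31 Lake Dr"]))
print(wrapper_looks_broken(TRAINING, ["<b>Yala</b>", "Phone :"]))
```

The first batch has the same number-then-capitalized-words shape as the training addresses and passes. The second batch contains markup and label text, which is the kind of output a wrapper produces after a site redesign, and it is flagged.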
Automatically repairing wrappers: Locate correct examples of the data field on the new pages. Re-label the new pages automatically. Re-run the labeled and re-labeled examples through STALKER to produce the correct rules for the site.
How to locate the correct examples? Each new page is scanned to identify all text segments that begin with one of the starting patterns and end with one of the ending patterns. These segments are called candidates. The candidates are then clustered to identify subgroups that share common features (relative position on the page, adjacent landmarks, and whether the candidate is visible to the user). Each group is then given a score based on how similar it is to the training examples. The highest-ranked group is expected to contain the correct examples of the data field.
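The candidate search, grouping, and scoring steps can be sketched as below. This is a toy version under stated assumptions: patterns are single token types, candidates are capped at four tokens, groups are formed from the tokens immediately to the left and right of a candidate (standing in for "adjacent landmarks"), and a group's score is the fraction of its candidates whose type sequence was seen in training. The page and the training pattern are hypothetical.

```python
from collections import defaultdict

def token_type(token):
    """Coarse token class (assumed for this sketch)."""
    if token.isdigit():
        return "NUM"
    if token[0].isupper():
        return "CAP"
    return "OTHER"

def find_candidates(tokens, start_types, end_types, max_len=4):
    """Spans (i, j) whose first token has a starting type and last token an ending type."""
    candidates = []
    for i, tok in enumerate(tokens):
        if token_type(tok) not in start_types:
            continue
        for j in range(i, min(i + max_len, len(tokens))):
            if token_type(tokens[j]) in end_types:
                candidates.append((i, j + 1))
    return candidates

def cluster(tokens, candidates):
    """Group candidates that share the same neighbouring tokens."""
    groups = defaultdict(list)
    for i, j in candidates:
        left = tokens[i - 1] if i > 0 else None
        right = tokens[j] if j < len(tokens) else None
        groups[(left, right)].append((i, j))
    return groups

def score(tokens, group, training_patterns):
    """Fraction of the group's candidates whose type sequence appeared in training."""
    hits = sum(
        tuple(token_type(t) for t in tokens[i:j]) in training_patterns
        for i, j in group
    )
    return hits / len(group)

def locate_examples(tokens, start_types, end_types, training_patterns):
    """Text of the candidates in the highest-scoring group (empty list if none)."""
    groups = cluster(tokens, find_candidates(tokens, start_types, end_types))
    if not groups:
        return []
    best = max(groups.values(), key=lambda g: score(tokens, g, training_patterns))
    return [" ".join(tokens[i:j]) for i, j in best]

# Hypothetical page after a redesign, already tokenized.
NEW_PAGE = ["Name", ":", "<b>", "Yala", "</b>", "Tel", ":", "555", "0100",
            "Address", ":", "<i>", "12", "Pico", "Blvd", "</i>"]

# Learned from old addresses such as "7 Main St": starts with a number, ends capitalized.
print(locate_examples(NEW_PAGE, {"NUM"}, {"CAP"}, {("NUM", "CAP", "CAP")}))
```

The phone digits also start with a number, so they produce candidates too, but only the span between `<i>` and `</i>` matches the shape of the training addresses. That span could then be used as an automatically labeled example for re-learning the rules.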
Upcoming trends and capabilities (Information Extraction: Distilling Structured Data from Unstructured Text). Combining IE and data mining makes it possible to perform text mining as well as to improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE.
Upcoming trends and capabilities: SQL --> Database
Information extraction, the Web and the future. The second half of the Internet revolution: machine access to this immense knowledge base.
Information extraction, the Web and the future. In web search there will be a transition from keyword search on documents to higher-level queries: queries where the search hits will be objects, such as people or companies, instead of simply documents; queries that are structured and return information that has been integrated and synthesized from multiple pages; queries that are stated as natural language questions (“Who were the first three female U.S. Senators?”) and answered with succinct responses.


Information Extraction

  • 1. Information Extraction: Distilling Structured Data from Unstructured Text. Presenter: Shanshan Lu 03/04/2010
  • 2. Referenced paper Andrew McCallum: Information Extraction: Distilling Structured Data from Unstructured Text. ACM Queue, volume 3, Number 9, November 2005.  Craig A. Knoblock, Kristina Lerman, Steven Minton, Ion Muslea: Accurately and Reliably Extracting Data from the Web: A Machine Learning Approach. IEEE Data Eng. Bull. 23(4): 33-41 (2000)
  • 3. Example Information Extraction: Distilling Structured Data from Unstructured Text Task: try to build a website to help people find continuing education opportunities at colleges, universities, and organization across the country, to support field searches over locations, dates, times etc. Problem: much of the data was not available in structured form. The only universally available public interfaces were web pages designed for human browsing.
  • 4. Information Extraction: Distilling Structured Data from Unstructured Text
  • 5. Information extraction Information Extraction: Distilling Structured Data from Unstructured Text Information extraction is the process of filling the fields and records of a database from unstructured or loosely formatted text.
  • 6.
  • 7. Information extraction. Information extraction involves five major subtasks: segmentation, classification, association, normalization, and deduplication.
  • 8. Techniques in information extraction. Some simple extraction tasks can be solved by writing regular expressions. Because web pages change frequently, hand-written regular expressions alone are not sufficient for the information extraction task. Over the past decade there has been a revolution in the use of statistical and machine-learning methods for information extraction.
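To make the regular-expression approach on slide 8 concrete, here is a minimal Python sketch. The course-listing layout, the field names, and the function `extract_courses` are assumptions made up for illustration; they do not come from either referenced paper.

```python
import re

# Hypothetical course listing; the " - " separated layout is an assumption.
TEXT = """
Intro to Statistics - Boston, MA - 03/15/2010 - 6:30 PM
Creative Writing - Austin, TX - 04/02/2010 - 9:00 AM
"""

# One named group per database field.
COURSE = re.compile(
    r"^(?P<title>.+?) - (?P<city>[A-Za-z .]+), (?P<state>[A-Z]{2})"
    r" - (?P<date>\d{2}/\d{2}/\d{4}) - (?P<time>\d{1,2}:\d{2} [AP]M)$",
    re.MULTILINE,
)

def extract_courses(text):
    """Return one dict per line that matches the expected layout."""
    return [m.groupdict() for m in COURSE.finditer(text)]

records = extract_courses(TEXT)

# The same listing after a hypothetical site redesign: nothing matches any more.
redesigned = "Intro to Statistics | Boston, MA | 03/15/2010 | 6:30 PM"
broken = extract_courses(redesigned)
```

The second call returns an empty list, which illustrates why a pattern tied to one page layout stops working as soon as the page changes.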
  • 9. A Machine Learning Approach (Knoblock et al., 2000). A wrapper is a piece of software that enables a semi-structured Web source to be queried as if it were a database.
  • 10. Contributions. (1) The ability to learn highly accurate extraction rules; (2) verification of the wrapper to ensure that the correct data continues to be extracted; (3) automatic adaptation to changes in the sites from which the data is extracted.
  • 11. Learning extraction rules. One of the critical problems in building a wrapper is defining a set of extraction rules that precisely define how to locate the information on the page. For any given item to be extracted from a page, one needs an extraction rule to locate both the beginning and the end of that item. A key idea underlying the work is that the extraction rules are based on “landmarks” (i.e., groups of consecutive tokens) that enable a wrapper to locate the start and end of the item within the page.
  • 12. Samples
  • 13. Rules. Start rules; end rules, which are similar to start rules; and disjunctive rules.
  • 14. STALKER to learn rules. STALKER is a hierarchical wrapper induction algorithm that learns extraction rules from examples labeled by the user. It requires no more than 10 examples, because of the fixed web page format and the hierarchical structure. STALKER exploits the hierarchical structure of the source to constrain the learning problem.
  • 15.
  • 16. Then use another rule to break the list into tuples that correspond to individual restaurants.
  • 17.
  • 18. STALKER to learn rules. Learning a start rule for address: first, STALKER selects an example, say E4, to guide the search. Second, it generates a set of initial candidates, which are rules that consist of a single 1-token landmark; these landmarks are chosen so that they match the token that immediately precedes the beginning of the address in the guiding example.
  • 19. STALKER to learn rules. Learning a start rule for address: because R6 has better generalization potential, STALKER selects R6 for further refinement. While refining R6, STALKER creates, among others, the new candidates R7, R8, R9, and R10.
  • 20. STALKER to learn rules. Learning a start rule for address: as R10 works correctly on all four examples, STALKER stops the learning process and returns R10. Result of STALKER: in an empirical evaluation on 28 sources, STALKER had to learn 206 extraction rules. It learned 182 perfect rules (100% accurate) and another 18 rules with an accuracy of at least 90%. In other words, only 3% of the learned rules (6 of 206) were less than 90% accurate.
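The idea of a landmark-based start rule, and of checking a candidate rule against the labeled examples, can be sketched in Python as follows. This is an illustration under stated assumptions, not the STALKER algorithm itself: the tokenizer, the example pages, the rule names `RULE_A` and `RULE_B`, and the helper functions are all hypothetical, and the search that generates and refines candidates is omitted.

```python
import re

def tokenize(page):
    # Assumption: HTML tags, words/numbers, and punctuation marks are single tokens.
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", page)

def apply_start_rule(tokens, landmarks):
    """Skip to each landmark in turn; return the index right after the last one."""
    pos = 0
    for landmark in landmarks:
        n = len(landmark)
        for i in range(pos, len(tokens) - n + 1):
            if tokens[i:i + n] == landmark:
                pos = i + n
                break
        else:
            return None  # landmark not found: the rule fails on this page
    return pos

def rule_works(rule, page, item):
    """True if the rule stops exactly where the labeled item begins."""
    tokens, target = tokenize(page), tokenize(item)
    pos = apply_start_rule(tokens, rule)
    return pos is not None and tokens[pos:pos + len(target)] == target

def rule_accuracy(rule, examples):
    return sum(rule_works(rule, p, item) for p, item in examples) / len(examples)

# Hypothetical labeled examples: (page, address to extract).
EXAMPLES = [
    ("Name: <b>Yala</b><p>Cuisine: Thai<p>Address: <i>4000 Colfax, Phoenix</i>",
     "4000 Colfax, Phoenix"),
    ("Name: <b>Kiku</b><p>Cuisine: Sushi<p>Address: <i>12 Pico St., Venice</i>",
     "12 Pico St., Venice"),
    ("Name: <b>Luna</b><p>Cuisine: <i>Italian</i><p>Address: <i>523 Vernon, Los Angeles</i>",
     "523 Vernon, Los Angeles"),
]

RULE_A = [["<i>"]]               # a single 1-token landmark
RULE_B = [["Address"], ["<i>"]]  # a refinement with an extra landmark

accuracy_a = rule_accuracy(RULE_A, EXAMPLES)
accuracy_b = rule_accuracy(RULE_B, EXAMPLES)
```

In this toy data the single-landmark rule fails on the third page, where the cuisine is also italicized, while the refined rule covers all three examples. That is the kind of situation in which a learner would keep refining a candidate until it works on every labeled example.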
  • 21. Identifying highly informative examples. The most informative examples illustrate exceptional cases. The authors developed an active learning approach called co-testing that analyzes the set of unlabeled examples to automatically select examples for the user to label.
  • 22. Identifying highly informative examples. Basic idea: after the user labels one or two examples, the system learns both a forward and a backward rule. It then runs both rules on a given set of unlabeled pages. Whenever the rules disagree on an example, the system treats it as an example for the user to label next. Co-testing makes it possible to generate accurate extraction rules with a very small number of labeled examples.
  • 23. Identifying highly informative examples. Assume that the initial training set consists of E1 and E2, while E3 and E4 are not labeled. Based on these examples, the system learns a forward rule and a backward rule.
  • 24. Identifying highly informative examples. Co-testing was applied to the 24 tasks on which STALKER failed to learn perfect rules. The results were excellent: the average accuracy over all tasks improved from 85.7% to 94.2%. Furthermore, 10 of the learned rules were 100% accurate, while another 11 rules were at least 90% accurate.
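The query-selection step of co-testing described on slide 22 can be sketched as follows. The sketch assumes single-token landmarks and uses hypothetical pages and function names; learning the two rules from the labeled examples is left out, and only the "run both rules, ask the user where they disagree" step is shown.

```python
import re

def tokenize(page):
    # Assumption: HTML tags, words/numbers, and punctuation marks are single tokens.
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", page)

def forward_rule(tokens, landmark):
    """Scan from the start of the page; the item begins after the first landmark."""
    for i, tok in enumerate(tokens):
        if tok == landmark:
            return i + 1
    return None

def backward_rule(tokens, landmark):
    """Scan from the end of the page; the item begins after the last landmark."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i] == landmark:
            return i + 1
    return None

def select_queries(unlabeled_pages, fwd_landmark, bwd_landmark):
    """Return the pages on which the two rules disagree about the item start."""
    queries = []
    for page in unlabeled_pages:
        tokens = tokenize(page)
        if forward_rule(tokens, fwd_landmark) != backward_rule(tokens, bwd_landmark):
            queries.append(page)
    return queries

# Hypothetical unlabeled pages; only the second italicizes the cuisine as well.
PAGE_PLAIN = "Name: <b>Yala</b><p>Cuisine: Thai<p>Address: <i>4000 Colfax, Phoenix</i>"
PAGE_ODD = "Name: <b>Luna</b><p>Cuisine: <i>Italian</i><p>Address: <i>523 Vernon, Los Angeles</i>"

to_label = select_queries([PAGE_PLAIN, PAGE_ODD], "<i>", "<i>")
```

Both rules agree on the ordinary page and disagree on the exceptional one, so only the exceptional page is handed to the user for labeling.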
  • 25. Verifying the extracted data. Since the information for even a single field can vary considerably, the system learns the statistical distribution of the patterns for each field. Wrappers can be verified by comparing the patterns of the data returned to the learned statistical distribution. When a significant difference is found, an operator can be notified or the wrapper repair process can be launched automatically.
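One way to picture the verification step is the Python sketch below. It is a simplified stand-in, not the method from the paper: the coarse token classes, the use of the first two tokens as the "pattern", the total variation distance as the comparison, and the threshold of 0.5 are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def token_class(tok):
    # Assumed coarse token classes.
    if tok.startswith("<"):
        return "HTML"
    if tok.isdigit():
        return "NUM"
    if tok.isalpha() and tok[0].isupper():
        return "CAP"
    return "OTHER"

def start_pattern(value):
    """Pattern of a field value: the classes of its first two tokens."""
    tokens = re.findall(r"<[^>]+>|\w+|[^\w\s]", value)
    return " ".join(token_class(t) for t in tokens[:2])

def pattern_distribution(values):
    counts = Counter(start_pattern(v) for v in values)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

def total_variation(p, q):
    """0.0 for identical distributions, 1.0 for disjoint ones."""
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def wrapper_looks_broken(training_values, new_values, threshold=0.5):
    # The threshold is a hypothetical cut-off for "significant difference".
    learned = pattern_distribution(training_values)
    observed = pattern_distribution(new_values)
    return total_variation(learned, observed) > threshold

# Hypothetical address values.
TRAINING = ["4000 Colfax, Phoenix", "12 Pico St., Venice", "523 Vernon, Los Angeles"]
GOOD_RUN = ["77 Main St., Tempe", "9 Ocean Ave., Venice"]
BAD_RUN = ["<b>Luna</b>", "Cuisine: Thai"]

good_flag = wrapper_looks_broken(TRAINING, GOOD_RUN)
bad_flag = wrapper_looks_broken(TRAINING, BAD_RUN)
```

A run that still returns address-like values is left alone, while a run that returns tag fragments or other fields is flagged, at which point an operator could be notified or a repair step started.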
  • 26. Automatically repairing wrappers. (1) Locate correct examples of the data field on the new pages. (2) Re-label the new pages automatically. (3) Re-run the labeled and re-labeled examples through STALKER to produce the correct rules for the site.
  • 27. How to locate the correct examples? Each new page is scanned to identify all text segments that begin with one of the starting patterns and end with one of the ending patterns. These segments are called candidates. The candidates are then clustered to identify subgroups that share common features (relative position on the page, adjacent landmarks, and whether they are visible to the user). Each group is then given a score based on how similar it is to the training examples. The highest-ranked group is expected to contain the correct examples of the data field.
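The candidate search on slide 27 can be sketched in Python as below. Everything concrete here is an assumption for illustration: candidates are taken to be text runs between tags, the only clustering feature is the tag that precedes a candidate, and the similarity score compares average token counts with the training examples. The paper's actual features and scoring are not reproduced.

```python
import re
from collections import defaultdict

def token_class(tok):
    if tok.isdigit():
        return "NUM"
    if tok.isalpha() and tok[0].isupper():
        return "CAP"
    return "OTHER"

def describe(text):
    """Return (class of first token, class of last token, token count)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return token_class(tokens[0]), token_class(tokens[-1]), len(tokens)

def text_segments(page):
    """Yield (preceding_tag, text) for each non-empty text run that follows a tag."""
    for m in re.finditer(r"(<[^>]+>)([^<]+)", page):
        text = m.group(2).strip()
        if text:
            yield m.group(1), text

def locate_examples(page, training_values):
    """Return the candidate group that looks most like the training examples."""
    train = [describe(v) for v in training_values]
    starts = {s for s, _, _ in train}
    ends = {e for _, e, _ in train}
    mean_len = sum(n for _, _, n in train) / len(train)

    # Candidates: segments with a known starting and ending pattern,
    # clustered by their adjacent landmark (the preceding tag).
    groups = defaultdict(list)
    for tag, text in text_segments(page):
        start, end, _ = describe(text)
        if start in starts and end in ends:
            groups[tag].append(text)
    if not groups:
        return []

    def score(values):
        # Assumed similarity: closeness of the average token count.
        group_len = sum(describe(v)[2] for v in values) / len(values)
        return 1.0 / (1.0 + abs(group_len - mean_len))

    return max(groups.values(), key=score)

# Hypothetical training addresses and a page from the redesigned site.
TRAINING = ["4000 Colfax, Phoenix", "77 Main St., Tempe"]
NEW_PAGE = (
    "<h2>Luna</h2><span>Italian</span><em>523 Vernon, Los Angeles</em>"
    "<h2>Kiku</h2><span>Sushi</span><em>12 Pico St., Venice</em><span>2 Stars</span>"
)

relabeled = locate_examples(NEW_PAGE, TRAINING)
```

The values returned by `locate_examples` play the role of the automatically re-labeled examples, which could then be fed back into rule learning.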
  • 28. Automatically repairing wrappers
  • 29. Upcoming trends and capabilities. Combining IE with data mining makes it possible to perform text mining and also to improve the performance of the underlying extraction system. Rules mined from a database extracted from a corpus of texts are used to predict additional information to extract from future documents, thereby improving the recall of IE.
  • 30. Upcoming trends and capabilities. SQL --> Database
  • 31. Information extraction, the Web and the future. The second half of the Internet revolution: machine access to this immense knowledge base.
  • 32. Information extraction, the Web and the future. In web search there will be a transition from keyword search on documents to higher-level queries: queries where the search hits are objects, such as people or companies, instead of simply documents; queries that are structured and return information that has been integrated and synthesized from multiple pages; and queries that are stated as natural language questions (“Who were the first three female U.S. Senators?”) and answered with succinct responses.
  • 33. Thank you! Any questions?