This talk discusses declarative approaches to information extraction, which address the limitations of traditional rule-based and machine-learning-based methods. Declarative approaches use a declarative language and programming model to specify extraction tasks, which in turn enables scalable infrastructure and development support. The talk covers how declarative information extraction allows scalable processing and what development tools it provides, and concludes with questions.
To update the collection-centric model, add an auxiliary index and an annotation store.
Each extraction result is stored with its source document and its positions within that document.
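To make the annotation store concrete, here is a minimal sketch. The names (Annotation, AnnotationStore, covered_text) are hypothetical and chosen for illustration only; the layout of the actual system is not described in these notes. It only shows the idea that every result carries a reference to its source document plus begin/end offsets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One extraction result: where it came from and what it covers."""
    doc_id: str   # identifier of the source document
    type: str     # annotation type, e.g. "Phone"
    begin: int    # character offset where the span starts (inclusive)
    end: int      # character offset where the span ends (exclusive)


class AnnotationStore:
    """Hypothetical in-memory store, indexed by document and by type."""

    def __init__(self):
        self._by_doc = defaultdict(list)
        self._by_type = defaultdict(list)

    def add(self, ann):
        self._by_doc[ann.doc_id].append(ann)
        self._by_type[ann.type].append(ann)

    def for_doc(self, doc_id):
        return list(self._by_doc.get(doc_id, []))

    def of_type(self, ann_type):
        return list(self._by_type.get(ann_type, []))


def covered_text(ann, documents):
    """Recover the annotated text from the source document and offsets."""
    return documents[ann.doc_id][ann.begin:ann.end]


documents = {"d1": "Call Alice at 555-1234"}
store = AnnotationStore()
store.add(Annotation("d1", "Phone", 14, 22))
```

Because only offsets are stored, the annotated text itself is never duplicated; it is looked up in the source document on demand.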
Basically:
Convert each JAPE rule into a relational calculus expression, i.e., a big self-join over a table of <word, position> pairs
Generate an efficient join plan, using (inverted) index access when possible
Some parts still require going back to the document; we want these high in the operator graph
At a high level, the optimization strategy is very similar to the one in System R, but with novel access methods, novel join algorithms, and a 2-dimensional cost model
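As a rough illustration of the self-join idea, the sketch below matches a sequence of words by joining copies of a <word, position> table on "same document, consecutive positions", using an inverted index as the access path. This is an assumption-laden toy, not the actual JAPE translation or optimizer: the function names are hypothetical, tokenization is plain whitespace splitting, and the only "plan choice" is to drive the join from the rarest word.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each word to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.split()):
            index[word].add((doc_id, pos))
    return index


def match_sequence(index, words):
    """Return sorted (doc_id, start_pos) pairs where `words` occur consecutively.

    Conceptually a self-join: one copy of the <word, position> table per
    word, joined on equal doc_id and position_i = start + i.
    """
    if not words:
        return []
    postings = [index.get(w, set()) for w in words]
    # Drive the join from the shortest posting list, so the fewest
    # candidates are probed against the other lists.
    driver = min(range(len(words)), key=lambda i: len(postings[i]))
    results = []
    for doc_id, pos in postings[driver]:
        start = pos - driver
        if all((doc_id, start + i) in postings[i] for i in range(len(words))):
            results.append((doc_id, start))
    return sorted(results)


docs = {
    "d1": "I love New York and New Jersey",
    "d2": "York is old",
}
index = build_inverted_index(docs)
matches = match_sequence(index, ["New", "York"])
```

Here no document text is touched at match time; every lookup goes through the index. Predicates that cannot be answered from the index are the ones that would need to go back to the document.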
The document-centric model enables embedding SystemT in a wide variety of applications.
For instance, in Lotus Notes, when a user opens an email, the message is simultaneously sent to the SystemT runtime, which generates annotations on the fly.
When the email is displayed to the user, the newly generated annotations are displayed as well.
SystemT can also be embedded as a map job in a MapReduce framework, which allows the system to scale up and process large volumes of documents.
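The reason the document-centric model fits a map job is that annotating one document needs no state from any other document. The sketch below shows that shape only. It does not use the SystemT API: annotate is a hypothetical stand-in extractor (a simple phone-number regex), and run_map_phase simulates the map phase locally rather than on a real MapReduce cluster.

```python
import re

# Stand-in extractor: a 3-digit, dash, 4-digit phone pattern (assumption).
PHONE = re.compile(r"\b\d{3}-\d{4}\b")


def annotate(text):
    """Document-centric call: one document in, its annotations out."""
    return [("Phone", m.start(), m.end()) for m in PHONE.finditer(text)]


def map_document(doc_id, text):
    """Map function: emit (doc_id, annotation) pairs for a single document."""
    for ann_type, begin, end in annotate(text):
        yield doc_id, (ann_type, begin, end)


def run_map_phase(docs):
    """Local simulation of the map phase over a small collection."""
    output = []
    for doc_id, text in docs.items():
        output.extend(map_document(doc_id, text))
    return output


emails = {
    "m1": "Call 555-1234 now",
    "m2": "no numbers here",
}
map_output = run_map_phase(emails)
```

The same annotate function could serve the interactive case (one email, annotated when opened) and the batch case (many documents, one map call each), which is the point of keeping the runtime document-at-a-time.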