Architecture of ContentMine Components contentmine.org

Architecture of TheContentMine
These slides are for enlightenment and presentations. For up-to-date
information, see http://discuss.contentmine.org/t/overall-architecture/142.
Questions, comments and critiques welcome! All software is Open
(BSD/Apache2).
Some diagrams are autogenerated from *.dot files, which are located in
the projects (mainly Norma and AMI).

[Diagram: overall ContentMine architecture. Labels recovered from the slide:]
• Sources: EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; publisher sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• Discovery: catalogue; getpapers (query); daily crawl
• Inputs: URLs, DOIs; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• quickscrape: crawl, scrapers
• norma: Normalizer, Structurer, Semantic Tagger
• Normalized content: text, data, figures; abstract, methods, references, captioned figures (Fig. 1), HTML tables; Semantic ScholarlyHTML
• ami: search, lookup, plugins; facts
• Content mining: Chem, Phylo, Trials, Crystal, Plants
• Community: plugins, scrapers, queries, taggers; visualization and analysis
• Throughput: 30,000 pages/day
Latest 20150908

[Diagram: quickscrape, Norma, AMI (Index & Transform). Labels recovered from the slide:]
• Discovery: CAT-alogue index; getpapers query; daily crawl/feed; EuPMC; JToCs; titles + links
• Inputs: PDF, XML, URL, DOI, DOC, CSV
• quickscrape: bespoke scrapers (XPath)
• Norma: per-journal taggers; BadHTML, OCR, diagrams; sHTML
• AMI: plugins for Sequences, Species, Chemistry, Phylogenetics, Plants
Latest 20150908; limited in scope

Starting points for ingestion (getpapers/quickscrape/Norma)
• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image) (good)
• PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG) (variable)
20150908
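The two ingestion routes on this slide can be written down as plain data, which makes it easy to ask which route applies to a given starting point. The following is only a sketch: the `Route` class and the `route_for` and `describe` helpers are hypothetical names introduced here, and the stage strings are copied from the slide rather than taken from any ContentMine API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One ingestion route from the slide (illustrative model only)."""
    starts_from: frozenset   # kinds of starting input
    stages: tuple            # stage labels, in order, as written on the slide
    quality: str             # the slide's own rating: "good" or "variable"


ROUTES = (
    Route(
        frozenset({"PMCID", "DOI", "URL"}),
        ("quickscrape", "CTree(PDF,HTML,XML,images/,meta)",
         "Norma", "CMDir(sHTML|TXT|SVG|image)"),
        "good",
    ),
    Route(
        frozenset({"PDF", "XML", "TXT", "HTML"}),
        ("Norma", "CTree(PDF,rawHTML,TXT,images/,meta?)",
         "NormaOCR|TXT2HTML", "CTree(sHTML,TXT,SVG)"),
        "variable",
    ),
)


def route_for(kind: str) -> Route:
    """Return the slide's route for a starting input such as 'DOI' or 'PDF'."""
    key = kind.upper()
    for route in ROUTES:
        if key in route.starts_from:
            return route
    raise ValueError(f"no ingestion route listed for {kind!r}")


def describe(kind: str) -> str:
    """Render the route as an arrow chain ending with the quality rating."""
    route = route_for(kind)
    return " -> ".join((kind.upper(),) + route.stages) + f" [{route.quality}]"


print(describe("doi"))
print(describe("pdf"))
```

Keeping the routes as data rather than as branching code means a third starting point could be added without touching the lookup logic.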

Norma Conversions
• Paper -> Scanned -> TIFF (avoid)
• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG (fast, variable)
• PDF -> PDF2SVG-N -> sHTML, SVG, images/ (slow, accurate-ish)
• PDF -> PDF2TXT-N -> TXT (fast, variable)
• PDF -> PDF2Image-N -> PNG (fast, accurate)
20150908
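The conversion list above is effectively a lookup table from (input format, output format) to a converter with a speed and accuracy rating. A minimal sketch of that table follows; the `Converter` type and the `candidates` helper are hypothetical, and the entries simply restate the slide. Nothing here invokes Norma itself.

```python
from typing import NamedTuple


class Converter(NamedTuple):
    """One row of the conversion list (illustrative model only)."""
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# Entries restate the slide; the "Paper -> Scanned -> TIFF (avoid)" row
# is omitted because it names no converter.
CONVERTERS = (
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
)


def candidates(source: str, target: str) -> list:
    """List the converters that accept `source` and can produce `target`."""
    return [c for c in CONVERTERS
            if source in c.inputs and target in c.outputs]


for conv in candidates("PDF", "SVG"):
    print(f"{conv.name}: {conv.speed}, {conv.accuracy}")
```

For PDF to SVG the table offers two choices with opposite trade-offs, which is the decision the slide's ratings are meant to inform.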

Norma End points
• Norma -> CTree(OpenSHTML-SVG) -> everything?
• Norma -> CTree(sHTML, sections) -> AMI -> (all text + species, chemText, sequences)
• Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?
• Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
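To make the unsectioned-TXT end point concrete, here is a small sketch of what a bag-of-words count and a regex search for identifiers look like over plain text. It illustrates the general techniques only and is not AMI's implementation: the function names, the stopword list, the PMCID pattern and the sample sentence are all assumptions made up for this example.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+")
# Assumed identifier pattern: "PMC" followed by digits.
PMCID = re.compile(r"\bPMC\d+\b")
# Tiny illustrative stopword list, not a real one.
STOPWORDS = frozenset({"the", "of", "and", "in", "a", "see"})


def bag_of_words(text: str) -> Counter:
    """Count lower-cased alphabetic tokens, skipping stopwords."""
    words = (w.lower() for w in WORD.findall(text))
    return Counter(w for w in words if w not in STOPWORDS)


def find_ids(text: str) -> list:
    """Return the distinct PMCID-like strings found, sorted."""
    return sorted(set(PMCID.findall(text)))


# Made-up sample text; the identifiers are placeholders.
sample = ("Growth of Zea mays in the field (PMC1234567). "
          "Zea mays yields rose; see PMC1234567 and PMC7654321.")

print(bag_of_words(sample).most_common(3))
print(find_ids(sample))
```

Both functions need only flat text, which is why this end point still works when Norma cannot recover sections.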

Pre/early Norma toolchain
Transforming PDF and PNG into higher value components
20150908. Diagram autogenerated from *.dot graph

getpapers/quickscrape/Norma workflow

getpapers/quickscrape/Norma: most common uses

AMI: inputs and outputs for common plugins

Earlier diagrams
Probably significantly out of date, but may
contain useful info.

[Diagram: NORMALIZE. Labels recovered from the slide:]
• Norma: convert PDF, XML to sHTML; tag sections
• Norma resources: PDF2SVG, XSL stylesheets, taggers, normalization parameters
• Normalized Scientific Literature
• AMI: Index, Transform, Extract, Search
• AMI resources: plugins, regex
• Extracted facts, indexes
• Storage: “Permanent” filestore, temporary filestore

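[Diagram: PDF to ScholarlyHTML via NORMA. Labels recovered from the slide:]
• PDF: non-Unicode, pixel glyphs, no words, no structures
• Tools: PDF2SVG, PNG OCR, SVGBuilder, XSLT1/2
• Intermediate: SVG; characters, sentences, paras, tables; high-level graphics
• ScholarlyHTML: tagged sections, captioned figures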

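[Diagram: raw HTML to ScholarlyHTML via NORMA. Labels recovered from the slide:]
• Raw HTML: not well-formed, bad character semantics
• Other inputs: XML, PNG
• Tools: HtmlTidy, Jsoup, HtmlUnit; XSLT1/2; per-journal stylesheets
• ScholarlyHTML: well-formed XHTML; tagged sections, captioned figures, tables, captioned tables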

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram labels: queues; repos; scientific literature; science plugins; science volunteers; collaboration with Open Access Button]

[Diagram: CANARY pipeline. Labels recovered from the slide:]
• Sources: scientific literature; repositories; crawl; feed; CAT-alogue index
• Inputs: PDF, TXT, XML, URL, DOI, DOC, CSV
• quickscrape: bespoke scrapers (XPath, per-journal)
• Norma: per-journal taggers; metadata; BadHTML, OCR, diagrams; sHTML
• AMI (Index & Transform): plugins for Regex, Sequences, Species, Chemistry, Phylogenetics, Farming
• Result: Open NORMA-lized Scientific Literature + Facts
