This is the evolving architecture of ContentMine (contentmine.org). It includes an overview (slide #2) showing getpapers, quickscrape, norma and ami.
The key container is the CTree, and the architecture shows where components are added to it or transformed within it.
These slides are dated and may be out of date with respect to the code. Some diagrams are autogenerated from *.dot files.
Please use http://discuss.contentmine.org/c/software as the main source of up-to-date info. Feel free to ask questions, offer help, critique, etc.
All software is Open (BSD, Apache2).
Architecture of ContentMine Components contentmine.org
1. Architecture of TheContentMine
These slides are for enlightenment and presentations. Use http://discuss.contentmine.org/t/overall-architecture/142 for up-to-date info. Questions, comments and critiques welcome! All s/w is Open (BSD/Apache2).
Some diagrams are autogenerated from *.dot files, which are located in the projects (mainly Norma and AMI).
2. Overview (diagram labels; latest 20150908)
Discovery: catalogue; getpapers query; daily crawl; ToC services
Sources: EuPMC, arXiv, CORE, HAL (UNIV repos); UNIV repos
Publisher sites: PloSONE, BMC, peerJ… Nature, IEEE, Elsevier…
Identifiers: URLs, DOIs
Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
Components: crawl; quickscrape; norma (Normalizer, Structurer, Semantic Tagger); ami
Content types: text, data, figures
Tagged content: abstract, methods, references, captioned figures (Fig. 1), HTML tables
Semantic ScholarlyHTML; Facts
Search, Lookup
CONTENT MINING: Chem, Phylo, Trials, Crystal, Plants
COMMUNITY plugins: scrapers, queries, taggers
Visualization and Analysis
30,000 pages/day
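The idea that each component adds files to, or transforms files inside, a per-paper CTree container can be sketched as follows. This is a minimal illustration and not ContentMine code: the stage-to-file mapping, the file names, the directory name `PMC0000000`, and the helpers `run_stage` and `list_ctree` are all assumptions made for the example.

```python
from pathlib import Path
import tempfile

# Hypothetical mapping from pipeline stage to the files it adds to a CTree.
# The file names are assumptions chosen for illustration only.
STAGE_OUTPUTS = {
    "getpapers": ["fulltext.xml"],
    "quickscrape": ["fulltext.html", "fulltext.pdf"],
    "norma": ["scholarly.html"],
    "ami": ["results/facts.xml"],
}


def run_stage(ctree: Path, stage: str) -> list:
    """Write placeholder outputs for one stage into the CTree directory."""
    added = []
    for rel in STAGE_OUTPUTS[stage]:
        target = ctree / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(f"placeholder written by {stage}\n")
        added.append(rel)
    return added


def list_ctree(ctree: Path) -> list:
    """Return the sorted relative paths of all files in the CTree."""
    return sorted(
        p.relative_to(ctree).as_posix() for p in ctree.rglob("*") if p.is_file()
    )


def demo() -> list:
    """Run the four stages in order on a temporary CTree and list its files."""
    with tempfile.TemporaryDirectory() as tmp:
        ctree = Path(tmp) / "PMC0000000"  # hypothetical paper identifier
        ctree.mkdir()
        for stage in ("getpapers", "quickscrape", "norma", "ami"):
            run_stage(ctree, stage)
        return list_ctree(ctree)


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that the directory is the unit of exchange: each stage reads what earlier stages left there and writes its own outputs alongside them.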
3. quickscrape, Norma, Index & Transform (diagram labels; latest 20150908; limited in scope)
CAT-alogue index: getpapers query; titles + links; daily crawl/feed; EuPMC; JToCs
Inputs: PDF, XML, URL, DOI, DOC, CSV
quickscrape: bespoke scrapers, XPath
Norma: taggers, per-journal; BadHTML, OCR, diagrams; sHTML
AMI plugins: Sequences, Species, Chemistry, Phylogenetics, Plants
12. NORMALIZE (diagram labels)
Norma: convert PDF, XML to sHTML; tag sections
Norma resources: PDF2SVG, XSL stylesheets, taggers, normalization parameters
Normalized scientific literature
AMI: index, transform, extract, search
AMI resources: plugins, regex
Storage: “Permanent” filestore; temporary filestore; extracted facts; indexes
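As a rough illustration of the regex-driven extraction named on this slide, the sketch below scans normalized text and records each match with its surrounding context. It is not AMI code: the function `extract_facts`, the `pre`/`exact`/`post` field names, the context window and the deliberately naive binomial-name pattern are assumptions for the example.

```python
import re


def extract_facts(text, pattern, window=20):
    """Return one dict per regex match, with the text before and after it."""
    facts = []
    for match in re.finditer(pattern, text):
        facts.append(
            {
                "pre": text[max(0, match.start() - window):match.start()],
                "exact": match.group(0),
                "post": text[match.end():match.end() + window],
            }
        )
    return facts


# Naive, hypothetical pattern: a capitalised word followed by a lowercase
# word of three or more letters. Real species recognition needs far more care.
BINOMIAL = r"\b[A-Z][a-z]+ [a-z]{3,}\b"

if __name__ == "__main__":
    sample = "Samples of Homo sapiens and Mus musculus were compared."
    for fact in extract_facts(sample, BINOMIAL):
        print(fact)
```

Keeping the context with each match lets a reader, or a later filtering step, judge whether a hit is a genuine fact or a false positive.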
13. PDF (diagram labels)
Input PDF: non-Unicode, pixel glyphs, no words, no structures
Tools: PDF2SVG; SVGBuilder; PNG OCR; NORMA; XSLT1/2
Intermediate: SVG (characters, high-level graphics); sentences, paras, tables
Output ScholarlyHTML: tagged sections, captioned figures
14. Raw HTML (diagram labels)
Input raw HTML: not well-formed, bad character semantics
Tools: HtmlTidy, Jsoup, HtmlUnit; XSLT1/2; NORMA with per-journal stylesheets
Intermediate: well-formed XHTML; XML; PNG; tables
Output ScholarlyHTML: tagged sections, captioned figures, captioned tables
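The first step on this slide, turning HTML that is not well-formed into something an XML toolchain such as XSLT can consume, can be sketched with Python's standard library. This is a simplified stand-in and not how HtmlTidy, Jsoup or HtmlUnit work: the class `WellFormer`, the function `to_wellformed`, the short `VOID` list and the close-everything-open repair rule are assumptions for illustration.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that never have content; emitted as self-closing tags.
VOID = {"br", "hr", "img", "meta", "link", "input"}


class WellFormer(HTMLParser):
    """Re-serialise tag soup so that every opened element is closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text}/>")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: drop it
        # Close every element opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and >


def to_wellformed(html):
    """Return a well-formed serialisation of a (single-rooted) HTML fragment."""
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


if __name__ == "__main__":
    raw = "<div><p>Fig. 1 <b>caption<br>text</p><p>A &amp; B</div>"
    fixed = to_wellformed(raw)
    print(fixed)
    ET.fromstring(fixed)  # raises ParseError if the result is not well-formed
```

The sketch ignores comments, doctypes, duplicate attributes and the HTML5 nesting rules, which is where the real tools listed on the slide earn their keep.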
15. RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Diagram labels: Queues; Repos; Scientific literature; Science; Plugins; Volunteers
Collaboration with Open Access Button