4. What is a Cow? The character encoding is clearly stated; the document uses a mark-up technology to identify components; the components have meaning and possibly behaviour associated with them; unreduced data is available.
5. What we thought the workflow should look like: Standoff Annotation File
12. OREChem (diagram labels: PDF, PSU, Soton, Atom, SVG, Text, Cam, CrystalEye, PubChem, Molecules, Gaussian workflow, ORE Triplestore, IU) http://research.microsoft.com/en-us/projects/orechem/
13. What can we do with a Cow? 5-Cyclobutyl-2,3-dihydro-[1H]-2-benzazepine 82: Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, 1.69 mmol) were added to the 2-nitrobenzene sulfonamide 50 (0.50 g, 1.302 mmol) in N,N-dimethylformamide (33 cm3) at room temperature and the mixture was stirred for 16 h. Deionised water (50 cm3) was added and the aqueous phase was extracted with ethyl acetate (5 x 50 cm3). The organic extracts were dried (MgSO4) and concentrated under reduced pressure to give the title compound 82 (0.259 g, 1.302 mmol, ca. 100%) as an oil used without further purification.
Most scientific research is communicated in a formal manner. Group vs rest of community. Full text and supplementary information. More data points require semantics. Sliding scale: syntax, vocabulary, ontology, model. (Re)use: very hard; it has required human glue before now. This is why we need semantics.
Scan of a printout. Picture with text. Comp chem: more structure, but still hard. Free text.
Character encoding: many papers are unreadable because the various glyphs are unresolved. Markup: XML, RDF, Semantic Web. The components have meaning and possibly behaviour associated with them: ontology. Not just interpreted data. Not the whole document: sometimes entities, sometimes sections.
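One way glyphs end up unresolved is an encoding that is not stated and has to be guessed. The sketch below decodes the same bytes once with the encoding they were written in and once with a wrong guess. The sample string and the helper name `decode_as` are illustrative assumptions, not taken from any paper discussed here.

```python
def decode_as(raw: bytes, encoding: str) -> str:
    """Decode raw bytes under an assumed encoding."""
    return raw.decode(encoding)


# Illustrative sample (an assumption) with non-ASCII glyphs common in chemistry text.
SAMPLE = "5 × 50 cm³ at 25 °C"
RAW = SAMPLE.encode("utf-8")

stated = decode_as(RAW, "utf-8")     # encoding is known: the glyphs resolve
guessed = decode_as(RAW, "latin-1")  # wrong guess: each glyph becomes two characters

print(stated)
print(repr(guessed))
```

With the encoding stated, the text round-trips exactly; with the wrong guess, every non-ASCII glyph is silently replaced by two unrelated characters and no error is raised.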
PDF to text: hard. SAF (Standoff Annotation File). OSCAR.
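The idea behind standoff annotation can be sketched in a few lines of Python: the text is left untouched and each annotation is stored separately as character offsets plus a label. This is only an illustration of the principle, not the actual SAF format; the `Annotation` class, the `annotate` helper and the term list are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int   # offset of the first character
    end: int     # offset one past the last character
    label: str   # hypothetical label, e.g. "chemical"


def annotate(text, terms):
    """Record the offsets of every occurrence of each term; the text is not modified."""
    annotations = []
    for term, label in terms.items():
        pos = text.find(term)
        while pos != -1:
            annotations.append(Annotation(pos, pos + len(term), label))
            pos = text.find(term, pos + 1)
    return sorted(annotations, key=lambda a: a.start)


TEXT = ("Deionised water (50 cm3) was added and the aqueous phase "
        "was extracted with ethyl acetate (5 x 50 cm3).")
TERMS = {"water": "chemical", "ethyl acetate": "chemical"}  # illustrative lexicon
ANNOTATIONS = annotate(TEXT, TERMS)

for a in ANNOTATIONS:
    print(a.start, a.end, a.label, TEXT[a.start:a.end])
```

Because the annotations only point into the text, several independent layers (tokens, entities, sections) can refer to the same unchanged source.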
NCEs. Chemical terms. Chemical data. OMII. Sections are important: false positives.
The only way to determine sections correctly is to preprocess the document before it goes into OSCAR, using SciXML to hold the section information. This is hard with PDF because of the loss of line breaks and because of text from pictures.
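A minimal sketch of why section information helps, assuming a made-up XML layout: the element and attribute names below are hypothetical and are not real SciXML, and the tiny lexicon matcher is a stand-in, not OSCAR. Restricting the search to the experimental section removes a false positive that appears when the whole document is scanned.

```python
import re
import xml.etree.ElementTree as ET

# Made-up document; element and attribute names are hypothetical, not real SciXML.
DOC = """<paper>
  <section type="experimental">Potassium carbonate and thiophenol were added.</section>
  <section type="acknowledgements">We thank Dr Gold for helpful discussions.</section>
</paper>"""

# Tiny stand-in lexicon; a real recogniser is far more sophisticated.
LEXICON = {"thiophenol", "gold"}


def naive_find(text):
    """Return the words that appear in the lexicon, ignoring case."""
    return [w for w in re.findall(r"[A-Za-z]+", text) if w.lower() in LEXICON]


def section_texts(xml_text, wanted):
    """Return the text of every section whose type is in `wanted`."""
    root = ET.fromstring(xml_text)
    return [s.text or "" for s in root.iter("section") if s.get("type") in wanted]


whole_document = "".join(ET.fromstring(DOC).itertext())
without_sections = naive_find(whole_document)
with_sections = naive_find(" ".join(section_texts(DOC, {"experimental"})))

print(without_sections)
print(with_sections)
```

Scanning everything picks up the surname in the acknowledgements as if it were a chemical; scanning only the experimental section does not.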
SciXML: sections, formatting. Embedded objects can be directly turned into CML (JumboConverters). Suddenly find DataXML too.
DataXML loses formatting: regex. Hard to recombine. Need to know which data is associated with which preparation, and hence which molecule. Each step adds semantics: incremental addition of information.
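As an illustration of the regex step, the sketch below pulls value and unit pairs out of the preparation text from slide 13. The pattern, the unit list and the function name `extract_quantities` are assumptions made for this example; they are not the expressions used in the actual workflow.

```python
import re

# Hypothetical pattern: a number, optional whitespace, then one of a few known units.
QUANTITY = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>mmol|cm3|g|h|%)(?![A-Za-z])"
)


def extract_quantities(text):
    """Return (value, unit) pairs in the order they occur in the text."""
    return [(m.group("value"), m.group("unit")) for m in QUANTITY.finditer(text)]


PREPARATION = (
    "Potassium carbonate (0.63 g, 4.56 mmol) and thiophenol (0.19 g, 1.69 mmol) "
    "were added to the 2-nitrobenzene sulfonamide 50 (0.50 g, 1.302 mmol) in "
    "N,N-dimethylformamide (33 cm3) at room temperature and the mixture was "
    "stirred for 16 h."
)

for value, unit in extract_quantities(PREPARATION):
    print(value, unit)
```

Bare numbers such as the compound number 50 are skipped because no unit follows them. The pattern finds quantities but does not say which substance each one belongs to, which is the recombination problem noted above.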
Object Reuse and Exchange
We know that this is a preparation. Bold numbers. Stir phrase. Add phrase.
Tokens. Entities. POS. Chunking.
Tokens in boxes. Double boxes = entities.
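The distinction between plain tokens and entities can be sketched as follows. This is a toy, assuming a hand-written lexicon and a greedy longest-match rule; it is not how OSCAR recognises entities, and the names `tokenise`, `mark_entities` and `ENTITY_LEXICON` are hypothetical.

```python
import re

# Hypothetical lexicon of entity names, each stored as a tuple of lower-case tokens.
ENTITY_LEXICON = [("potassium", "carbonate"), ("thiophenol",), ("ethyl", "acetate")]


def tokenise(text):
    """Split text into numbers, words and single punctuation characters."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+|[^\sA-Za-z\d]", text)


def mark_entities(tokens):
    """Label each token, merging lexicon matches (longest first) into one entity."""
    marked = []
    i = 0
    while i < len(tokens):
        for entity in sorted(ENTITY_LEXICON, key=len, reverse=True):
            n = len(entity)
            if tuple(t.lower() for t in tokens[i:i + n]) == entity:
                marked.append((" ".join(tokens[i:i + n]), "ENTITY"))
                i += n
                break
        else:
            # No lexicon entry starts here, so this is an ordinary token.
            marked.append((tokens[i], "TOKEN"))
            i += 1
    return marked


MARKED = mark_entities(tokenise("Potassium carbonate and thiophenol were added"))
for text, kind in MARKED:
    print(kind, text)
```

In the terms of the slide, every item in the output is a box and the items labelled ENTITY are the double boxes.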
chunks
Complete description of the reaction and added data (structures). The following query could be used to search for all reactions using N,N-dimethylformamide as a solvent and with yields greater than 80%.
SELECT ?preparation
WHERE {
  ?preparation hasSubstance ?substance .
  ?substance hasMolecule <http://www.polymerinformatics.com/#DMF> .
  ?substance hasRole <http://www.polymerinformatics.com/#Solvent> .
  ?preparation hasSubstance ?product .
  ?product hasYield ?yield .
  FILTER(?yield > 80)
}
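To show what the query pattern matches, here is a plain-Python sketch that applies the same conditions to a handful of triples held in a list. The triples, the node names (`prep1`, `prep2`, and so on) and the yields are made up for illustration; a real deployment would run SPARQL against the triplestore instead.

```python
DMF = "http://www.polymerinformatics.com/#DMF"
SOLVENT = "http://www.polymerinformatics.com/#Solvent"

# Made-up triples; node names and yields are illustrative only.
TRIPLES = [
    ("prep1", "hasSubstance", "prep1_solvent"),
    ("prep1_solvent", "hasMolecule", DMF),
    ("prep1_solvent", "hasRole", SOLVENT),
    ("prep1", "hasSubstance", "prep1_product"),
    ("prep1_product", "hasYield", 95),
    ("prep2", "hasSubstance", "prep2_solvent"),
    ("prep2_solvent", "hasMolecule", DMF),
    ("prep2_solvent", "hasRole", SOLVENT),
    ("prep2", "hasSubstance", "prep2_product"),
    ("prep2_product", "hasYield", 60),
]


def objects(subject, predicate):
    """Return every object of the triples with this subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def preparations_in_dmf(min_yield):
    """Preparations that use DMF as a solvent and have a product yield above min_yield."""
    found = set()
    for prep, predicate, substance in TRIPLES:
        if predicate != "hasSubstance":
            continue
        if DMF not in objects(substance, "hasMolecule"):
            continue
        if SOLVENT not in objects(substance, "hasRole"):
            continue
        for product in objects(prep, "hasSubstance"):
            if any(y > min_yield for y in objects(product, "hasYield")):
                found.add(prep)
    return found


print(preparations_in_dmf(80))
```

Each `if` mirrors one triple pattern of the query, and the final comparison plays the role of the FILTER clause.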
Maps outside. 55 compounds made. Completely new view of this thesis.
University of Cambridge (UC) and the University of Southern Queensland (USQ), funded by the JISC. Integrated repository deposition into author workflow. Fine-grained embargo. ICE allows linking / inclusion of external data files. Chem4Word: semantic authoring for chemistry. Linked zones. Chemically intelligent authoring.