1. Alex D. Wade
Senior Research Program Manager
External Research
Microsoft Research
Microsoft Corporation
2. • Science @ Microsoft
– and the role of Scholarly Communication
• Office 2007
– File Format Overview
– Bibliography Support
– UI Extensibility
• A Sampling of Related Projects
3. Putting computing into science…
Applying Microsoft products and research technologies to
advance the scientific research and engineering innovation
process
Putting science into computing…
Ensuring that research community requirements are factored into
future versions of Microsoft software
• Advancement of Science
• Global Collaboration
• Technology Excellence
• Interoperability
4. • Science + computation are not the entire equation
• Authoring, Analysis, Publishing, Discoverability, and Data
Storage/Preservation are key components to scientists’
everyday work…and Microsoft’s core businesses
• The scholarly community has made it clear to us:
• Microsoft must improve its offerings throughout the
scholarly communication lifecycle
• Our approach: Conduct prototyping projects and
proofs-of-concept to evolve Microsoft’s scholarly
communication offerings
5. • Data Acquisition and Modeling
– Data capture from source, cleaning, storage, etc.
– SQL Server, SQL Integration Services, Windows Workflow Foundation
• Support Collaboration
– Allow researchers to work together, share context, facilitate interactions
– SharePoint Server, OneNote 2007 (shared)
• Data Analysis, Modeling, and Visualization
– Mining techniques (OLAP, cubes) and visual analytics
– SQL Analysis Services, BI, Excel, Optima, SILK (MSR-A)
• Disseminate and Share Research Outputs
– Publish, Present, Blog, Review and Rate
– Word, PowerPoint
• Archiving
– Published literature, reference data, curated data, etc.
– SQL Server
Microsoft is the only company that can offer end-to-end support
6. • Optimize for data-driven research & science
– For both data (scientific) and information (scholarly publications)
– Reproducible research + computational science
– Properly document / annotate scholarly output
• Interoperability is paramount
– Actively drive consensus around technical standards and standardized protocols that the community
proactively adopts; enable broad community engagement
• Customers have told Microsoft that interoperability (and intellectual property) are OUR responsibility
• Data preservation (and provenance) should be baseline
– Documentation of the data’s provenance
– Reliable and secure long-term storage – at a massive scale
– Preservation needs to be like “accessibility” features – i.e., assumed as required
• Social networking & semantic knowledge discovery
– Harnessing collective intelligence must be a consideration – since accessing research is a core step in the
life-cycle. Enable knowledge discovery
– Optimize for Web 2.0 scenarios and allow end-users/experts to find things more easily
• Metadata conventions / taxonomies / ontologies
– This is a crucial strength for libraries – and a critical component in enabling Web 2.0
7. • New file format
– New file extension (DOCX)
– All content expressed in XML (Office Open XML)
– Contained in a zip file (OPC)
• ECMA-376 specification & ISO Standard
– OpenXML
– Open Packaging Conventions
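The "XML parts in a zip container" layout can be sketched with the Python standard library alone. This is a minimal illustration, not a complete document: the helper names `build_minimal_docx` and `list_parts` are made up for this example, and the three parts shown are only the smallest plausible set (content types, root relationships, main document part).

```python
import io
import zipfile

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Root relationships: points at the main document part.
ROOT_RELS = (
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)

# The document body, expressed in WordprocessingML.
DOCUMENT = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Hello, OpenXML</w:t></w:r></w:p></w:body>'
    '</w:document>'
)


def build_minimal_docx() -> bytes:
    """Write the three parts into an in-memory zip archive (hypothetical helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", ROOT_RELS)
        package.writestr("word/document.xml", DOCUMENT)
    return buffer.getvalue()


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the names of the parts stored in the package."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        return package.namelist()


docx_bytes = build_minimal_docx()
parts = list_parts(docx_bytes)
print(parts)
```

Because the container is an ordinary zip archive, any zip tool can open a .docx file and show the same list of parts.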
8. • Easy to access the different parts of document
– XML file
– Images
– Annotations
• Simpler to transform Word’s XML into other XML formats
or extract relevant data
• Ability to build .docx files programmatically or through
transformations
• Ability to extend Word UI (and content) to support
additional or custom data
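As a rough sketch of "extracting relevant data" from Word's XML, the snippet below pulls the paragraph text out of a small WordprocessingML fragment. The sample XML and the function name `paragraph_texts` are invented for illustration; in a real .docx the same XML would be read from the `word/document.xml` part of the package.

```python
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative sample: two paragraphs, the second split across two runs.
SAMPLE_DOCUMENT_XML = f"""<w:document xmlns:w="{W_NS}">
  <w:body>
    <w:p><w:r><w:t>Putting computing into science.</w:t></w:r></w:p>
    <w:p>
      <w:r><w:t xml:space="preserve">Putting science </w:t></w:r>
      <w:r><w:t>into computing.</w:t></w:r>
    </w:p>
  </w:body>
</w:document>"""


def paragraph_texts(document_xml: str) -> list[str]:
    """Return one string per w:p element, joining the text of its w:t runs."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        runs = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs


texts = paragraph_texts(SAMPLE_DOCUMENT_XML)
print(texts)
```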
9. • Compatibility pack
– Open and save to docx from older Word versions
• Add-in to export to PDF or XPS
• ODF Converter
– Open Source project on SourceForge
– Provides two-way conversion between ODF and
OpenXML (WordprocessingML, SpreadsheetML, and
PresentationML)
– ‘Save As ODF’ to be included in Office 2007 SP2
11. • Sources saved as Bibliography XML
• Sources.XML contains all sources
• Sources can be imported into new documents
for easy reuse
• Sources.XML can be shared between users
• Documentation Styles are XSLTs
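Since sources are stored as plain XML, they can be read outside Word. The sketch below parses a small hand-written sample; the element layout (`b:Sources`, `b:Source`, `b:Tag`, nested `b:Author`/`b:NameList`/`b:Person`) and the namespace are assumptions based on the Office bibliography schema and should be checked against a real Sources.xml. The `format_source` function is only a stand-in for what a Documentation Style XSLT does, and is far simpler than one.

```python
import xml.etree.ElementTree as ET

# Assumed namespace of the Office bibliography vocabulary.
NS = {"b": "http://schemas.openxmlformats.org/officeDocument/2006/bibliography"}

# Hand-written sample source; the entry itself is fictitious.
SAMPLE_SOURCES_XML = """<b:Sources
    xmlns:b="http://schemas.openxmlformats.org/officeDocument/2006/bibliography">
  <b:Source>
    <b:Tag>Exa08</b:Tag>
    <b:SourceType>JournalArticle</b:SourceType>
    <b:Title>An Example Article</b:Title>
    <b:Year>2008</b:Year>
    <b:Author><b:Author><b:NameList>
      <b:Person><b:Last>Example</b:Last><b:First>Ada</b:First></b:Person>
    </b:NameList></b:Author></b:Author>
  </b:Source>
</b:Sources>"""


def person_name(person: ET.Element) -> str:
    """Render one b:Person as 'Last, First'."""
    last = person.findtext("b:Last", default="", namespaces=NS)
    first = person.findtext("b:First", default="", namespaces=NS)
    return f"{last}, {first}"


def read_sources(xml_text: str) -> list[dict]:
    """Collect tag, title, year, and authors for each b:Source element."""
    root = ET.fromstring(xml_text)
    sources = []
    for source in root.findall("b:Source", NS):
        sources.append({
            "tag": source.findtext("b:Tag", default="", namespaces=NS),
            "title": source.findtext("b:Title", default="", namespaces=NS),
            "year": source.findtext("b:Year", default="", namespaces=NS),
            "authors": [person_name(p) for p in source.findall(".//b:Person", NS)],
        })
    return sources


def format_source(source: dict) -> str:
    """Toy formatter standing in for a Documentation Style."""
    return f"{'; '.join(source['authors'])} ({source['year']}). {source['title']}."


sources = read_sources(SAMPLE_SOURCES_XML)
print(format_source(sources[0]))
```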
12. • Citations and Bibliographies can be inserted
inline with a single click
• Automatically Formatted according to active
Documentation Style
15. • Tools for Authors
– Search Commands in Office
– Ribbon for Researchers
• Semantic Information
– Ontology-based markup of scholarly papers
– Authoring of chemical drawings + semantic information
– NLM DTD (Pablo Fernicola)
• Data Preservation & Access
– File format preservation + interoperability
– Scientific datasets for research reproducibility
– Publisher submission workflow for dataset archiving
16. Search Commands in Office
Office Labs
Goals
• Office 2007 Add-in that aids in finding commands, options, wizards and
galleries in Word, Excel and PowerPoint
• Includes Guided Help, which acts as a tour guide for specific tasks
Project Status
• Available now via http://www.officelabs.com/projects/searchcommands/
19. • Search against the Live Search Academic service straight from within Word
• One-click insert to the bibliography
• Integration with various services
20. Semantic Markup in Word 2007
with UC San Diego
Goals
• Semantic markup using domain-specific ontologies and controlled vocabularies
• Facilitate/automate referencing to PDB (and other resources) from manuscript
• A domain-specific ontology is downloaded and made available from within
Microsoft Word 2007
• Authors can record their intention, the meaning of the terms they use based on
their community’s agreed vocabulary
Project Status
• Phase 1 complete
• Beta testing with PLoS later this year
21. • Domain-specific ontology
• Annotations travel with the document
• Can be used to improve domain-specific discovery of information, cross-linking, etc.
• Support for annotations straight from within Word
22. Chemistry Drawing for Office
Preliminary investigation
Goals
• Support students/researchers in simple chemistry structure
authoring/editing
• Storage and transportability of semantic chemical data not just images via
Chemistry Markup Language (CML)
• Enable automatic extraction/harvesting of chemical data
Project Status
• Early investigation stage
• Will be encouraging on-going publisher feedback
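To illustrate why semantic chemical data is easier to harvest than an image, the sketch below reads atoms and bonds from a small Chemical Markup Language fragment. The water molecule is a hand-written sample, the function names are invented for this example, and nothing here reflects how the Office prototype itself stores or extracts CML.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace commonly used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hand-written sample: water, with three atoms and two single bonds.
SAMPLE_CML = """<molecule xmlns="http://www.xml-cml.org/schema" id="water">
  <atomArray>
    <atom id="a1" elementType="O"/>
    <atom id="a2" elementType="H"/>
    <atom id="a3" elementType="H"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
    <bond atomRefs2="a1 a3" order="1"/>
  </bondArray>
</molecule>"""


def element_counts(cml_text: str) -> Counter:
    """Count atoms per chemical element in a CML molecule."""
    root = ET.fromstring(cml_text)
    atoms = root.findall(".//cml:atom", CML_NS)
    return Counter(atom.get("elementType") for atom in atoms)


def bonds(cml_text: str) -> list[tuple[str, ...]]:
    """Return each bond as a tuple of the atom ids it connects."""
    root = ET.fromstring(cml_text)
    return [
        tuple(bond.get("atomRefs2", "").split())
        for bond in root.findall(".//cml:bond", CML_NS)
    ]


counts = element_counts(SAMPLE_CML)
print(dict(counts), bonds(SAMPLE_CML))
```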
23. PLANETS
Long-term Preservation of Digital Objects
Organization
• EU Commission Project, €14M for 4 years
• Consortium of 5 national libraries, 4 national archives, 4 universities and 4
industry partners
Goals
• Tools and methods for sustainable long-term preservation of digital objects
• Preservation of Office Documents based on OpenXML
Project Status
• OpenXML conversion tools available now:
– http://research.microsoft.com/research/rpp/projects/MSConversionTools/OpenXMLConversionTools.htm
24. GenePattern for Word 2007
with Broad Institute @ MIT
Goals
• Integrate data/images from GenePattern workflows into research papers.
• Allow for research reproducibility by combining data with the text
• Highlight OpenXML and Office 2007 technologies and break new research
ground with the integration of data & workflows with research papers
• Testing/linkage to other labs – moving beyond initial installation
Project Status
• Currently in final phase of testing
• Will move into production in June 2008
• Code to be published at http://www.codeplex.com
25.
26. Data Archive Project
with Johns Hopkins University
Goals
• Mechanism for long-term preservation of data sets
• Authoring tool to support creation of relationship resource map
• Use of OAI-ORE resource maps for collection description
• Workflow for text & data linkage between publisher and data archive
27. • Author: Word 2007 OPC format contains data set(s) as well as resource map of relationships.
• Publisher: retains article and replaces it with the article URL. Forwards data to Data Archive.
• Archive: stores data set(s) and returns data set URL(s) to publisher as part of updated resource map.
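The author, publisher, archive hand-off can be sketched as a toy simulation. Everything here is hypothetical: the `ResourceMap` class is a plain name-to-location mapping rather than a real OAI-ORE serialization, the `package:` locations and the example.org URLs are placeholders, and the function names do not come from the project.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceMap:
    """Simplified stand-in for a resource map: resource name -> location."""
    locations: dict[str, str] = field(default_factory=dict)


@dataclass
class Submission:
    """What the author sends: article text, data sets, and the resource map."""
    article: str
    datasets: dict[str, bytes]
    resource_map: ResourceMap


def author_package(article: str, datasets: dict[str, bytes]) -> Submission:
    """Author step: everything initially lives inside the document package."""
    locations = {"article": "package:/word/document.xml"}  # placeholder location
    for name in datasets:
        locations[name] = f"package:/data/{name}"  # placeholder location
    return Submission(article, dict(datasets), ResourceMap(locations))


def archive_store(datasets: dict[str, bytes], store: dict[str, bytes]) -> dict[str, str]:
    """Archive step: keep each data set and return its (hypothetical) URL."""
    urls = {}
    for name, payload in datasets.items():
        store[name] = payload
        urls[name] = f"https://archive.example.org/datasets/{name}"
    return urls


def publisher_process(submission: Submission, store: dict[str, bytes]) -> ResourceMap:
    """Publisher step: keep the article, forward data, update the resource map."""
    updated = ResourceMap(dict(submission.resource_map.locations))
    updated.locations["article"] = "https://publisher.example.org/articles/1"
    updated.locations.update(archive_store(submission.datasets, store))
    return updated


archive: dict[str, bytes] = {}
submission = author_package("Article text", {"results.csv": b"x,y\n1,2\n"})
final_map = publisher_process(submission, archive)
print(final_map.locations)
```

After the round trip, every entry in the resource map points at a publisher or archive URL instead of a location inside the original package.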
28. • Direct publisher/repository submission via Word
• Research Output Repository Platform
• Conference Management Tool
• eJournal Service
• …
Alex D. Wade
alex.wade@microsoft.com
http://www.microsoft.com/science/
29. Compatibility packs for older versions of Word
• http://www.microsoft.com/downloads/details.aspx?FamilyId=941B3470-3A
Add-in for saving to PDF or XPS
• http://www.microsoft.com/downloads/details.aspx?FamilyId=4D951911-3E
SDK for OpenXML formats
• http://msdn2.microsoft.com/en-us/library/bb448854.aspx
Developer community forum
• http://openxmldeveloper.org/
Open Source OpenXML/ODF converter (both ways)
• http://sourceforge.net/projects/odf-converter/
30.
31. Microsoft ventures into open access chemistry
Royal Society of Chemistry
By Richard van Noorden
January 29th, 2008
http://www.rsc.org/chemistryworld/News/2008/January/29010803.asp
Computational chemists have secured funding from computing giant Microsoft to showcase how chemistry can benefit from open access data sharing on the
internet.
The two-year eChemistry pilot project represents 'a major test case' for proposed new protocols for sharing scholarly information over the web, said Lee Dirks,
director of scholarly communications at Microsoft Research. Microsoft's support is also a boost for the small band of chemists keen to promote open access
internet publishing.
The public-private collaboration is one of many Microsoft projects to probe the potential of computing to advance scientific research,
and bring back what they learn to improve the company's product line, Dirks told Chemistry World. 'But chemistry is a discipline we've not
typically worked in,' he said. 'From everything I've heard, it's not as progressive a field as, say, astronomy in use of the web'.
Most chemical information on the web is published in closed journals and databases which guarantee high quality but also require a subscription to view.
Pre-print servers, collaborative documents, open databases, video sites, online lab notebooks and blogs provide other ways of communicating research. Combining
the lot offers the enticing prospect of a vast, free-to-access repository. This could transform the sharing of scientific research if the disparate data
sources were machine-readable, so that a search engine could automatically gather data about a particular molecule from a crystal
structure, a movie, an online lab book, and an archived article, for example.
Radical change
The international standards required for this challenge are being developed by the Open Archives Initiative Object Reuse and Exchange Project (OAI-ORE),
based at Cornell University, Ithaca, US. Their model protocols will be officially launched on 3 March at Johns Hopkins University in Maryland.
The eChemistry project, Dirks explained, was chosen as an exemplar to show that the new standards are actually useful to scientists. Chemists and computer
scientists at Cambridge and Southampton universities in the UK, and Indiana, Cornell, and Penn State in the US, will search and index existing online
databases and print archives; and work out how best to record chemistry data captured in lab experiments. The results will be hosted by the US National
Institutes of Health open access PubChem database and other repositories.
32. http://chronicle.com/daily/2008/02/1585n.htm
Monday, February 11, 2008
Researchers Develop Online Tools for Science Collaborations
By LILA GUTERMAN
Blogs, wikis, and social-networking sites such as Facebook may get media buzz these days, but for scientists, engineers, and doctors, they are not even on the radar.
The most effective tools of the Internet for such people tend to be efforts more narrowly targeted to their needs, such as software that helps geneticists replicate one
another's experiments. That was the underlying message of many presentations at the annual conference of the Professional/Scholarly Publishing Division of the
Association of American Publishers held here last week.
Philip E. Bourne, a professor of pharmacology at the University of California at San Diego, spoke about the Web site SciVee, where scientists can link
videos to their research papers that appear in open-access biomedical journals (The Chronicle, August 21, 2007). Mr. Bourne, who created the site,
calls the videos pubcasts; they are typically about 10 minutes long and go into more detail than an abstract but less than the full-length article.
The videos are coming in at a trickle, says Mr. Bourne. (He attributes the slow rate to the high quality: the graduate students and postdoctoral
researchers who make the videos have been crafting polished presentations.) But some of the ones already online have been viewed more than
100,000 times. When the pubcasts are uploaded, Mr. Bourne has also witnessed a steep increase in downloads of the linked article.
Jill P. Mesirov described an application that she hopes will ultimately become mainstream for journals that publish computational science. Ms. Mesirov,
director of computational biology and bioinformatics at the Broad Institute of Massachusetts Institute of Technology and Harvard University, has
designed a way to make computational work repeatable by other scientists.
The software, called GenePattern, stores both data and analytical routines. As the researcher works to collect and analyze the data, GenePattern
records the steps the scientist has taken, so that anyone else can follow the steps and check the result or expand on the method using new data. Ms.
Mesirov said that more than 6,000 people from more than 100 countries use the software.
She is now working with Microsoft to link such information to manuscripts that could be published online by peer-reviewed journals, to give
readers access to a researcher's computational methods. "One of the problems with publishing a paper that relies heavily on computational work,"
she said, "is that all of the methods that you would need to reproduce it never appear in the journal. If you're lucky, they're in the supplementary material
[online]. How much better if the journal had a link to the paper which had the data and an instantiation of the method embedded right in that paper."
33. How can we be sure we’ll remember our digital past?
Christian Science Monitor
By Chris Gaylord
February 13th 2008
http://www.csmonitor.com/2008/0214/p13s02-stct.html
Fading media, formats
The problem of digital preservation reaches across two standards. There's the media – floppies, CDs, hard drives – and the format of the files
themselves – does it run in DOS, Hypercard, ClarisWorks 2.0?
Microsoft tackles this issue of "legacy" computing by running a kind of corporate museum. The company protects its multiplatform history by
preserving old copies of "every major hardware and software change," says Lee Dirks, director of Scholarly Communications at Microsoft and a task
force member.
"We've got computers stored on campus that go back to the Altair, the first computer [to run Microsoft software]," he says. "In fact, we bought
multiple copies of the Altair just in case."
But maintaining antique computers is a costly way to keep the past alive.
A concept that is gaining momentum, Mr. Dirks says, is emulation, where programmers trick modern computers into thinking the way
their classic cousins did. This lets them run old software without retro machines. Another problem arises when the emulator itself is
written for last generation's operating systems. Do you write an emulator to handle the original emulator?
A more likely approach to long-term preservation is migration, says Berman. This calls for updating the file format every generation –
without changing the contents, one hopes. This method has problems, as well. Some of the original context will be lost in translation,
says Dirks. Also, the scale of the conversion will snowball as the number, size, and back-catalog of the files increases with each
passing generation of technology.
35. • “Global Research Library 2020” with University of Washington
(Oct07 and Mar08)
• Participating in two applications to the final round of the NSF
“DataNet” solicitation (as an unfunded partner)
• Sponsoring BioMed Central’s 2007 Research Awards (Mar08)
• Aug07 Issue of CT Watch Quarterly (v. 3, no. 3)
“The Coming Revolution in Scholarly Communications & Cyberinfrastructure”
http://www.ctwatch.org/quarterly/articles/2007/08/
• New Scholarly Publishing website at:
– http://www.microsoft.com/mscorp/tc/scholarly-publishing.mspx