1. The document discusses the challenges and opportunities of enabling public access to publicly funded research, including papers, data, and research processes.
2. It notes that while making papers openly accessible through author manuscript deposit or open access journals has made progress, sharing full data and detailed methodological descriptions remains much more difficult.
3. The document argues that as a major public funder of research, STFC should lead the effort to establish international standards and infrastructure for open sharing of all research outputs and processes in order to maximize the return on public investment in science.
This document discusses open access to scientific research data. It notes that scientific research is increasingly data-driven and large-scale, especially in fields like high-energy physics, astronomy, and biology. However, inadequate access to research data is a problem, limiting opportunities to reuse data and validate or build upon past findings. The document examines some incentive-based approaches and key developments related to improving data sharing. It provides examples of large-scale data generation projects and challenges around managing and analyzing big data. Overall, the document argues that unrestricted sharing of scientific data deposited in the public domain could accelerate research and advance knowledge.
Prateek Jain dissertation defense, Kno.e.sis, Wright State University
The recent emergence of the “Linked Data” approach for publishing data represents a major step forward in realizing the original vision of a web that can "understand and satisfy the requests of people and machines to use the web content" – i.e. the Semantic Web. This new approach has resulted in the Linked Open Data (LOD) Cloud, which includes more than 70 large datasets contributed by experts belonging to diverse communities such as geography, entertainment, and life sciences. However, the current interlinks between datasets in the LOD Cloud – as we will illustrate – are too shallow to realize many of the benefits promised. If this limitation is left unaddressed, the LOD Cloud will merely be more data suffering from the same kinds of problems that plague the Web of Documents, and the vision of the Semantic Web will fall short.
This thesis presents a comprehensive solution to the problem of alignment and relationship identification using a bootstrapping-based approach. By alignment we mean the process of determining correspondences between the classes and properties of ontologies. We identify subsumption, equivalence, and part-of relationships between classes; part-of relationships between instances; and subsumption and equivalence relationships between properties. By bootstrapping we mean utilizing the information contained within the datasets to improve the data within them. The work showcases the use of bootstrapping-based methods to identify and create richer relationships between LOD datasets. The BLOOMS project (http://wiki.knoesis.org/index.php/BLOOMS) and the PLATO project, both built as part of this research, provide evidence of the feasibility and applicability of the solution.
This document discusses open science and open data requirements. It notes that funders like NIH now require data sharing plans for large grants and journals require data to be shared publicly. Future policies like FASTR aim to make federally funded research results freely available. Researchers are encouraged to use repositories like the Allen Institute to share data in discoverable, accessible, intelligible, assessable and usable ways. Institutions like OHSU aim to help researchers manage data sharing requirements and make their data more openly available and meaningful through initiatives like Open Insight. While some researchers may be hesitant to share data, doing so can help work towards the common goals of increasing transparency, reproducibility, and value from research efforts.
Paradise Lost and The Right to Read is the Right to Mine (petermurrayrust)
Presented to UIUC CIRSS seminars to a mixed group of Library, CS, domain scientists with a great contingent of Early Career Researchers. Starts by honouring the creation of the wonderful NCSA Mosaic at UIUC in 1993 and the paradise of knowledge and community it opened. Then shows the gradual and tragic decline of the web into a megacorporate neocolonialist empire, where knowledge is sacrificed for money and power.
You have seen many of the slides before but the words are different and have been recorded.
data management, information management, data, big data, personal organization, organization, file management, scientific research, research, project management, data security, file naming conventions, data management plan
Philip Bourne gave a lecture on accelerating scientific discovery. He argued that data and knowledge need to interoperate better through improved publications and data archives. Researchers also need better tools to analyze, visualize and annotate data, and reward systems must change to incentivize new forms of scholarship like data curation. Scientist management tools are needed to help researchers better organize and share their work, findings and collaborations. The full power of internet technologies like video should also be leveraged more.
The document summarizes the Chemist's Toolkit for publishing and promoting work online. It discusses open access publishing models, federal funding reporting mandates, retaining rights through author addenda, copyright and creative commons licensing. The toolkit contents are changing as publishing models evolve with new technologies, and it's important to maintain the toolkit by staying aware of developments. Globalization is increasing international collaborations which impacts cultural expectations around publishing.
Transcript - DOIs to support citation of grey literature (ARDC)
24th May 2017
This webinar was the first in a series examining persistent identifiers and their use in research. It begins with a brief introduction on the use of persistent identifiers in research followed by an outline of how UNSW has approached supporting discovery and citation of grey literature.
Watch the full webinar: https://www.youtube.com/watch?v=TLXYwrBu8wc
This document discusses the roles that cloud computing and virtualization can play in reproducible research. It notes that virtualization allows for capturing the full computational environment of an experiment. The cloud builds on this by providing scalable resources and services for storage, computation and managing virtual machines. Challenges include costs, handling large datasets, and cultural adoption issues. Databases in the cloud may help support exploratory analysis of large datasets. Overall, the cloud shows promise for improving reproducibility by enabling sharing of full experimental environments and resources for computationally intensive analysis.
The document provides a summary of a lecture on CSCW in times of change and social media. The lecture discusses how CSCW and social media are transforming organizations into networked structures and how personalization of data is enabling personalized paths for consumers. It also explores applications of these changes in domains like science and health, and outlines future challenges in areas like open science, linked data, and mobile technologies.
Brown Bag Talk with Micah Altman: Sources of Big Data for Social Sciences
This talk is part of the MIT Program on Information Science brown bag series (http://informatics.mit.edu)
This talk reviews emerging big data sources for social scientific analysis and explores the challenges these present. Many of these sources pose distinct challenges for acquisition, processing, analysis, inference, sharing, and preservation.
Dr Micah Altman is Director of Research and Head/Scientist, Program on Information Science for the MIT Libraries, at the Massachusetts Institute of Technology. Dr. Altman is also a Non-Resident Senior Fellow at The Brookings Institution. Prior to arriving at MIT, Dr. Altman served at Harvard University for fifteen years as the Associate Director of the Harvard-MIT Data Center, Archival Director of the Henry A. Murray Archive, and Senior Research Scientist in the Institute for Quantitative Social Sciences.
Dr. Altman conducts research in social science, information science and research methods -- focusing on the intersections of information, technology, privacy, and politics; and on the dissemination, preservation, reliability and governance of scientific knowledge.
This is a presentation I gave at the Library of Congress as part of a NFAIS/FLICC/CENDI meeting as outlined here: http://www.chemspider.com/blog/making-the-web-work-for-science-presentation-at-the-library-of-congress.html
The presentation provides an overview of some of the challenges publishers face moving forward, how they are responding to them, how InChI is an enabling technology, and why quality is important.
Copyright is one of the greatest barriers to Open Data. This presentation for insidegovernment UK shows the struggle between those who want to reform copyright and those opposed to reform.
5 steps to using open access in the classroom, 11 9 2011, Elizabeth Brown
The document discusses open educational resources and open content. It begins by outlining limitations to open content and then provides a five step process for creating open content: 1) identify open content, 2) assess the value of information, 3) create open content, 4) share open content with peers, and 5) preserve open content. It then discusses various tools and platforms for creating, sharing, and preserving open content. The document concludes by emphasizing that creating open content is an iterative process and provides additional advice.
Keynote talk presented at the WebScience 2020 conference. Looks at the roots of the Web and Web Science, and explores two possible futures and what web scientists and others can do about them. Even starts with a quote from Charles Dickens.
Research Data and Scholarly Communication (with notes), by Dorothea Salo
In this presentation, Dorothea Salo discusses how the emphasis on research data is changing scholarly communication. When people think of scholarly communication, they typically think of books, journals, and published literature. However, as more research data is collected and stored digitally, data is becoming a first-class citizen in scholarly communication, on par with traditional published literature. The merging of data and traditional scholarly communication is analogous to the "you got your peanut butter in my chocolate" commercials, with data and scholarly communication intermingling and enhancing each other.
Informatics Transform: Re-engineering Libraries for the Data Decade, by Liz Lyon
Libraries need to re-engineer to support the data decade by providing research data management services and developing data informatics capacity. This includes offering data management plans, metadata support, data storage, and tools for data tracking and citation. Libraries also need to work with researchers and partners to understand data requirements, provide advocacy and training, and help acquire skills in areas like data preservation, analysis, and visualization. As data becomes more important, libraries are on a journey to develop these research data management capabilities.
Bill Howe discussed emerging topics in responsible data science for the next decade. He described how the field will focus more on what should be done with data rather than just what can be done. Specifically, he talked about incorporating societal constraints like fairness, transparency and ethics into algorithmic decision making. He provided examples of unfair outcomes from existing algorithms and discussed approaches to measure and achieve fairness. Finally, he discussed the need for reproducibility in science and potential techniques for more automatic scientific claim checking and deep data curation.
The document discusses building an open knowledge base called WikiFactMine that is controlled by researchers rather than large corporations. It notes that scholarly publishing produces a large amount of "Big Data" each year, most of which is not publicly readable or fully utilized. Content mining can help liberate this knowledge by extracting facts from millions of articles each week. However, large publishers often oppose content mining. The document advocates for supporting content mining and researchers to address issues like climate change and improve science. It notes that content mining software can extract information much faster than manual review.
DataCite and Campus Data Services
Paul Bracke, Associate Dean for Digital Programs and Information Services, Purdue University
Research libraries are increasingly interested in developing data services for their campuses. There are many perspectives, however, on how to develop services that are responsive to the many needs of scientists; sensitive to the concerns of scientists who are not always accustomed to sharing their data; and that are attractive to campus administrators. This presentation will discuss the development of campus-based data services programs, the centrality of data citation to these efforts, and the ways in which engagement with DataCite can enhance local programs.
Social Machines of Scholarly Collaboration, by David De Roure
The document discusses shifts in scholarly communication and the potential for "social machines" to address issues with the current system. It notes the end of traditional scholarly articles due to limitations of containers and reconstruction. Future systems could involve computationally-enabled networks of expertise, data, and narratives among humans and machines. Well-designed social machines may help address challenges around reproducibility, reuse and innovation in research.
Digital Identity is fundamental to collaboration in bioinformatics research and development because it enables attribution, contribution, and publication to be recorded and quantified.
However, current models of identity are often obsolete and have problems capturing both small contributions ("microattribution") and large contributions ("mega-attribution") in science. Without adequate identity mechanisms, the incentive for collaboration is reduced and the utility of collaborative social tools is hindered.
Using examples of metabolic pathway analysis with the Taverna workbench and myexperiment.org, this talk illustrates problems and solutions for identifying scientists accurately and effectively in collaborative bioinformatics networks on the Web.
The document discusses the need for a common researcher identification system and the genesis of ORCID (Open Researcher and Contributor ID). Key points:
1) A 2009 summit with 21 organizations discussed challenges of identifying researchers across disciplines due to common names. This led to the idea of ORCID, an independent non-profit for a shared researcher registry.
2) Early support from publishers, libraries, and research organizations was seen as critical for adoption.
3) The vision was a system-wide standard to facilitate identification, collaboration and validation for the global research community.
4) Next steps involved formalizing the ORCID organization and exploring technology and business models to make the registry sustainable.
CSHL Press publishes scientific journals and books. They started a blog called Bench Marks to promote their new journal CSH Protocols and put a human face on their organization. However, blogs require a significant time commitment and it's unclear who actually reads science blogs. While blogs can promote content and discussion, online communities are dominated by a small percentage of highly engaged users and most readers are passive. It's important for bloggers to understand their audience and goals for blogging.
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear..., by Guus van den Brekel
Presented June 30th, 2009, in Toulouse at the LIBER Conference 2009
http://liber2009.biu-toulouse.fr/
Research Libraries & Web 2.0. Scientists engage in science and research 2.0; libraries should follow: reach out, engage, explore, and facilitate.
Global Perspective on Open Research: A Bird's Eye View, by Leslie Chan
Presentation at the University of Cape Town, Aug. 5, 2011. This talk was part of the OpenUCT initiative and the Scholarly Communication in Africa Programme. It provides an overview of the changing research landscape and the particular importance of open access and other forms of open collaboration for solving some of the pressing problems of development research. The presentation argues for the importance of policy development in support of research collaboration and the development of enriched metrics for evaluating the development impact of research.
The wider environment of open scholarship – Jisc and CNI conference 10 July ... (Jisc)
1. The document discusses shifts in scholarship towards more open and collaborative models enabled by digital technologies, including the end of traditional scholarly articles and emergence of "social machines" involving both humans and machines.
2. It proposes a new model of scholarly communication called "social objects" that are part of a computational network of expertise, data, and narratives maintained by both humans and machines.
3. Key aspects of this new model include research objects that encode the full scholarly process and outputs, and social machines that empower researchers through collaborative and automated curation of the scholarly record.
myExperiment and the Rise of Social Machines – David De Roure
Talk at hubbub 2012, Indianapolis, 25 September 2012. The talk introduces myExperiment and Wf4Ever, discusses the future of research communication including FORCE11, and introduces the SOCIAM project (Theory and Practice of Social Machines) which launches in October 2012.
Published on Jan 29, 2016 by PMR
Keynote talk to LEARN (LERU/H2020 project) on research data management. Emphasizes that the problems are cultural, not technical. Promotes modern approaches such as Git / continuous integration, announces DAT. Asserts that the Right to Read is the Right to Mine. Calls for widespread development of content mining (TDM).
The Culture of Research Data, by Peter Murray-Rust – LEARN Project
1st LEARN Workshop. Embedding Research Data as part of the research cycle. 29 Jan 2016. Presentation by Peter Murray-Rust, ContentMine.org and University of Cambridge
This document discusses open data and open science. It highlights Jean-Claude Bradley as a pioneer of open notebook science and open data who believed closed data means people die. It describes tools like ContentMine that can automatically extract data like chemical reactions, phylogenetic trees and clinical trial results from papers. Visitors can extract specific types of data while repositories can solve problems communally with continuous publication and validation.
Open Knowledge and University of Cambridge European Bioinformatics Institute – TheContentMine
Open Research Practices in the Age of a Papermill Pandemic – Dorothy Bishop
Talk given to Open Research Group, Maynooth University, October 2022.
Describes the phenomenon of large-scale fraudulent science publishing (papermills), and discusses how open science practices can help tackle this.
Strategic scenarios in digital content and digital business – Marco Brambilla
This document provides an overview of strategic scenarios in digital contents. It discusses the evolution from static to dynamic contents, from fixed to mobile, and from local to global. It also covers the rise of Web 2.0, including the growth of user-generated content, tagging, blogs, wikis, podcasts and other social media tools. Finally, it discusses some tools that enable collaboration and information sharing, such as WebEx, and the trend toward mashups that combine multiple web services.
The document discusses technology-mediated social participation and outlines the goals and challenges of the Summer Social Webshop. It summarizes that the Webshop aims to (1) clarify national priorities, (2) develop research questions around social participation, and (3) promote novel research methodologies to influence national policy and increase educational opportunities. It also notes key challenges include malicious attacks, privacy violations, lack of trust, and failure to be universally accessible.
BioMed Central is a large open access publisher that is committed to open data initiatives. They have implemented several solutions to promote open data practices, including data journals, an open data award, and enabling data citation. They also work to integrate data hosting and deposition, address data licensing issues, and provide guidance on best practices. Future goals include adding more value to text and data mining applications and building business models around open data.
The document discusses the data era of massive information production and challenges of extracting knowledge from data. It describes the growth of digital data and potential economic value of big data. Both syntactic approaches like visualizations and semantic approaches using structured data are needed to help humans and machines understand and make use of large amounts of data. Linked open data and open government data initiatives are helping to make large data sources structured and interconnected on the web.
Linked Open Data in Libraries, Archives & Museums – Jon Voss
This document provides an overview of Linked Open Data for libraries, archives, and museums. It discusses the growing movement of LODLAM and how it allows these cultural institutions to represent their data as graphs using triples that describe entities in a machine-readable format. Key concepts covered include the use of URIs, RDF, vocabularies, and different legal tools for publishing open data.
Scott Edmunds slides for class 8 from the HKU Data Curation (module MLIM7350 from the Faculty of Education) course covering science data, medical data and ethics, and the FAIR data principles.
Similar to A Cabinet Of Web2.0 Scientific Curiosities (20)
Mendeley uses activity data to determine how researchers use and share documents after accessing them. It provides a PDF and reference management tool that allows users to organize, annotate, and collaborate on research papers. Mendeley aggregates usage data on over 28 million research papers in the cloud to determine reading trends and recommend related papers to users. It aims to make science more collaborative and transparent.
Unveiling the web, making the implicit explicit – Ian Mulvany
This talk was given on the 9th of August 2010 at the American Phytopathological Society's annual conference in Charlotte, North Carolina.
I talk about how the commoditisation of emerging tools on the web, such as the semantic web and scalable architectures, may have an effect on the communication and practice of science.
Presentation on Mendeley given at the JISC-sponsored TELSTAR reading list event, Cambridge, 2010. This talk details the Mendeley client and points out some interesting API methods.
Brief 5-minute presentation given to school students coming in to see how technology is used in industry. I'm just posting these slides here so they can grab a copy.
Short presentation from a working group at the 2008 social web communities workshop held in September 2008 at the Dagstuhl in Saarbrucken. The presentation discusses the social aspects of the kinds of tools that could be built once a connected web of data was easily mined.
This document contrasts the evolution of the internet and web technologies from early implementations to modern machine-learning-powered versions. It discusses the shift from static web pages and centralized content to user-generated content, social networking, tagging, and semantic linking of data. Examples provided track this progression from early platforms like Ofoto and Britannica Online to modern equivalents like Flickr, Wikipedia, blogging, and semantic analysis tools that integrate decentralized online information.
Web 2.0 is not only about making sites easier for people to interact with; it is also about creating webs of data that machines can interact with. These slides look at a few examples of technologies that can help weave the data web, and show some example applications, with a focus on science.
This is an edited version of a talk that I gave on the 11th of February to some PhD students from the University of Utrecht at a seminar on science and communication.
Digital Library Federation, Fall 07, Connotea Presentation – Ian Mulvany
The document provides information about Connotea, a social bookmarking site focused on academic bookmarking and citations. It discusses features of Connotea like private groups, importing/exporting for writing, and citation data. It also includes URLs and usage statistics for Connotea, as well as potential future developments like better tagging, recommendations, and integrating with bibliographies.
A Cabinet Of Web2.0 Scientific Curiosities
1. A Cabinet of Web 2.0 Scientific Curiosities
Ian Mulvany, Product Development Manager, Nature Publishing Group
This talk takes a tour through science-related web 2.0 efforts and discusses areas of the practice of science that can be impacted through web 2.0 approaches.
A video of this presentation will be posted at http://videolectures.net/
2. Some of the people involved
• Timo Hannay - Director Nature.com
• Jason Wilde - Publisher Physical Sciences
• Amanda Ward - Head of Platform Technologies
• Tony Hammond - Applications Architect
• Alf Eaton - Product Development Manager
• Euan Adie - Product Development Manager
• Gavin Bell - Product Development Manager
• Hilary Spencer - Product Development Manager
• Ian Mulvany - Product Development Manager
3. • Publishing Industry Facts & Figures
• Nature
• (Some) Issues that Web 2.0 can impact
• Identity and Authority
• Content Discovery
• Citizen Science
• Google Wave
• Ongoing Challenge
• The Future
6. Costs of research (Source: Research Information Network)
A significant contribution to the total cost of research is the time required for researchers to find the appropriate material to read. There is an opportunity here to decrease such costs by creating better tools for information discovery.
source: http://www.rin.ac.uk/
8. • "It is intended, first, to place before the general public the grand results of scientific work and scientific discovery"
• "to aid scientific men ... by affording them an opportunity of discussing the various scientific questions that arise from time to time" – Norman Lockyer
Nature is principally a scientific communication company. We have to engage with the methods of communication that are important for science. If we started today our starting point would naturally be the web, and not a print journal.
9. (Some) Publishing Milestones
• 1896, Wilhelm Röntgen, X-Rays
• 1925, Raymond Dart, Australopithecus africanus
• 1938, P Kapitza, Superfluidity
• 1953, J D Watson and F H C Crick, DNA
• 1985, J C Farman, B G Gardiner and J D Shanklin, Ozone Hole
• 1995, Michel Mayor and Didier Queloz, Extra Solar Planets
• 2001, Human Genome
10. Journal Evolution
• 1869 Journal Founded
• 1899 Journal Makes a Profit
• 1967 Peer Review
• 1971 First Expansion (until 1974)
• 1992 Nature Genetics
• 1995 Holtzbrinck Ownership
• 1995 Nature.com
• 2004 Connotea
• 2007 Nature Network
Peer review was only introduced in 1967, in order to deal with a backlog of about 3,000 manuscripts.
11. Our current list of publications:
http://www.nature.com/siteindex/index.html
13. 2.0
Web 2.0 is about getting and using data. There are two aspects: one is lowering the barrier to participation; the second is mining the resulting information in order to provide better services or tools. This can also lead to a strong first-mover advantage: as the network of data or participation gets bigger, the value in the network gets bigger.
14. Web 1.0 → Web 2.0
DoubleClick → Google AdSense
Ofoto → Flickr
Akamai → BitTorrent
mp3.com → Napster
Britannica Online → Wikipedia
personal websites → blogging
evite → upcoming.org and EVDB
domain name speculation → search engine optimization
page views → cost per click
screen scraping → web services
publishing → participation
CMS → wikis
directories (taxonomy) → tagging (folksonomy)
stickiness → syndication
19. image credit: sam brown, explodingdog
We should be aware not to focus on just the technology.
Building for Machines:
• semantic markup
• well-documented APIs
Building for Humans:
• reduce the barrier to participation
• increase the usefulness of serendipity and recommendation
20. Stay Classy, SXSW: Building Respectful Software
make your software respectful
http://panelpicker.sxsw.com/ideas/view/3691?return=%2Fideas%2Findex%2Finteractive%2Fq%3Abuilding+respectful
21. “While scientists have gloried in the disruptive effect that the Web is having on publishers and libraries, with many fields strongly pushing open publication models, we are much more resistant to letting it be a disruptive force in the practice of our disciplines.” – Jim Hendler
Scientists resist
Although the idea of a data-driven approach should have an appeal to scientists, science changes slowly. There are a lot of implicit norms that are hard to change.
22. NIH requests that all fundholders deposit their manuscripts in the PubMed Central archive: 4% compliance.
Nature offers to upload to PubMed Central on behalf of authors, with their permission: 30% compliance.
70% of scientists can't even be bothered to say "yes".
Scientists resist
An example of low participation in open data models is the low uptake of deposition of articles into PubMed.
23. Some Issues Where Web 2.0 May Help in Science
• Identity and Reputation
• Content Discovery
• Citizen Science
24. [Diagram: two axes, Humans vs Machines and Public vs Academic]
This is the framework that I'm going to be using to think about the topics in this talk. These are just two dimensions against which one can look at things; there are many other ways of looking at these issues. When putting together these slides I got interested in the tension between machine-oriented efforts and human-oriented efforts on the web. In addition, web 2.0 can have a big impact on public engagement with science, so I wanted to see if I could line up these two trends together.
26. Identity on the web is a fractured thing. It makes it difficult to manage all of the accounts that a person has, but on the other hand it makes it easy to present different personas to different online communities.
27. 100,000
Identity is a significant and growing issue in science. Each year India produces 100,000 postdocs. Full names are often not revealed owing to caste discrimination.
http://www.nature.com/nature/journal/v452/n7187/full/452530d.html
28. 1.1 Billion > 129
photo: Szymon Kochanski
129 surnames are shared by 1.1 billion people, 85% of the Chinese population. Generally, identity is a self-enforcing protocol. It works most of the time, but... Surgeon Liu Hui padded his CV with publications by another researcher who shared his surname and initial, and rose to become an assistant dean at Tsinghua University. Discrepancies were noticed and he was dismissed by the university in March 2006.
29. http://www.mluvany.net
Scopus Author ID 6603325879
Thomson ResearcherID B-2805-2008
CrossRef Contributor ID 62.1000/182
These are currently the most commonly discussed options for managing identity within an academic context; each has pros and cons, and none has gained enough momentum to be universally adopted. Nature is currently taking a wait-and-see approach, but we would like to see an open system gain adoption.
30. Why is the issue of identity important? For reputation!
31. 1619–1677
Henry Oldenburg, first secretary of the Royal Society, invented the practice of peer review with the Philosophical Transactions of the Royal Society. His own reputation suffered: he was jailed as a suspected Dutch spy and held in the Tower of London for a while.
32. Impact Factor™
IF(year) = A/B
A = # of citations in (year) to articles published in (year − 1) and (year − 2)
B = # of articles published in (year − 1) and (year − 2)
33. The impact factor measures an average statistic of a single journal. 80% of citations into a journal come from 20% of its articles. There is general agreement that the IF is a poor measure of individual article quality.
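The standard impact factor calculation (citations received in a year to items from the previous two years, divided by the number of those items) is easy to sketch in code. All the counts below are invented purely for illustration:

```python
def impact_factor(year, articles, citations):
    """Journal impact factor for `year`.

    articles:  dict mapping publication year -> number of articles published
    citations: dict mapping (citing_year, cited_year) -> citation count
    """
    # Citations received in `year` to items from the previous two years
    cites = citations.get((year, year - 1), 0) + citations.get((year, year - 2), 0)
    # Citable items published in the previous two years
    items = articles.get(year - 1, 0) + articles.get(year - 2, 0)
    return cites / items if items else 0.0

# Hypothetical counts, for illustration only
articles = {2007: 120, 2008: 130}
citations = {(2009, 2007): 3600, (2009, 2008): 3650}
print(impact_factor(2009, articles, citations))  # 7250 / 250 = 29.0
```

Note that this is a journal-level average, which is exactly why the 80/20 skew described above makes it a poor proxy for any individual article.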
35. doi:10.1371/journal.pone.0004803.g007
Other metrics can also reveal the connections between the sciences. Bollen et al. used website access data from publishers' HTTP logs to look at how people browsed the literature. This gave a more rounded picture than just looking at citations.
36. There is a move now to look at article-level metrics rather than journal-level metrics.
37.
38. [Graph: citations plotted against time]
One thing that fascinates me about citations is that they are unidirectional. Also, there must be more citations than papers, and yet 85% of papers receive at most one citation.
39. [Graph: ideas plotted against time]
Citations can be used to study the flow of ideas forward in time.
40. Main-path analysis and path-dependent transitions in HistCite™-based historiograms
Diana Lucio-Arias & Loet Leydesdorff, Journal of the American Society for Information Science and Technology (forthcoming). Amsterdam School of Communications Research (ASCoR), University of Amsterdam, Kloveniersburgwal 48, 1012 CX Amsterdam, The Netherlands.
This is the Main-Path Analysis technique, but as yet such analysis tends to be done on a case-by-case basis.
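One common way to operationalise main-path analysis is with search path counts (SPC): weight each citation link by the number of source-to-sink paths that traverse it, then follow the heaviest links through the literature. A minimal sketch on a toy citation graph (all paper names are invented):

```python
from collections import defaultdict
from functools import lru_cache

# Toy citation DAG: an edge u -> v means paper v cites paper u,
# so ideas flow forward in time from u to v. Names are invented.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]

succ, pred = defaultdict(list), defaultdict(list)
nodes = set()
for u, v in edges:
    succ[u].append(v)
    pred[v].append(u)
    nodes.update((u, v))

sources = {n for n in nodes if not pred[n]}   # cited but cite nothing earlier
sinks = {n for n in nodes if not succ[n]}     # newest papers, not yet cited

@lru_cache(maxsize=None)
def paths_from(n):
    """Number of distinct paths from n forward to any sink."""
    return 1 if n in sinks else sum(paths_from(m) for m in succ[n])

@lru_cache(maxsize=None)
def paths_to(n):
    """Number of distinct paths from any source forward to n."""
    return 1 if n in sources else sum(paths_to(m) for m in pred[n])

# SPC weight of each citation link: paths through that link
spc = {(u, v): paths_to(u) * paths_from(v) for u, v in edges}
print(spc)  # the D -> E link lies on every source-to-sink path
```

The main path is then read off by starting at the heaviest source edge and greedily following the highest-weight successor at each step; published analyses typically do this with dedicated tools such as HistCite or Pajek rather than ad hoc scripts.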
41. Cox, D.R. (1972) Regression models and life-tables. J. Roy. Statist. Soc. B 34.
Some papers act as a kind of black hole for citations: they get into the literature and get cited and cited and cited. This paper has over 21,000 citations.
The mis-citations to this paper have an h-index of 12, a level that Hirsch had concluded "…might be a typical value for advancement to tenure…"
http://network.nature.com/people/boboh/blog/2008/06/24/outdone-by-mis-prints
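The h-index mentioned here has a simple definition: the largest h such that h of a researcher's papers have at least h citations each. A minimal sketch:

```python
def h_index(citation_counts):
    """Largest h such that h papers have >= h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank  # the paper at this rank still clears the bar
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # -> 4
print(h_index([25, 8, 5, 3, 3]))  # -> 3: one huge paper doesn't move h much
```

The second example shows why a set of mis-citations reaching h = 12 is striking: h grows only when many distinct items each accumulate citations.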
43. [Chart placing formats on two axes: how easy they are to contribute to (plain text, emails, Twitter, hyperlinks, views, tags, citations?, microformats) versus how easy they are to mine (academic papers at the hard end, the Semantic Web at the easy end)]
PDF sucks: academic papers are hard to create, and PDF is hard to extract any useful information from in a programmatic way.
44. [Diagram: the Humans/Machines vs Public/Academic framework, with article writing, peer review, author identification and article publishing placed in the human, academic quadrant]
This is where most of the academic publishing workflow currently lives; it is manual work that can only be done by highly trained experts.
45. XML
At Nature we are consolidating all of our article content into a single XML database.
46. Building a delivery infrastructure
http://www.flickr.com/photos/zhzheka/
We then deliver this content via print, RSS, paper, and search queries to a host of endpoints.
47.
48. XML
Blue - Done
Green - Done within the last year
Yellow - Coming to completion
Red - Deprecated
49.
50. Extensible Containers
http://www.flickr.com/photos/cherieking/
We want to be able to extend the data that we deliver.
51. XML + Medline + MeSH
We pull in MeSH terms for our articles from Medline post-publication.
52. Case Study: Nature Chemistry
We have started extracting entities from our Nature Chemistry journal, and
we hope to roll this program out to other journals.
53. [Structure diagram of serotonin]
Serotonin
CAS – 50-67-9
SMILES – Oc1cc2c(cc1)ncc2CCN
InChI – 1S/C10H12N2O/c11-4-3-7-6-12-10-2-1-8(13)5-9(7)10/h1-2,5-6,12-13H,3-4,11H2
InChIKey – QZAYGJVTTNCVMB-UHFFFAOYSA-N
Chemistry is a visual science: molecules are drawn.
CAS numbers first appeared in 1907, are owned by the ACS, and contain no semantics.
SMILES (1987) strings are not unique to a compound.
InChI/InChIKey arrived in 2000/2005.
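Part of what makes the InChIKey attractive for linking is its fixed shape: fourteen characters, a hyphen, ten characters, a hyphen, one character. That makes it trivial to validate and to use as a database key, unlike a CAS number or a SMILES string. A minimal sketch:

```python
import re

# InChIKey layout: 14-char skeleton hash, 10-char block (includes version
# and protonation flags), and a final check-style character.
INCHIKEY_RE = re.compile(r'^[A-Z]{14}-[A-Z]{10}-[A-Z]$')

def looks_like_inchikey(s):
    """True if the string has the fixed 14-10-1 InChIKey shape."""
    return bool(INCHIKEY_RE.match(s))

print(looks_like_inchikey("QZAYGJVTTNCVMB-UHFFFAOYSA-N"))  # serotonin's key
```

A CAS number like `50-67-9` fails this check immediately, which is the point: the InChIKey wears its machine-readability on its sleeve.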
54. [Diagram: compound data assets — author files and CDX drawings feeding
GIF/PNG renderings, 3D structures, and compound data.]
55. Enhanced compound pages offer:
Chemdraw file
CML file
View structure in 3D
Synonyms
Chemical formula
Molecular Weight
Elemental Analysis
InChI and InChIKey
SMILES string
Links to external databases
56. PubChem
InChI
ChemSpider
We can start to link from articles into databases, and vice versa.
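A sketch of what that article-to-database linking looks like in practice: given an InChIKey extracted from an article, construct lookup URLs into the external databases. The PubChem PUG REST path below is the public pattern at the time of writing; treat it as illustrative rather than as NPG's actual integration.

```python
# Template for PubChem's PUG REST interface: resolve an InChIKey to
# compound IDs (CIDs) as JSON.
PUBCHEM = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/inchikey/{key}/cids/JSON"

def pubchem_lookup_url(inchikey):
    """Build a PubChem lookup URL for a compound identified by InChIKey."""
    return PUBCHEM.format(key=inchikey)

print(pubchem_lookup_url("QZAYGJVTTNCVMB-UHFFFAOYSA-N"))
```

The same pattern works in reverse: a database record that stores the article DOI can link back to the paper, which is the "vice versa" of the slide.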
57. [Schematic: article XML and TXT feed an xpath/UIMA extraction pipeline,
linking out to PubChem, ChemSpider, and Medline + MeSH.]
Schematic of our current entity extraction workflow.
Initially we are extracting chemical and compound names from Nature Chemistry articles.
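The real pipeline uses UIMA, which is not reproduced here; the following is only a toy sketch of the extraction step's shape: query article XML with an XPath-style expression, then scan the text against a dictionary of compound names. Both the sample XML and the compound list are invented.

```python
import xml.etree.ElementTree as ET

article_xml = """<article>
  <body>
    <p>Treatment with serotonin raised levels of dopamine.</p>
  </body>
</article>"""

# Toy gazetteer; a real pipeline would use a large curated dictionary
# plus statistical recognisers.
COMPOUNDS = {"serotonin", "dopamine", "caffeine"}

def extract_compounds(xml_text):
    """Return compound names found in the <p> elements of article XML."""
    root = ET.fromstring(xml_text)
    found = []
    for p in root.findall(".//p"):  # ElementTree's limited XPath subset
        for word in (p.text or "").replace(".", "").split():
            if word.lower() in COMPOUNDS:
                found.append(word.lower())
    return found

print(extract_compounds(article_xml))
```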
58. We have a bespoke interface that allows editorial curation of the
annotations.
59. <dl class="meta">
<dt>InChI</dt>
<dd class="inchi">InChI=1/
C10H14N5O7P.2Na/c11-8-5-9
(13-2-12-8)15(3-14-5)</dd>
</dl>
Marking up the bold compound numbers makes the online
version of the paper more semantic.
60. Organise metadata: create good architecture so
generated data can be easily reused across a
range of applications.
http://www.flickr.com/photos/timecollapse/
We hope to be able to extend the types of entities that
we are extracting from our articles.
61. Expanding the annotation of journal
articles from Nature Chemistry to
Nature Chemical Biology and then to
all NPG journals
Creating a central NPG database of
compounds and related journal
articles
62. [Structure diagram of the copper complex]
InChI=1S/C32H16N8.Cu/c1-2-10-18-17(9-1)25-33-26(18)38-28-21-13-5-6-14-22(21)30(35-28)40-32-24-16-8-7-15-23(24)31(36-32)39-29-20-12-4-3-11-19(20)27(34-29)37-25;/h1-16H;
This then makes the article a more integrated object, with
links to databases, entities and the products of scientific research.
63. There are many curated databases that look for information about
domain-specific results in the literature. An example is FlyBase, which
collects information about results using the model organism Drosophila.
64. WormBase does the same for C. elegans.
Both require a large amount of human curation. Having the body of scientific
literature semantically annotated should help with this kind of curation.
65. Sites such as ChemSpider and Crystal Eye demonstrate what can be done
through data mining the literature.
66. So we have moved into a situation in which our scholarly network
can now connect to entity databases, rather than just to articles.
67. [Diagram: the publishing workflow — Article Writing, Peer Review,
Author Identification, Article Publishing — now joined by Entity Extraction,
linking the Public/Academic human layer to Machines.]
Article publishing hopefully becomes enriched through semantic markup and
entity extraction.
68. Getting Social
photo credit: flickr mcgeez
We can go beyond published articles and entities and look at
both other published artefacts and the social annotation that
is associated with them.
69. The amount of grey literature available in physics has grown
steadily, as shown by submissions to the physics arXiv.
70. Nature Precedings was the first preprint server for the life sciences.
It also includes the ability to vote and comment on submissions and
provides each submission with a unique identifier.
71. PLoS have launched PLoS Currents: Influenza, built on top of Google Knol.
Both Precedings and Currents have editorial curation of content, and allow
easy publication of objects such as posters, proceedings papers, and white papers.
75. The kind of information that we can capture with Connotea includes:
full citation information;
usage patterns (when did an item get added to our DB, how many times has it been added);
extra metadata such as tags;
and potentially social network information: how many of my friends have added this item?
76. [Chart: total number of tags and total number of unique tags over time.]
Growth in usage of the service has been steady.
77. And it displays the characteristic power law behaviour of an online network.
83. Example API calls:
http://www.connotea.org/data/user/IanMulvany
http://www.connotea.org/data/users/tag/scifoo
http://www.connotea.org/data/user/IanMulvany/tag/scifoo
http://www.connotea.org/data/user/IanMulvany/tag/science
http://www.connotea.org/data/user/IanMulvany/tag/science2.0+citation
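The URL patterns above are simple enough to generate programmatically. A small helper for the per-user forms (Connotea is now defunct, so this is only a reconstruction of the scheme shown on the slide):

```python
BASE = "http://www.connotea.org/data"

def connotea_url(user=None, tags=()):
    """Rebuild a Connotea data URL for a user and/or an intersection of tags,
    following the path scheme shown on the slide."""
    parts = [BASE]
    if user:
        parts += ["user", user]
    if tags:
        parts += ["tag", "+".join(tags)]  # '+' intersects tags
    return "/".join(parts)

print(connotea_url(user="IanMulvany", tags=["scifoo"]))
```

This guessability is what made the API pleasant: the human-readable page and the machine-readable data differed only by a path prefix.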
84. There are plenty of other such services currently available.
Interestingly, Fuzzy has the most semantically enabled technology, but is one of the least used.
85. A few start-ups are redefining the academic paper management
space; Papers is a Mac-based “iTunes” for PDFs.
86. Mendeley provides the same kind of features, with a Last.fm-style metadata scrobbling model.
87. This allows one to see data on what is being read in Mendeley libraries.
This starts to open up a new layer of information about the impact of papers
that goes beyond what can be captured by the impact factor.
88. Nature Network
Online social communities also allow us to begin to capture conversations about science.
NPG launched Nature Network, and it is now one of the most active online forums for
the discussion of science.
89. It has specific features to allow members to track the conversations that they
have participated in.
90. There are three main local hubs, but we track the geographic location of members
and try to connect people with other members in their neighbourhood.
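The talk does not say how that matching works; as a sketch of the obvious approach, one can compute great-circle (haversine) distances between member coordinates and return the closest members. All names and coordinates below are invented.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))  # 6371 km: mean Earth radius

# Invented members: alice in London, bob near Cambridge, carol in New York.
members = {"alice": (51.50, -0.12), "bob": (52.20, 0.12), "carol": (40.71, -74.00)}

def nearest(me, k=2):
    """Names of the k members closest to member `me`."""
    others = [(haversine_km(members[me], loc), name)
              for name, loc in members.items() if name != me]
    return [name for _, name in sorted(others)[:k]]

print(nearest("alice"))
```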
91. Bringing things together
photo: flickr Thomas Hawk
Q: How do you manage all of these streams of information?
A: Aggregation is one answer (probably not the only answer).
93. Nature blogs finds blog posts that discuss scientific articles.
Science Blogs and researchblogging.org do much the same.
94. Scintilla is another Nature product that creates recommendations based
on a user's reading habits.
95. FriendFeed aggregates discussions around resources from different sources.
It has seen widespread adoption by the scientific digerati; The Life Scientists
room is one of the most active.
96. People are using these rooms to have real-time conversations around real-time
events. This broadcasts an event, and the conversations around it, to the
web. It enables real-time participation at a distance.
97. streamosphere.nature.com/preview.php is an aggregator for
discussions on Twitter, FriendFeed, and some other lightweight user signals.
It again aggregates over a curated list of sources.
98. So now we can see a world in which the article is no longer the
only digital artefact of note. Much more of the process of science
is becoming visible through online engagement of scientists.
99. [Diagram: the workflow — Article Writing, Peer Review, Author
Identification, Article Publishing — now also fed by Science
Blogging/Tweeting/Social Communities, linked to Machines via SIOC and
Entity Extraction.]
Social media as it exists now is problematic:
- effervescent
- closed
- siloed
- unstructured
Tools like SIOC, an ontology for social media, can help expose this layer
of information to machines.
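To show what "exposing this layer to machines" looks like, here is a hand-rolled sketch that emits RDF N-Triples describing one forum post. The `sioc:` and `dcterms:` property names are real SIOC/Dublin Core terms; the post and author URIs are invented, and a real system would use an RDF library rather than string templates.

```python
SIOC = "http://rdfs.org/sioc/ns#"
DCT = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

def post_triples(post_uri, author_uri, content, created):
    """Describe one social-media post as SIOC N-Triples."""
    return [
        f"<{post_uri}> <{RDF}type> <{SIOC}Post> .",
        f"<{post_uri}> <{SIOC}has_creator> <{author_uri}> .",
        f'<{post_uri}> <{SIOC}content> "{content}" .',
        f'<{post_uri}> <{DCT}created> "{created}" .',
    ]

for t in post_triples("http://example.org/post/1",
                      "http://example.org/user/42",
                      "Interesting paper!", "2009-07-01"):
    print(t)
```

Once posts from different silos share this vocabulary, a machine can follow conversations across services, which is exactly what the closed, unstructured status quo prevents.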
101. Seti@home
Folding@home
“Thinking@home”
One kind of participatory science is getting users to donate their hardware.
102. 10 000 Sheep, Aaron Koblin, 2006
You can also build interfaces to people, e.g. the Mechanical Turk.
The Sheep Market was created by Aaron Koblin in 2006 by getting
10 000 Turkers to draw sheep.
104. http://blog.doloreslabs.com/2009/05/the-programming-language-with-the-happiest-users/
Two people checking a subset of tweets can data mine Twitter for you.
We used crowdsourcing to analyse all of the comments on PLoS articles.
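The usual trick in this kind of crowdsourced analysis is redundancy: several workers label each item, and the majority answer wins. A minimal sketch with invented labels (the actual categories used for the PLoS comment analysis are not given in the talk):

```python
from collections import Counter

# Invented worker labels: three workers per comment.
labels = {
    "comment-1": ["substantive", "substantive", "noise"],
    "comment-2": ["noise", "noise", "noise"],
}

def majority(votes):
    """Return the most common label among a list of worker votes."""
    return Counter(votes).most_common(1)[0][0]

results = {item: majority(votes) for item, votes in labels.items()}
print(results)
```

Majority voting is the simplest aggregation rule; real deployments often weight workers by their agreement with known "gold" answers.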
105. But another, more interesting, version is to get people to interact directly with your data!
• Stardust@home – http://stardustathome.ssl.berkeley.edu/about.php
• Folding@home – http://folding.stanford.edu/
• Foldit – http://fold.it/portal/
• Citizen science blog – http://citizensci.com/
• Great Backyard Bird Count – http://www.birdsource.org/gbbc/
106. You need to make it engaging, like the Foldit project or Galaxy Zoo.
Even if machines and machine learning could answer some of these questions
(like image analysis of galaxy rotation), humans can do it now. You get the scientific
benefit now, you engage the public with science now.
107. [Diagram: the workflow again — Foldit, Stardust@home, Galaxy Zoo,
Peer to Patent, and the Mechanical Turk now on the Public/Academic human
side; SIOC, RDF, Entity Extraction, SETI@home, and Folding@home on the
Machines side.]
Now we have an interesting picture, but most of the arrows in this picture
point down. Where are the efforts to make computers more friendly to people?
One pointer to how that will happen in the future is Google Wave.
108. Google Wave
photo credit: flickr prgibbs
New product from Google, launching in September 09
For the definitive guide to google wave look at:
http://www.youtube.com/watch?v=v_UyVmITiYQ
110. [Diagram: Wave extension points — Robots (App Engine), Gadgets (HTML5),
and Embed containers (e.g. Blogger).]
Of interest for developers are the APIs that Wave exposes.
Naively, one can think of Robots as allowing two-way communication with
a wave, Gadgets as pulling content into a wave, and the Embed gadget
as a tool for pushing waves into other contexts, such as blogs or wikis.
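The Wave robot API itself is long gone, so the following is only a sketch of the interaction model being described, not of the real API: a robot receives events from a wave and reacts by appending content. Every class and method name here is invented.

```python
class Wave:
    """Stand-in for a wave: an ordered list of blips (messages)."""

    def __init__(self):
        self.blips = []

    def append_blip(self, text):
        self.blips.append(text)

class CitationRobot:
    """Hypothetical robot that reacts when a submitted blip mentions a DOI."""

    def on_blip_submitted(self, wave, text):
        if "10." in text:  # crude DOI sniff, for illustration only
            wave.append_blip(f"Looking up citation for: {text}")

wave = Wave()
robot = CitationRobot()
robot.on_blip_submitted(wave, "See 10.1038/nature06930")
print(wave.blips)
```

The point of the model is that the user never sees the third-party API: they type into the wave, and enriched content simply appears.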
111. Importantly, Google intends to open-source the server code, allowing
anyone to run a wave server, much as anyone can run an email server.
112. Email thread? Document? Game server? IM? Gallery? Group?
The metaphors for what Wave is have not settled down yet.
This is a consequence of the current interface; new interfaces will be possible.
The key is that Wave enables exposing third-party APIs to the user in a
totally opaque way: it hides the details and makes it easier for people
to interact with computers.
113. image credit: sam brown, explodingdog
Finally we can live in a world where computers and humans can be friends.
114. [Diagram: the full picture — Foldit, Stardust@home, Galaxy Zoo,
Peer to Patent, and the Turk on the Public/Academic human side; SIOC, RDF,
Entity Extraction, SETI@home, and Folding@home on the Machines side; with
WAVE bridging the two.]
115. • http://code.google.com/p/helpmeigor/
• http://github.com/cameronneylon/ChemSpidey/tree/master
• http://github.com/IanMulvany/janey-robot/tree/master
Some scientific robots have already been created.
121. Biological pathways
It's a hard problem; some data sets are big and complicated.
http://www.reactome.org/ tries to visualise pathways in the
human genome.
123. • Publishers will continue to exist but will become
communication companies
• They must learn to treat the web as a network, not a
distribution channel
• Journals should be more like databases, and vice versa
• Publishing and broadcasting are merging (or colliding?); to
some extent, the same goes for publishing and software
• The disruptive forces include new economics, lower barriers
to entry, and a complex competitive environment
Final thoughts
Some predictions for scientific publishing.
124. • Mobile devices as sensors e.g. noisetube.net
• Rich web applications building on HTML 5 will be a real
competitor to the desktop
• The problem of scientific identity will be solved
• We will have a scientific recommendation engine that works
• Frameworks for programming genetic code, much like we
now program computer code, will be available
• Computers will do much of the heavy lifting of science
• http://www.nature.com/nature/focus/arts/futures
Final thoughts
Some predictions for science.
125. “The future is already here. It's just not very evenly distributed” - William Gibson
Sci Foo is an annual weekend un-conference that brings together people
doing interesting things at the interface between science, technology and culture.
Looking at what these people are doing gives us a hint of things to come.
127. Acknowledgements
• http://www.flickr.com/people/matthewfield/ Matthew Field, Lots Of
People
• http://www.flickr.com/people/garthimage/ Garth Burgess, Southampton
Docks
• http://13c4.wordpress.com/ Pamela Bumstead, 50 reasons not to
• http://www.flickr.com/people/mayeve/ clock
• http://www.flickr.com/people/sublimelyhappy/ Sarah Gerke, Rolodex
• http://www.flickr.com/people/thedepartment/ Kate Andrews, Library
• http://www.flickr.com/people/sirstick/ Alexander Hauser, new mail
• http://commons.wikimedia.org/wiki/User:CJ The Thinker
• Gavin Bell, helpful discussions about OpenID