Because good research needs good data




Big data
– no big deal for curation?
Graham Pryor, Associate Director, UK Digital Curation Centre

Eduserv Symposium 2012: Big Data, Big Deal?


          This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
Big data – big deal or same deal?
“What need the bridge much broader than the flood?
The fairest grant is the necessity.
Look, what will serve is fit…”
                        Much Ado About Nothing, Act 1 Scene 1
Eduserv Symposium 2012 –
      speakers’ Research Areas
•   Operating Systems & Networking
•   Computer and Network Security
•   Distributed Systems
•   Mobile Computing
•   Wireless Networking
•   Software Engineering
             • High performance compute clusters
             • Cloud and grid technologies
             • Effective management of large clusters and
               cluster file-systems
             • Very large database systems (architecture,
               management and application optimization)
The Digital Curation Centre
• a consortium comprising units from the Universities of Bath
  (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving
  challenges in digital curation that could not be tackled by
  any single institution or discipline
• funded by JISC to build capacity, capability and skills in
  research data management across the UK HEI community
• awarded additional HEFCE funding 2011/13 for
   • the provision of support to national cloud services
   • targeted institutional development
Three perspectives
• Scale and complexity
   – Volume and pace
   – Infrastructure
   – Open science
• Policy
   – Funders
   – Institutions
   – Ethics & IP
• Management
   – Storage
   – Incentives
   – Costs & Sustainability
                              http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/
Challenges of scale and complexity
• Globally, >100,000 neuroscientists study the CNS, generating
  massive, intricate and highly interrelated datasets
• Analysts require access to these data to develop algorithms,
  models and schemata that characterise the underlying system
• Resources and actors are rarely collocated and are therefore
  difficult to combine.

• The virtual laboratory is a federation of server nodes that
  allows distributed data to be stored local to acquisition
• Analysis codes can be uploaded and executed on the nodes so
  that derived datasets need not be transported over low
  bandwidth connections
• Data and analysis codes are described by structured metadata,
  providing an index for search, annotation and audit over
  workflows leading to scientific outcomes
• Users access the distributed resources through a web portal
  emulating a PC desktop
http://www.carmen.org.uk/

But this is only talking terabytes…
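The "take the analysis to the data" idea behind the virtual laboratory can be illustrated with a minimal sketch. This is not CARMEN's actual software or API: the `Node` and `Catalogue` classes, the node names, the dataset identifiers and the `region` metadata field are all hypothetical, chosen only to show raw data staying on its node while a small derived result travels back.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical server node; raw data stays where it was acquired."""
    name: str
    datasets: Dict[str, List[float]] = field(default_factory=dict)

    def run(self, dataset_id: str, analysis: Callable[[List[float]], float]) -> float:
        # The analysis code is sent to the node; only the result comes back.
        return analysis(self.datasets[dataset_id])


@dataclass
class Catalogue:
    """Hypothetical metadata index used to find datasets across nodes."""
    entries: List[dict] = field(default_factory=list)

    def register(self, node: Node, dataset_id: str, **metadata: str) -> None:
        self.entries.append({"node": node, "dataset_id": dataset_id, **metadata})

    def search(self, **criteria: str) -> List[dict]:
        # Return every entry whose metadata matches all given criteria.
        return [e for e in self.entries
                if all(e.get(k) == v for k, v in criteria.items())]


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


# Two illustrative nodes, each holding its own locally acquired recording.
node_a = Node("node-a", {"rec-001": [1.0, 2.0, 3.0]})
node_b = Node("node-b", {"rec-002": [10.0, 20.0]})

catalogue = Catalogue()
catalogue.register(node_a, "rec-001", region="cortex")
catalogue.register(node_b, "rec-002", region="retina")

# Find matching datasets via metadata, then run the analysis where they live.
results = {e["dataset_id"]: e["node"].run(e["dataset_id"], mean)
           for e in catalogue.search(region="cortex")}
print(results)
```

Only the single averaged value crosses the network in this sketch; the recordings themselves never leave their node, which is the point of the second and third bullets above.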
Big data? – The Large Hadron Collider




                                 Searching for the Higgs Boson




 • Predicted annual generation of around 15
   petabytes (15 million gigabytes) of data
 • Would need >1,700,000 dual layer DVDs
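The DVD comparison can be checked with a quick back-of-envelope calculation. The sketch below assumes decimal units (1 petabyte = 1,000,000 gigabytes, as in the slide) and a nominal dual-layer DVD capacity of 8.5 GB; the disc capacity is an assumption, not a figure from the slide.

```python
import math

PETABYTE_IN_GB = 1_000_000   # decimal units, matching "15 million gigabytes"
DUAL_LAYER_DVD_GB = 8.5      # assumption: nominal dual-layer DVD capacity


def dvds_needed(petabytes: float, disc_gb: float = DUAL_LAYER_DVD_GB) -> int:
    """Return how many discs are needed to hold the given data volume."""
    total_gb = petabytes * PETABYTE_IN_GB
    # Round up: a partly filled disc still counts as a disc.
    return math.ceil(total_gb / disc_gb)


annual_dvds = dvds_needed(15)
print(f"{annual_dvds:,} dual-layer DVDs per year")
```

Under these assumptions the result is a little over 1.76 million discs, consistent with the ">1,700,000" quoted above.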
Big data – the GridPP solution
                             Crowd sourcing for the LHC
                             Home and office computer users
                             can sign up to the LHC at home
                             project (based at Queen Mary,
                             University of London), which
                             makes use of idle CPU time. So
                             far, 40,000 users in more than 100
                             countries have contributed the
                             equivalent of 3000 years on a
                             single computer to the project.

“With GridPP you need never have those data
processing blues again…”
http://www.gridpp.ac.uk/about
With the Large Hadron Collider running at CERN the grid is
being used to process the accompanying data deluge. The UK
grid is contributing more than the equivalent of 20,000 PCs to
this worldwide effort.
Yet…..Data Preservation in High
Energy Physics?
Data from high-energy physics (HEP)
experiments are collected with significant
financial and human effort and are in many
cases unique. At the same time, HEP has no
coherent strategy for data preservation and
re-use, and many important and complex data sets
are simply lost.
David M. South, on behalf of the ICFA DPHEP Study Group
arXiv:1101.3186v1 [hep-ex]
Big data in genomics



   These studies are generating
   valuable datasets which, due to
   their size and complexity, need to
   be skilfully managed…
There’s a bigger deal than big data…
Three overlapping perspectives: socio-technical management,
information systems, research practice

1.
• Identify drivers and champions
• Analyse stakeholders, issues
• Identify capability gaps
• Assess costs, benefits, risks

2.
• Inventory data assets
• Profile norms, roles, values
• Identify capability gaps
• Analyse current workflows

3.
• Produce feasible, desirable changes
• Evaluate fitness for purpose

                   Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
The DCC - building capacity and capability
through targeted institutional development
•   18 institutional engagements, 14 roadshows
•   advice and assistance in strategy and policy
•   use of curation tools for audit and planning
•   training and skills transfer
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
http://www.flickr.com/photos/mattimattila/3003324844/




       “Departments don’t have guidelines or
   norms for personal back-up and researcher
   procedure, knowledge and diligence varies
       tremendously. Many have experienced
          moderate to catastrophic data loss”
Incremental Project Report, June 2010
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
…researchers are
reluctant to adopt new tools and
services unless they know
someone who can recommend
or share knowledge about
them. Support needs to be
based on a close understanding
of the researchers’ work, its
patterns and timetables.
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
EPSRC expects all those institutions it funds
• to have developed a roadmap aligning their policies
  and processes with EPSRC’s nine expectations by
  1st May 2012
• to be fully compliant with each of those expectations
  by 1st May 2015
• to recognise that compliance will be monitored and
  non-compliance investigated, and that failure to share
  research data could result in the imposition of sanctions
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
4. …and legislators
Rules and regulations…


    Compliance

• Data Protection Act 1998: Rights, Exemptions, Enforcement
• Freedom of Information Act 2000: Climategate, Tree Rings,
  Tobacco and… (what’s next?)
• Computer Misuse Act 1990: etc. etc. etc…
Why do we do this?
1. Reports that researchers are often unaware
   of threats and opportunities
2. There is a lack of clarity in terms of skills
   availability and acquisition
3. Many institutions are unprepared to meet
   the increasingly prescriptive demands of
   funders
4. …and legislators
5. The advantages from planning, openness
   and sharing are not understood
Open to all? Case studies of openness
in research
Choices are made according to context, with
degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made
  available
• On what terms and conditions it will be
  provided

Default position of most:
• YES to protocols, software, analysis tools,
  methods and techniques
• NO to making research data content freely
  available to everyone

After all, where is the incentive?
                                                Angus Whyte, RIN/NESTA, 2010
DCC
Institutional
Engagements




http://www.dcc.ac.uk/community/institutional-engagements
                      Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
Main institutional concerns
– Compliance
– Asset management
– Cost benefits
– Incentivisation
– Complexity of the data environment

And big data? There has been no mention
yet of any specific challenge from big data
but…
Institutions are providing resources to work
on big data, both equipment and people,
and more importantly…
…the issues central to effective data
management are common across the data
spectrum, irrespective of size
Some current institutional engagements
• Assessing needs
• Piloting tools, e.g. DataFlow
• RDM roadmaps
• Policy development
• Policy implementation
Support offered by the DCC
DCC support team:
• Assess needs
• DAF & CARDIO assessments
• Workflow assessment
• Institutional data catalogues
• Pilot RDM tools
• Guidance and training
• RDM policy development
• Customised Data Management Plans
• Advocacy to senior management
• Make the case
• Develop support and services

…and support policy implementation
Four DCC Tools
Your Data as Assets: DAF
• What are the characteristics of your
  research data assets?
  –   Number?
  –   Scale?
  –   Complexity?
  –   Dependencies?
  –   Liabilities?
• Why do researchers act the way they do
  with respect to data?
• Which data do they need to undertake
  productive research?
DMP Online is a web-based data management
planning tool that allows you to build and edit plans
according to the requirements of the major UK
funders.

The tool also contains helpful guidance and links for
researchers and other data professionals.

http://www.dcc.ac.uk/dmponline
An online tool for departments or research groups to
identify their current data management capabilities
and identify coordinated pathways to future
enhancement via a dedicated knowledge base.

CARDIO emphasises a collaborative, consensus-
driven approach, and enables benchmarking with
other groups and institutions.

http://cardio.dcc.ac.uk/
DRAMBORA is an audit methodology and tool for
identifying and planning for the management of risks
which may threaten the availability and/or usability of
content in a digital repository or archive.

http://www.repositoryaudit.eu
So, big data
– no big deal for curation?
• Yes, it’s big
• It’s also very complex
• There is no single technology solution
• Issues of human infrastructure are
  possibly a bigger challenge
• But for big data aficionados the
  technology challenges are big enough
Data Management – infrastructure
and data storage challenges...
Scalability
Cost-effectiveness
Security (privacy and IPR)
Robustness and resilience
Low entry barrier
Ease-of-use
Data-handling / transfer /
analysis capabilities
         The case for cloud computing in genome informatics.
         Lincoln D Stein, May 2010
Help desk:
0131 651 1239

info@dcc.ac.uk

www.dcc.ac.uk

Mais conteúdo relacionado

Mais procurados

Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
TERN Australia
 
TERN data sharing at TRY workshop
TERN data sharing at TRY workshopTERN data sharing at TRY workshop
TERN data sharing at TRY workshop
TERN Australia
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
Dr. Haxel Consult
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
Alex Hardisty
 

Mais procurados (20)

Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster Research Data Management, Challenges and Tools - Per Öster
Research Data Management, Challenges and Tools - Per Öster
 
Research Data Management at Imperial College London
Research Data Management at Imperial College LondonResearch Data Management at Imperial College London
Research Data Management at Imperial College London
 
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
OpenData Public Research, University of Toronto, Open Access Week, 25/11/2011
 
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data MeetupCrowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
Crowdsourcing Approaches to Big Data Curation - Rio Big Data Meetup
 
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital TextsCase Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
Case Study Big Data: Socio-Technical Issues of HathiTrust Digital Texts
 
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
Stuart Phinn_Many kinds of infrastructure: resolving and advancing ecosystem ...
 
Facilitating Scientific Collaborations by Delegating Identity Management
Facilitating Scientific Collaborations by Delegating Identity ManagementFacilitating Scientific Collaborations by Delegating Identity Management
Facilitating Scientific Collaborations by Delegating Identity Management
 
Sgci nsf-2-22-17
Sgci nsf-2-22-17Sgci nsf-2-22-17
Sgci nsf-2-22-17
 
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021
dkNET Webinar: FAIR Data & Software in the Research Life Cycle 01/22/2021
 
TERN data sharing at TRY workshop
TERN data sharing at TRY workshopTERN data sharing at TRY workshop
TERN data sharing at TRY workshop
 
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic AnalysisII-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
II-SDV 2012 Patent Prior-Art Searching with Latent Semantic Analysis
 
Guy avoiding-dat apocalypse
Guy avoiding-dat apocalypseGuy avoiding-dat apocalypse
Guy avoiding-dat apocalypse
 
DEVELOPING A KNOWLEDGE MANAGEMENT SPIRAL FOR THE LONG-TERM PRESERVATION SYSTE...
DEVELOPING A KNOWLEDGE MANAGEMENT SPIRAL FOR THE LONG-TERM PRESERVATION SYSTE...DEVELOPING A KNOWLEDGE MANAGEMENT SPIRAL FOR THE LONG-TERM PRESERVATION SYSTE...
DEVELOPING A KNOWLEDGE MANAGEMENT SPIRAL FOR THE LONG-TERM PRESERVATION SYSTE...
 
Imaging dearry ncrdc 11062017
Imaging dearry ncrdc  11062017Imaging dearry ncrdc  11062017
Imaging dearry ncrdc 11062017
 
SKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID InfrastructureSKA NZ R&D BeSTGRID Infrastructure
SKA NZ R&D BeSTGRID Infrastructure
 
Introduction to research data management; Lecture 01 for GRAD521
Introduction to research data management; Lecture 01 for GRAD521Introduction to research data management; Lecture 01 for GRAD521
Introduction to research data management; Lecture 01 for GRAD521
 
Research Data Management and Librarians
Research Data Management and LibrariansResearch Data Management and Librarians
Research Data Management and Librarians
 
Building the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of ScientistsBuilding the FAIR Research Commons: A Data Driven Society of Scientists
Building the FAIR Research Commons: A Data Driven Society of Scientists
 
Data accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphereData accessibility and the role of informatics in predicting the biosphere
Data accessibility and the role of informatics in predicting the biosphere
 
Introduction to research data management
Introduction to research data managementIntroduction to research data management
Introduction to research data management
 

Destaque

Wayfs and Strays - Jonathan Richardson
Wayfs and Strays - Jonathan RichardsonWayfs and Strays - Jonathan Richardson
Wayfs and Strays - Jonathan Richardson
Eduserv
 
Identity & Access Management Update - David Orrell
Identity & AccessManagement Update - David OrrellIdentity & AccessManagement Update - David Orrell
Identity & Access Management Update - David Orrell
Eduserv
 

Destaque (14)

Maple University of Waterloo case study
Maple University of Waterloo case studyMaple University of Waterloo case study
Maple University of Waterloo case study
 
Owain Davies - The value of syndicating health information - an NHS case study
Owain Davies - The value of syndicating health information - an NHS case studyOwain Davies - The value of syndicating health information - an NHS case study
Owain Davies - The value of syndicating health information - an NHS case study
 
SharePoint in Higher Education Institutions
SharePoint in Higher Education InstitutionsSharePoint in Higher Education Institutions
SharePoint in Higher Education Institutions
 
Security radar for 2014
Security radar for 2014Security radar for 2014
Security radar for 2014
 
Wayfs and Strays - Jonathan Richardson
Wayfs and Strays - Jonathan RichardsonWayfs and Strays - Jonathan Richardson
Wayfs and Strays - Jonathan Richardson
 
The Eduserv Cloud: Who, What, Why, When and Where?
The Eduserv Cloud: Who, What, Why, When and Where?The Eduserv Cloud: Who, What, Why, When and Where?
The Eduserv Cloud: Who, What, Why, When and Where?
 
UMF Cloud Pilot
UMF Cloud PilotUMF Cloud Pilot
UMF Cloud Pilot
 
Practically applying agile
Practically applying agilePractically applying agile
Practically applying agile
 
The role of a University Computing Service in an increasingly mobile world OR...
The role of a University Computing Service in an increasingly mobile world OR...The role of a University Computing Service in an increasingly mobile world OR...
The role of a University Computing Service in an increasingly mobile world OR...
 
The Molly Project & Mobile Oxford
The Molly Project & Mobile OxfordThe Molly Project & Mobile Oxford
The Molly Project & Mobile Oxford
 
Design Patterns for Digital Identity
Design Patterns for Digital IdentityDesign Patterns for Digital Identity
Design Patterns for Digital Identity
 
Identity & Access Management Update - David Orrell
Identity & AccessManagement Update - David OrrellIdentity & AccessManagement Update - David Orrell
Identity & Access Management Update - David Orrell
 
Beyond Library eResources: Using OpenAthens for Enterprise Security
Beyond Library eResources: Using OpenAthens for Enterprise SecurityBeyond Library eResources: Using OpenAthens for Enterprise Security
Beyond Library eResources: Using OpenAthens for Enterprise Security
 
Case study: Building a business case for cloud, migration in practice and spr...
Case study: Building a business case for cloud, migration in practice and spr...Case study: Building a business case for cloud, migration in practice and spr...
Case study: Building a business case for cloud, migration in practice and spr...
 

Semelhante a Graham Pryor

CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECAProject
 

Semelhante a Graham Pryor (20)

Big Data
Big Data Big Data
Big Data
 
Supporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data ManagementSupporting Libraries in Leading the Way in Research Data Management
Supporting Libraries in Leading the Way in Research Data Management
 
High Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run TimeHigh Performance Data Analytics and a Java Grande Run Time
High Performance Data Analytics and a Java Grande Run Time
 
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
CINECA webinar slides: Data Gravity in the Life Sciences: Lessons learned fro...
 
NIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWGNIST Big Data Public Working Group NBD-PWG
NIST Big Data Public Working Group NBD-PWG
 
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
Sirris innovate2011 - Smart Products with smart data - introduction, Dr. Elen...
 
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
2013 DataCite Summer Meeting - DOIs and Supercomputing (Terry Jones - Oak Rid...
 
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...Meeting Federal Research Requirements for Data Management Plans, Public Acces...
Meeting Federal Research Requirements for Data Management Plans, Public Acces...
 
The e-Ciber Superfacility Project
The e-Ciber Superfacility ProjectThe e-Ciber Superfacility Project
The e-Ciber Superfacility Project
 
Managing and Sharing Research Data
Managing and Sharing Research DataManaging and Sharing Research Data
Managing and Sharing Research Data
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant Application
 
Creating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant ApplicationCreating a Data Management Plan for your Grant Application
Creating a Data Management Plan for your Grant Application
 
DBMS
DBMSDBMS
DBMS
 
Intro to RDM
Intro to RDMIntro to RDM
Intro to RDM
 
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
The Paradigm of Fog Computing with Bio-inspired Search Methods and the “5Vs” ...
 
Ticer summer school_24_aug06
Ticer summer school_24_aug06Ticer summer school_24_aug06
Ticer summer school_24_aug06
 
Introduction to research data management
Introduction to research data managementIntroduction to research data management
Introduction to research data management
 
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
PIDs, Data and Software: How Libraries Can Support Researchers in an Evolving...
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
 
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDATResearch Data Management: An Introductory Webinar from OpenAIRE and EUDAT
Research Data Management: An Introductory Webinar from OpenAIRE and EUDAT
 

Mais de Eduserv

Mais de Eduserv (20)

Phase two of OpenAthens SP evolution including OpenID connect option
Phase two of OpenAthens SP evolution including OpenID connect optionPhase two of OpenAthens SP evolution including OpenID connect option
Phase two of OpenAthens SP evolution including OpenID connect option
 
Partnership Licensing - allowing access to licensed resources
Partnership Licensing - allowing access to licensed resources Partnership Licensing - allowing access to licensed resources
Partnership Licensing - allowing access to licensed resources
 
Lightning talk - EBSCO
Lightning talk - EBSCOLightning talk - EBSCO
Lightning talk - EBSCO
 
Lightning talk - Boopsie
Lightning talk - BoopsieLightning talk - Boopsie
Lightning talk - Boopsie
 
Lightning talk - Softlink
Lightning talk - SoftlinkLightning talk - Softlink
Lightning talk - Softlink
 
Lightning talk - Third Iron BrowZine
Lightning talk - Third Iron BrowZineLightning talk - Third Iron BrowZine
Lightning talk - Third Iron BrowZine
 
Lightning talk - Eduserv Chest Agreements
Lightning talk - Eduserv Chest AgreementsLightning talk - Eduserv Chest Agreements
Lightning talk - Eduserv Chest Agreements
 
Phase one of OpenAthens SP evolution
Phase one of OpenAthens SP evolutionPhase one of OpenAthens SP evolution
Phase one of OpenAthens SP evolution
 
Key considerations when mapping your end user experience
Key considerations when mapping your end user experienceKey considerations when mapping your end user experience
Key considerations when mapping your end user experience
 
Our product development methodology
Our product development methodologyOur product development methodology
Our product development methodology
 
How Readers Discover Content
How Readers Discover ContentHow Readers Discover Content
How Readers Discover Content
 
OpenAthens product update
OpenAthens product updateOpenAthens product update
OpenAthens product update
 
OpenAthens Customer Conference - Welcome address
OpenAthens Customer Conference - Welcome addressOpenAthens Customer Conference - Welcome address
OpenAthens Customer Conference - Welcome address
 
Generating leads with content marketing
Generating leads with content marketingGenerating leads with content marketing
Generating leads with content marketing
 
Pre-launch introduction to the new OpenAthens SP dashboard - 13/09/2016
Pre-launch introduction to the new OpenAthens SP dashboard - 13/09/2016Pre-launch introduction to the new OpenAthens SP dashboard - 13/09/2016
Pre-launch introduction to the new OpenAthens SP dashboard - 13/09/2016
 
Mobius from Maplesoft
Mobius from MaplesoftMobius from Maplesoft
Mobius from Maplesoft
 
QSR NVivo
QSR NVivo QSR NVivo
QSR NVivo
 
How Eduserv are helping local government organisations
How Eduserv are helping local government organisationsHow Eduserv are helping local government organisations
How Eduserv are helping local government organisations
 
Is cloud the right fit for your needs?
Is cloud the right fit for your needs?Is cloud the right fit for your needs?
Is cloud the right fit for your needs?
 
Planning your cloud strategy: Adur and Worthing Councils
Planning your cloud strategy: Adur and Worthing CouncilsPlanning your cloud strategy: Adur and Worthing Councils
Planning your cloud strategy: Adur and Worthing Councils
 

Último

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Graham Pryor

  • 1. Because good research needs good data Big data – no big deal for curation? Graham Pryor, Associate Director, UK Digital Curation Centre Eduserv Symposium 2012: Big Data, Big Deal? . This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
  • 2. Big data – big deal or same deal? “What need the bridge much broader than the flood? The fairest grant is the necessity. Look, what will serve is fit…” Much Ado About Nothing, Act 1 Scene 1
  • 3. Eduserv Symposium 2012 – speakers’ Research Areas • Operating Systems & Networking • Computer and Network Security • Distributed Systems • Mobile Computing • Wireless Networking • Software Engineering • High performance compute clusters • Cloud and grid technologies • Effective management of large clusters and cluster file-systems • Very large database systems (architecture, management and application optimization)
  • 4. The Digital Curation Centre • a consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII) • launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline • funded by JISC to build capacity, capability and skills in research data management across the UK HEI community • awarded additional HEFCE funding 2011/13 for • the provision of support to national cloud services • targeted institutional development
  • 5. Three perspectives Scale and complexity – Volume and pace – Infrastructure – Open science Policy – Funders – Institutions – Ethics & IP Management – Storage – Incentives – Costs & Sustainability http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/
  • 6. Challenges of scale and complexity • Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets • Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system • Resources and actors are rarely collocated and are therefore difficult to combine. • The virtual laboratory is a federation of server nodes that allows distributed data to be stored local to acquisition • Analysis codes can be uploaded and executed on the nodes so that derived datasets need not be transported over low bandwidth connections • Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes • Users access the distributed resources through a web portal emulating a PC desktop http://www.carmen.org.uk/ But this is only talking terabytes…
  • 7. Big data? – The Large Hadron Collider Searching for the Higgs Boson • Predicted annual generation of around 15 petabytes (15 million gigabytes) of data • Would need >1,700,000 dual layer DVDs
  • 8. Big data – the GridPP solution Crowd sourcing for the LHC Home and office computer users can sign up to the LHC at home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3000 years on a single computer to the project. “With GridPP you need never have those data processing blues again…” http://www.gridpp.ac.uk/about With the Large Hadron Collider running at CERN the grid is being used to process the accompanying data deluge. The UK grid is contributing more than the equivalent of 20,000 PCs to this worldwide effort.
  • 9. Yet…..Data Preservation in High Energy Physics? Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are in many cases unique. At the same time, HEP has no coherent strategy for data preservation and re-use, and many important and complex data sets are simply lost. David M. South, on behalf of the ICFA DPHEP Study Group arXiv:1101.3186v1 [hep-ex]
  • 10. Big data in genomics These studies are generating valuable datasets which, due to their size and complexity, need to be skilfully managed…
  • 11. There’s a bigger deal than big data… Socio-technical management perspectives Information systems perspectives Research practice perspectives 1. • Identify drivers and champions • Analyse stakeholders, issues • Identify capability gaps 2. • Inventory data assets • Profile norms, roles, values • Identify capability gaps • Analyse current workflows 3. • Assess costs, benefits, risks • Produce feasible, desirable changes • Evaluate fitness for purpose Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
  • 12. The DCC - building capacity and capability through targeted institutional development • 18 institutional engagements, 14 roadshows • advice and assistance in strategy and policy • use of curation tools for audit and planning • training and skills transfer
  • 13. Why do we do this? 1. Reports that researchers are often unaware of threats and opportunities
  • 14. http://www.flickr.com/photos/mattimattila/3003324844/ “Departments don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss” Incremental Project Report, June 2010
  • 15. Why do we do this? 1. Reports that researchers are often unaware of threats and opportunities 2. There is a lack of clarity in terms of skills availability and acquisition
  • 16. …researchers are reluctant to adopt new tools and services unless they know someone who can recommend or share knowledge about them. Support needs to be based on a close understanding of the researchers’ work, its patterns and timetables.
  • 17. Why do we do this? 1. Reports that researchers are often unaware of threats and opportunities 2. There is a lack of clarity in terms of skills availability and acquisition 3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders
  • 18. EPSRC expects all those institutions it funds • to have developed a roadmap aligning their policies and processes with EPSRC’s nine expectations by 1st May 2012 • to be fully compliant with each of those expectations by 1st May 2015 • to recognise that compliance will be monitored and non-compliance investigated and that • failure to share research data could result in the imposition of sanctions
  • 19. Why do we do this? 1. Reports that researchers are often unaware of threats and opportunities 2. There is a lack of clarity in terms of skills availability and acquisition 3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders 4. …and legislators
  • 20. Rules and regulations… Compliance Data Protection Act 1998 • Rights, Exemptions, Enforcement Freedom of Information Act 2000 • Climategate, Tree Rings, Tobacco and…(what’s next?) Computer Misuse Act 1990 • etc. etc. etc………..
  • 21. Why do we do this? 1. Reports that researchers are often unaware of threats and opportunities 2. There is a lack of clarity in terms of skills availability and acquisition 3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders 4. …and legislators 5. The advantages from planning, openness and sharing are not understood
  • 22. Open to all? Case studies of openness in research Choices are made according to context, with degrees of openness reached according to: • The kinds of data to be made available • The stage in the research process • The groups to whom data will be made available • On what terms and conditions it will be provided Default position of most: • YES to protocols, software, analysis tools, methods and techniques • NO to making research data content freely available to everyone After all, where is the incentive? Angus Whyte, RIN/NESTA, 2010
  • 23. DCC Institutional Engagements http://www.dcc.ac.uk/community/institutional-engagements Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
  • 24. Main institutional concerns – Compliance – Asset management – Cost benefits – Incentivisation – Complexity of the data environment And big data? There has been no mention yet of any specific challenge from big data but… Institutions are providing resources to work on big data, both equipment and people, and more importantly… …the issues central to effective data management are common across the data spectrum, irrespective of size
  • 25. Some current institutional engagements • Assessing needs • Piloting tools e.g. DataFlow • RDM roadmaps • Policy development • Policy implementation
  • 26. Support offered by the DCC • Assess needs • Make the case • Develop support and services • RDM policy development • Customised Data Management Plans • DAF & CARDIO assessments • Guidance and training • Workflow assessment • DCC support team • Advocacy to senior management • Institutional data catalogues • Pilot RDM tools …and support policy implementation
  • 28. Your Data as Assets: DAF • What are the characteristics of your research data assets? – Number? – Scale? – Complexity? – Dependencies? – Liabilities? • Why do researchers act the way they do with respect to data? • Which data do they need to undertake productive research?
  • 29. DMP Online is a web-based data management planning tool that allows you to build and edit plans according to the requirements of the major UK funders. The tool also contains helpful guidance and links for researchers and other data professionals. http://www.dcc.ac.uk/dmponline
  • 30. An online tool for departments or research groups to identify their current data management capabilities and identify coordinated pathways to future enhancement via a dedicated knowledge base. CARDIO emphasises a collaborative, consensus- driven approach, and enables benchmarking with other groups and institutions. http://cardio.dcc.ac.uk/
  • 31. DRAMBORA is an audit methodology and tool for identifying and planning for the management of risks which may threaten the availability and/or usability of content in a digital repository or archive. http://www.repositoryaudit.eu
  • 32. So, big data – no big deal for curation? • Yes, it’s big • It’s also very complex • There is no single technology solution • Issues of human infrastructure are possibly a bigger challenge • But for big data aficionados the technology challenges are big enough
  • 33. Data Management – infrastructure and data storage challenges... • Scaleability • Cost-effectiveness • Security (privacy and IPR) • Robust and resilient • Low entry barrier • Ease-of-use • Data-handling / transfer / analysis capabilities The case for cloud computing in genome informatics. Lincoln D Stein, May 2010
  • 34. Help desk: 0131 651 1239 info@dcc.ac.uk www.dcc.ac.uk
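
The DVD comparison on slide 7 (15 petabytes a year, more than 1,700,000 dual layer DVDs) can be checked with a little arithmetic. The sketch below is illustrative only: it assumes decimal units (1 petabyte = 1,000,000 gigabytes, as the slide's "15 million gigabytes" implies) and a nominal dual layer DVD capacity of 8.5 GB, a figure that is not stated in the slides. The function name `dvds_needed` is hypothetical.

```python
import math

# Assumption: decimal units, matching the slide's "15 million gigabytes".
GB_PER_PB = 1_000_000

# Assumption: nominal capacity of a dual layer DVD in GB (not given in the slides).
DUAL_LAYER_DVD_GB = 8.5


def dvds_needed(petabytes: float, dvd_capacity_gb: float = DUAL_LAYER_DVD_GB) -> int:
    """Return the number of whole discs needed to hold `petabytes` of data."""
    total_gb = petabytes * GB_PER_PB
    return math.ceil(total_gb / dvd_capacity_gb)


if __name__ == "__main__":
    discs = dvds_needed(15)
    print(f"15 PB needs about {discs:,} dual layer DVDs")
```

Under these assumptions the result is a little over 1.76 million discs, which is consistent with the ">1,700,000" quoted on the slide.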