If Big Data is data that exceeds the processing capacity of conventional systems, and therefore demands alternative processing measures, we are looking at an essentially technological challenge that IT managers are best equipped to address.
The DCC is currently working with 18 HEIs to support and develop their capabilities in research data management and, whilst that technological challenge is not usually core to their expressed concerns, are there issues of curation inherent to Big Data that might force a different perspective?
We have some understanding of Big Data from our contacts in the Astronomy and High Energy Physics domains, and the scale and speed of development in Genomics data generation is well known, but insufficient processing capacity is not one of their more frequent complaints.
That is not to say that Big Science and its Big Data are free of challenges in data curation; only that those challenges are shared with their lesser cousins, where the real difficulty is less one of size than of diversity and complexity.
This brief presentation explores aspects of data curation that go beyond the challenges of processing power but which may lend a broader perspective to the technology selection process.
Graham Pryor
1. Because good research needs good data
Big data
– no big deal for curation?
Graham Pryor, Associate Director, UK Digital Curation Centre
Eduserv Symposium 2012: Big Data, Big Deal?
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
2. Big data – big deal or same deal?
“What need the bridge much broader than the flood?
The fairest grant is the necessity.
Look, what will serve is fit…”
Much Ado About Nothing, Act 1 Scene 1
3. Eduserv Symposium 2012 – speakers' research areas
• Operating Systems & Networking
• Computer and Network Security
• Distributed Systems
• Mobile Computing
• Wireless Networking
• Software Engineering
• High performance compute clusters
• Cloud and grid technologies
• Effective management of large clusters and
cluster file-systems
• Very large database systems (architecture,
management and application optimization)
4. The Digital Curation Centre
• a consortium comprising units from the Universities of Bath
(UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving
challenges in digital curation that could not be tackled by
any single institution or discipline
• funded by JISC to build capacity, capability and skills in
research data management across the UK HEI community
• awarded additional HEFCE funding 2011/13 for
– the provision of support to national cloud services
– targeted institutional development
5. Three perspectives
Scale and complexity
– Volume and pace
– Infrastructure
– Open science
Policy
– Funders
– Institutions
– Ethics & IP
Management
– Storage
– Incentives
– Costs & Sustainability
http://www.nonsolotigullio.com/effettiottici/images/escher.jpg/
6. Challenges of scale and complexity
The problem:
• Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets
• Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system
• Resources and actors are rarely collocated and are therefore difficult to combine

The CARMEN virtual laboratory:
• A federation of server nodes that allows distributed data to be stored local to acquisition
• Analysis codes can be uploaded and executed on the nodes, so that derived datasets need not be transported over low-bandwidth connections
• Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes
• Users access the distributed resources through a web portal emulating a PC desktop

But this is only talking terabytes…

http://www.carmen.org.uk/
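The second group of bullets describes a "move the code to the data" architecture. A minimal sketch of that pattern, assuming a hypothetical AnalysisNode class (none of these names are CARMEN's actual API):

```python
# Hypothetical sketch of the "move code to the data" pattern described
# above: analysis code travels to the node holding the raw data, and only
# the small derived result is returned. Illustrative only, not CARMEN's API.

from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AnalysisNode:
    """One server node in the federation, holding data local to acquisition."""
    name: str
    datasets: Dict[str, List[float]] = field(default_factory=dict)

    def run(self, dataset_id: str, analysis: Callable[[List[float]], float]) -> float:
        # The analysis executes where the data lives; only the scalar
        # result crosses the low-bandwidth connection back to the analyst.
        return analysis(self.datasets[dataset_id])


# Example: compute a mean spike rate remotely instead of downloading
# gigabytes of raw recordings.
node = AnalysisNode("node-edinburgh", {"recording-42": [0.1, 0.4, 0.35, 0.2]})
mean_rate = node.run("recording-42", lambda xs: sum(xs) / len(xs))
print(f"derived result: {mean_rate:.3f}")
```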
7. Big data? – The Large Hadron Collider
Searching for the Higgs Boson
• Predicted annual generation of around 15
petabytes (15 million gigabytes) of data
• Would need >1,700,000 dual-layer DVDs
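The DVD figure is straightforward arithmetic; a quick check, assuming decimal units and a nominal 8.5 GB per dual-layer disc:

```python
# Back-of-envelope check of the slide's DVD comparison.
# Assumes decimal units (1 PB = 1,000,000 GB) and 8.5 GB per dual-layer DVD.
annual_data_gb = 15 * 1_000_000      # 15 petabytes expressed in gigabytes
dvd_capacity_gb = 8.5                # nominal dual-layer DVD capacity
dvds_needed = annual_data_gb / dvd_capacity_gb
print(f"{dvds_needed:,.0f} dual-layer DVDs")  # ~1,764,706, i.e. >1,700,000
```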
8. Big data – the GridPP solution
Crowd sourcing for the LHC
"With GridPP you need never have those data processing blues again…"

Home and office computer users can sign up to the LHC at home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3000 years on a single computer to the project.

http://www.gridpp.ac.uk/about
With the Large Hadron Collider running at CERN the grid is
being used to process the accompanying data deluge. The UK
grid is contributing more than the equivalent of 20,000 PCs to
this worldwide effort.
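For a sense of scale, the volunteer figures above can be averaged with quick arithmetic (real contributions are of course unevenly distributed):

```python
# Rough averages from the GridPP figures quoted above.
cpu_years_donated = 3000
volunteers = 40_000
avg_days_per_volunteer = cpu_years_donated * 365 / volunteers
print(f"average donation: {avg_days_per_volunteer:.1f} CPU-days per volunteer")  # ~27.4
```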
9. Yet… Data Preservation in High Energy Physics?
Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are in many cases unique. At the same time, HEP has no coherent strategy for data preservation and re-use, and many important and complex data sets are simply lost.
David M. South, on behalf of the ICFA DPHEP Study Group
arXiv:1101.3186v1 [hep-ex]
10. Big data in genomics
These studies are generating
valuable datasets which, due to
their size and complexity, need to
be skilfully managed…
11. There’s a bigger deal than big data…
Three perspectives (socio-technical, information systems and research practice) frame a staged process:
1. Identify drivers and champions; analyse stakeholders, issues; identify capability gaps
2. Inventory data assets; profile norms, roles, values; analyse current workflows; identify capability gaps; assess costs, benefits, risks
3. Produce feasible, desirable changes; evaluate fitness for purpose
Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
12. The DCC - building capacity and capability
through targeted institutional development
• 18 institutional engagements, 14 roadshows
• advice and assistance in strategy and policy
• use of curation tools for audit and planning
• training and skills transfer
13. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
14. http://www.flickr.com/photos/mattimattila/3003324844/
“Departments don’t have guidelines or
norms for personal back-up and researcher
procedure, knowledge and diligence varies
tremendously. Many have experienced
moderate to catastrophic data loss”
Incremental Project Report, June 2010
15. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
16. …researchers are
reluctant to adopt new tools and
services unless they know
someone who can recommend
or share knowledge about
them. Support needs to be
based on a close understanding
of the researchers’ work, its
patterns and timetables.
17. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
18. EPSRC expects all those institutions it funds
• to have developed a roadmap aligning their policies
and processes with EPSRC’s nine expectations by
1st May 2012
• to be fully compliant with each of those expectations
by 1st May 2015
• to recognise that compliance will be monitored and
non-compliance investigated and that
• failure to share research data could result in the
imposition of sanctions
19. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
4. …and legislators
20. Rules and regulations… compliance
• Data Protection Act 1998 – Rights, Exemptions, Enforcement
• Freedom of Information Act 2000 – Climategate, Tree Rings, Tobacco and… (what's next?)
• Computer Misuse Act 1990
• etc. etc. etc.
21. Why do we do this?
1. Reports that researchers are often unaware
of threats and opportunities
2. There is a lack of clarity in terms of skills
availability and acquisition
3. Many institutions are unprepared to meet
the increasingly prescriptive demands of
funders
4. …and legislators
5. The advantages from planning, openness
and sharing are not understood
22. Open to all? Case studies of openness
in research
Choices are made according to context, with
degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made
available
• The terms and conditions on which they will be provided
Default position of most:
• YES to protocols, software, analysis tools,
methods and techniques
• NO to making research data content freely
available to everyone
After all, where is the incentive?
Angus Whyte, RIN/NESTA, 2010
24. Main institutional concerns
– Compliance
– Asset management
– Cost benefits
– Incentivisation
– Complexity of the data environment

And big data? There has been no mention yet of any specific challenge from big data, but… institutions are providing resources to work on big data, both equipment and people, and more importantly… the issues central to effective data management are common across the data spectrum, irrespective of size.
25. Some current institutional engagements
• Assessing needs
• Piloting tools, e.g. DataFlow
• RDM roadmaps
• Policy development
• Policy implementation
26. Support offered by the DCC
A central DCC support team and services, helping institutions to assess needs, develop capability and make the case:
• Institutional data catalogues
• Workflow assessment
• Pilot RDM tools
• DAF & CARDIO assessments
• Guidance and training
• RDM policy development
• Advocacy to senior management
• Customised Data Management Plans
• …and support policy implementation
28. Your Data as Assets: DAF
• What are the characteristics of your
research data assets?
– Number?
– Scale?
– Complexity?
– Dependencies?
– Liabilities?
• Why do researchers act the way they do
with respect to data?
• Which data do they need to undertake
productive research?
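A DAF assessment effectively builds an inventory of such assets; below is a hypothetical sketch of what one inventory record might capture (the field names are illustrative and not part of DAF itself):

```python
# Hypothetical inventory record for a research data asset, mirroring the
# DAF questions above. Field names are illustrative, not part of DAF.
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataAsset:
    name: str
    volume_gb: float                  # scale
    formats: List[str]                # complexity
    dependencies: List[str] = field(default_factory=list)  # linked datasets, software
    liabilities: List[str] = field(default_factory=list)   # e.g. consent, licensing


asset = DataAsset(
    name="CNS recordings 2011",
    volume_gb=420.0,
    formats=["HDF5", "CSV"],
    dependencies=["spike-sorting pipeline v2"],
    liabilities=["participant consent limits reuse"],
)
print(asset)
```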
29. DMP Online is a web-based data management
planning tool that allows you to build and edit plans
according to the requirements of the major UK
funders.
The tool also contains helpful guidance and links for
researchers and other data professionals.
http://www.dcc.ac.uk/dmponline
30. An online tool for departments or research groups to
identify their current data management capabilities
and identify coordinated pathways to future
enhancement via a dedicated knowledge base.
CARDIO emphasises a collaborative, consensus-
driven approach, and enables benchmarking with
other groups and institutions.
http://cardio.dcc.ac.uk/
31. DRAMBORA is an audit methodology and tool for
identifying and planning for the management of risks
which may threaten the availability and/or usability of
content in a digital repository or archive.
http://www.repositoryaudit.eu
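At its core, an audit of this kind scores each identified risk so that the most serious can be planned for first; a minimal sketch of likelihood-times-impact scoring, with illustrative scales and field names rather than DRAMBORA's actual schema:

```python
# Minimal sketch of likelihood-times-impact risk scoring, in the spirit of
# a repository risk audit. Scales and field names are illustrative only.
from dataclasses import dataclass


@dataclass
class RepositoryRisk:
    description: str
    likelihood: int   # e.g. 1 (rare) to 6 (frequent)
    impact: int       # e.g. 1 (negligible) to 6 (catastrophic)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


risks = [
    RepositoryRisk("storage media obsolescence", likelihood=4, impact=5),
    RepositoryRisk("loss of funding for curation staff", likelihood=3, impact=6),
]
for r in sorted(risks, key=lambda r: r.score, reverse=True):
    print(f"{r.score:2d}  {r.description}")
```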
32. So, big data
– no big deal for curation?
• Yes, it’s big
• It’s also very complex
• There is no single technology solution
• Issues of human infrastructure are
possibly a bigger challenge
• But for big data aficionados the
technology challenges are big enough
33. Data Management – infrastructure
and data storage challenges...
Scalability
Cost-effectiveness
Security (privacy and IPR)
Robustness and resilience
Low entry barrier
Ease of use
Data-handling / transfer / analysis capabilities
The case for cloud computing in genome informatics.
Lincoln D Stein, May 2010