This document summarizes how Python is used for high throughput science at Diamond Light Source. It describes how Python has been built into the facility's data acquisition tools, analysis workflows, data processing pipelines, and tomography reconstruction to handle the large data volumes and rates from its detectors and beamlines. Python modules have been developed, and libraries like NumPy and SciPy and tools like IPython and DAWN are used, to make Python accessible for scientists and enable processing of big data on the facility's clusters.
5. What do I do?
• Provide data analysis for users, for use during and after beamtime
– Users may or may not have any prior experience.
– ~30 beamlines with over 100 techniques used.
• With 12 other full-time developers
6. Where it all started
[Diagram: Acquisition (www.opengda.org). All core technologies open source.]
• Client server technology
• Communication with EPICS and hardware
• Scan mechanism
• Jython and Python
• Visualisation
• Communication with external analysis
• Analysis tools
• 1.0 release 2002
• 3.0 release 2004
– Jython introduced as scripting language
• Beamline setup and data collection speed increased.
8. Detector History at DLS
• Early 2007:
– Diamond first user.
– No detector faster than ~10 MB/s.
• Early 2009:
– first Lustre system (DDN S2A9900)
– first Pilatus 6M system @ 60 MB/s.
• Early 2011:
– second Lustre system (DDN SFA10K)
– first 25 Hz Pilatus 6M system @ 150 MB/s.
• Early 2013:
– first GPFS system (DDN SFA12K)
– first 100 Hz Pilatus 6M system @ 600 MB/s
– ~10 beamlines with 10 GbE detectors (mainly Pilatus and PCO Edge).
• Early 2015:
– delivery of Percival detector (6000 MB/s).
[Chart: Peak Detector Performance (MB/s), logarithmic axis from 1 to 10000, years 2007 to 2012.]
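The Pilatus 6M figures in the detector history are consistent with a frame of about 6 MB, since the data rate is the frame size times the frame rate. A minimal sketch of that arithmetic follows. The function names are illustrative, and the frame size is inferred from the slide's own numbers rather than taken from a detector specification.

```python
def implied_frame_mb(rate_mb_per_s, frames_per_s):
    """Frame size in MB implied by a sustained data rate and a frame rate."""
    return rate_mb_per_s / frames_per_s


def tb_per_day(rate_mb_per_s):
    """Volume in TB written in 24 hours at a sustained rate (1 TB = 1e6 MB)."""
    return rate_mb_per_s * 86400 / 1e6


# Figures quoted on the slide: (frame rate in Hz, data rate in MB/s).
pilatus_6m = [(25, 150), (100, 600)]
frame_sizes = [implied_frame_mb(rate, hz) for hz, rate in pilatus_6m]

# Upper bound for the Percival figure, assuming it ran continuously for a day.
percival_tb_per_day = tb_per_day(6000)
```

Both Pilatus entries give 6 MB per frame, and a detector sustaining 6000 MB/s would write roughly 518 TB in a day if it never paused.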
10. Data Storage
● ~1PB of Lustre
● ~1PB of GPFS
● ~0.5PB of on-line archive
● ~1PB near-line archive
– >200M files
High performance parallel file systems HATE lots of small files.
13. “I have all the data I have ever collected on a floppy disk and process it by hand…”
Principal beam-line scientist when asked about data volumes in 2005
14. “I have all the data I have ever collected on a floppy disk and process it by hand…”
~1 TB so far this year
16. Processing Playing with Data (Variety)
• Experimental work requires exploring
– Matlab
– IDL
– IgorPro
– Excel
– Origin
– Mathematica
• Issue is scalability: whether these scale at all, and at a reasonable price
17. Clusters (Velocity)
● 132 Intel based nodes, 1280 Intel cores in service.
● 80 NVIDIA GPGPUs, 23328 GPU cores in service.
● Split across 6 clusters, with a range of capabilities.
● Mostly used by MX and tomography beamlines.
● All accessed via Sun Grid Engine interface.
18. Python is the Obvious answer
• Users have used it during their beam times.
• Free and easily distributable.
• ...
• BUT – how to give it to them in a way they
understand.
19. Extending the Acquisition tools
[Diagram: Acquisition (www.opengda.org) and Analysis (www.dawnsci.org). All core technologies open source.]
• Client server technology
• Communication with EPICS and hardware
• Scan mechanism
• Jython and Python
• Visualisation
• Communication with external analysis
• Analysis tools
• Data read, write, convert
• Metadata structure
• Workflows
DAWN is a collection of generic and bespoke ‘views’ collated into ‘perspectives’. The perspectives and views can be used in part or whole in either the GDA or DAWN.
20. Main Dawn Elements for Python
[Diagram: Python/Jython in DAWN (www.dawnsci.org)]
• Data Exploring
• Workflow
• PyDev Scripting
• IPython Console
• Python Actor
• scisoftpy module
• HDF5
• Visualisation
25. Processing Playing with Data (Variety)
• Experimental work requires exploring
– Python
• Scientific Software team
– Modules for easy access and common tasks
– Repositories and Training
26. Aside – Python for Optimization
• We produce a very fast beam of electrons (99.999999% of the speed of light)
• We oscillate this beam between magnet arrays called Insertion Devices (IDs) to make lots of light
29. Simple Optimisation Problem
• From 800 magnets, pick 600 of them in the right order so that they appear to be a perfect array.
• But we already have code in Fortran
– Bit hard to use
– Not that extensible to new systems
30. Objective Functions
• Slower in Python than Fortran
– Original Python code ~1,000 times slower
– NumPy array optimised ~10 times slower
• Python improvements
– Caching ~matched the Fortran speed
– Clever updating ~100 times faster
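One way to read "clever updating" is that when two magnets swap places, only their two contributions to the objective change, so the rest need not be recomputed. The sketch below illustrates that idea on a toy linear model. The error values, the `kernel` weights and the sum-of-squares cost are assumptions made for illustration and are not the objective function used at Diamond.

```python
import numpy as np


def full_field(errors, order, kernel):
    """Field error at every sample point, recomputed from scratch."""
    return errors[order] @ kernel


def swapped_field(field, errors, order, kernel, a, b):
    """Field error after swapping slots a and b, touching only those two terms."""
    delta = errors[order[b]] - errors[order[a]]
    return field + delta * (kernel[a] - kernel[b])


def cost(field):
    """Toy objective: sum of squared field errors (zero for a perfect array)."""
    return float(np.sum(field ** 2))


rng = np.random.default_rng(0)
errors = rng.normal(size=800)          # one hypothetical error value per magnet
order = rng.permutation(800)[:600]     # which magnet sits in each of 600 slots
kernel = rng.normal(size=(600, 50))    # hypothetical slot-to-sample-point weights

field = full_field(errors, order, kernel)
fast = swapped_field(field, errors, order, kernel, a=3, b=17)

# Reference result: apply the swap to the ordering and recompute everything.
swapped_order = order.copy()
swapped_order[[3, 17]] = swapped_order[[17, 3]]
slow = full_field(errors, swapped_order, kernel)
```

The incremental version does work proportional to the number of sample points, while the full recompute scales with slots times sample points, which is where a large speed-up per evaluation could come from.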
31. OptID
• Artificial Immune systems
– Global optimiser
– Need more evaluations
• Parallelization
– Threading with NumPy to use processors
– mpi4py for data transfer and making use of the cluster
• Running on 25 machines, 200 CPUs
• First sort with the new code has been built.
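As a rough illustration of an artificial immune system optimiser, the sketch below uses a clonal selection loop: good candidates are cloned, clones of better-ranked candidates are mutated less, and a random newcomer is added each generation. This is a generic textbook-style scheme and not the OptID code. The toy cost, the parameter values and all names are assumptions, it only orders a fixed set of magnets rather than picking 600 of 800, and the parallelisation across a cluster is left out.

```python
import random


def clonal_selection(cost, n_items, pop_size=8, clones=5, generations=40, seed=0):
    """Search over orderings of n_items; returns (best ordering, best cost per generation)."""
    rng = random.Random(seed)

    def random_order():
        order = list(range(n_items))
        rng.shuffle(order)
        return order

    def mutate(order, swaps):
        child = list(order)
        for _ in range(swaps):
            i, j = rng.sample(range(n_items), 2)
            child[i], child[j] = child[j], child[i]
        return child

    population = sorted((random_order() for _ in range(pop_size)), key=cost)
    history = [cost(population[0])]
    for _ in range(generations):
        candidates = list(population)
        for rank, antibody in enumerate(population):
            # Better-ranked antibodies are mutated less: rank 0 gets one swap.
            candidates.extend(mutate(antibody, rank + 1) for _ in range(clones))
        candidates.sort(key=cost)
        # Keep the best candidates and add one random newcomer for diversity.
        population = sorted(candidates[:pop_size - 1] + [random_order()], key=cost)
        history.append(cost(population[0]))
    return population[0], history


# Hypothetical toy problem: order 12 magnets so neighbouring errors cancel.
magnet_errors = [0.9, -0.8, 0.5, -0.4, 0.3, -0.3, 0.2, -0.1, 0.7, -0.6, 0.1, -0.5]


def pair_cost(order):
    return sum((magnet_errors[a] + magnet_errors[b]) ** 2
               for a, b in zip(order, order[1:]))


best_order, history = clonal_selection(pair_cost, len(magnet_errors))
```

Because the current best always survives into the next generation, the best cost never gets worse. The many independent cost evaluations per generation are the part that would be spread over cores and machines.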
33. Archiving (Veracity)
• Simple task of registering files and metadata with a remote service.
– XML parsing
– Contact web services
– File system interaction
• Nearly 1PB of data and 200 million files archived through this system.
• Extended onto the cluster to deal with the additional load.
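The file system and XML parts of such a registration task can be sketched with the standard library alone. The example below walks a directory and builds an XML manifest with one entry per file. The element and attribute names are hypothetical, the checksum is an illustrative choice, and the call to the remote web service is omitted because the slides do not describe its interface.

```python
import hashlib
import os
import tempfile
import xml.etree.ElementTree as ET


def build_manifest(root_dir):
    """Return an XML element listing every file under root_dir with size and checksum."""
    manifest = ET.Element("manifest", root=root_dir)
    for folder, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            ET.SubElement(
                manifest,
                "file",
                path=os.path.relpath(path, root_dir),
                size=str(os.path.getsize(path)),
                sha256=digest,
            )
    return manifest


# Small demonstration on a temporary directory holding one fake data file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "scan_0001.dat"), "wb") as handle:
        handle.write(b"abc")
    xml_text = ET.tostring(build_manifest(tmp), encoding="unicode")
```

For large data files the checksum would be computed in chunks rather than by reading the whole file into memory.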
35. MX Data Reduction (Volume)
• Fast DP - fast
[Flowchart labels: Index; Integrate; Pointless; Scale, refine in P1; Scale, postrefine, merge in point group; Choose best point group; Integrate (×4); Output MTZ File]
• xia2 – thorough
• downstream processing...
36. Experimental Phasing (Velocity)
• Fast EP
[Flowchart labels: Fast DP MTZ file; Prepare for Shelx - ShelxC; Find substructure - ShelxD (#sites, Spacegroups); Phase - ShelxE (Solvent fraction 0.25, 0.75; Original, Inverted); Experimentally phased map]
Results location: (visitpath)/processed/(folder)/(prefix)
37. DIALS
• Full application being built in Python
– 4 full time developers
• CCTBX
– Extending and working with this open source project
• Boost
– Optimization when required using Boost
39. Tomography Current Implementation
• Existing codes for reconstruction in C with CUDA
– Only runs on TIFFs
– Minimal data correction for experimental artefacts
– Only uses 1 GPU
• Python
– Splits data and manages cluster usage (2 GPUs per node)
– Extracts corrected data from HDF
– Builds input files from metadata
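Splitting a reconstruction across nodes with two GPUs each can be as simple as handing every GPU a contiguous range of slices. The sketch below shows one such scheme. The function name, the job dictionary layout and the slice count are hypothetical, and submitting the jobs to the cluster scheduler is not shown.

```python
def split_slices(n_slices, n_nodes, gpus_per_node=2):
    """Assign contiguous slice ranges to every (node, gpu) pair, as evenly as possible."""
    n_workers = n_nodes * gpus_per_node
    base, extra = divmod(n_slices, n_workers)
    jobs = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional slice each.
        count = base + (1 if worker < extra else 0)
        node, gpu = divmod(worker, gpus_per_node)
        jobs.append({"node": node, "gpu": gpu, "start": start, "stop": start + count})
        start += count
    return jobs


# Hypothetical example: 2160 slices over 4 nodes, i.e. 8 GPUs.
jobs = split_slices(n_slices=2160, n_nodes=4)
```

Each job's `start` and `stop` could then be written into the input file for a single-GPU reconstruction run.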
40. Tomography Next Gen
• mpi4py
– Cluster organisation
– Parallelism
– Queues using send buffers
• Transfer of data using ZeroMQ
– Using blosc for compression
• Processing in Python where possible
– But calls to external code will be used initially
43. Multiprocessor/MPI “profiling”
• JavaScript
    var dataTable = new google.visualization.DataTable()
• Python
    import logging
    # machine_number_string, rank_names and machine_rank are defined elsewhere
    logging.basicConfig(level=0,
        format='L %(asctime)s.%(msecs)03d M' + machine_number_string
               + ' ' + rank_names[machine_rank]
               + ' %(levelname)-6s %(message)s',
        datefmt='%H:%M:%S')
• Jinja2 templating to tie the 2 together
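A log format like this makes every line machine-readable, so the timestamps from each rank can be turned into rows for a timeline chart. The sketch below writes one tagged line and parses it back. The helper names, the regular expression and the row layout are assumptions. The machine and rank are passed in as plain strings so that the example runs without MPI, and the Jinja2 and chart steps are not shown.

```python
import io
import logging
import re


def make_rank_logger(machine, rank_name, stream):
    """Logger whose lines carry machine and rank tags, in the spirit of the slide's format."""
    fmt = ("L %(asctime)s.%(msecs)03d M" + machine + " " + rank_name
           + " %(levelname)-6s %(message)s")
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(fmt, datefmt="%H:%M:%S"))
    logger = logging.getLogger("profile." + machine + "." + rank_name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    return logger


LINE = re.compile(r"^L (\d\d):(\d\d):(\d\d)\.(\d{3}) M(\S+) (\S+) +(\S+) +(.*)$")


def parse_line(line):
    """Turn one log line into a dict, or return None if it does not match."""
    match = LINE.match(line)
    if match is None:
        return None
    hours, minutes, seconds, millis, machine, rank, level, message = match.groups()
    stamp = int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0
    return {"seconds": stamp, "machine": machine, "rank": rank,
            "level": level, "message": message}


stream = io.StringIO()
log = make_rank_logger("03", "worker", stream)
log.info("reconstructed slice 12")
rows = [parse_line(line) for line in stream.getvalue().splitlines()]
```

Rows of this shape, sorted by `seconds` and grouped by `rank`, are the kind of data a template could write into a JavaScript `DataTable`.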
44. Where are we going?
• Scientists are having to become developers
– We try to steer them in the right direction
– Python is a very good, if not the best, tool to do this
• Developers are having to work faster and be more reactive to new detectors, clusters, software, methods...
– Python allows this, and is being adopted almost as standard by new computational projects at Diamond
45. Acknowledgements
– Alun Ashton
– Graeme Winter
– Greg Matthews
– Tina Friedrich
– Frederik Ferner
– Jonah Graham (Kichwa)
– Matthew Gerring
– Peter Chang
– Baha El Kassaby
– Jacob Filik
– Karl Levik
– Irakli Sikharulidze
– Olof Svensson
– Andy Gotz
– Gábor Náray
– Ed Rial
– Robert Oates