1. EUDAT 2018 Conference
Porto, 24th January 2018
Overall View of the EUXDAT Project
F. Javier Nieto, Miguel Ángel Esbrí - ATOS
This project has received funding from the European Union’s
Horizon 2020 research and innovation programme under
grant agreement No 777549
www.EUXDAT.eu
European e-Infrastructure
for Extreme Data Analytics
in Sustainable Development
2.
1. Project & Consortium
2. Objectives & Context
3. Data Sources & Pilots
4. Architecture
5. Data Analysis and Visualization Tools
3. Project & Consortium
EUXDAT: European e-Infrastructure for Extreme Data Analytics in
Sustainable Development
Topic: EINFRA-21-2017 Platform e-infrastructure innovation
Duration: 3 years (started in 11/2017)
Budget: 3 M €
9 partners
Industry: ATOSES*, ATOSFR
University: USTUTT
Research: CERTH, WRLS
SMEs: m-blue, PI
End-users: P4A, CoO
4. Objectives & Context
“…EUXDAT will build up a Large Data Analytics-as-a-Service
e-Infrastructure
by connecting extremely large and heterogeneous data sources,
expertise from various disciplines and results from past projects
in order to provide analytics tools and services for diverse end-users
supporting sustainable and productive agriculture, soil and
water protection, regional biodiversity and green infrastructure
sustainable development…”
5. Objectives & Context
O1: Manage data storage and movement + Support
heterogeneous data sources + Configurable policies
O2: Adapt data processing tools for HPC + Improved users’
Portal + New resources management (hybrid HPC+Cloud)
O3: Service activities (access to the e-Infrastructure) + Pilots
implementation + Access to Data
O4: Networking activities + Long-term sustainability +
Collaboration (e.g. PRACE, EGI)
10. Thank you for your attention
Miguel Ángel Esbrí
miguel.esbri@atos.net
http://www.euxdat.eu/
Twitter: @euxdat
Editor's Notes
O1: Develop a set of tools for managing extremely large datasets, taking into account storage requirements, different formats and management policies for reducing data movement latency and protecting the information.
-Analyse how the pilot applications use and store the different kinds of data (data streams, array databases, hyperspectral images, etc.), identifying the main bottlenecks, the required metadata and ways to keep track of data provenance;
-Provide a specific tool for managing the storage and movement of large datasets from their source to target e-Infrastructures based on HPC and/or Cloud resources, able to connect a great variety of data sources through a plugin-like architecture and exposing a common API for object-oriented data, databases and data streams;
-Define adaptable data management policies, integrated in the management tools, taking into account constraints and other non-functional requirements (data availability, security, locality, complexity).
O2: Adapt and evolve, as required, data processing tools that are already available, adding new features so that they can be provisioned in a Large Data Analytics-as-a-Service way. The main changes will focus on exploiting HPC capabilities with the new data management tools, improving the users’ portal and adapting resources management.
-Perform joint research on the usage of a hybrid solution, in which part of the tasks goes to the Cloud and the other part goes to HPC, as a way to optimize the execution;
-Improve the usage of the infrastructure with a complete portal for stakeholders, which will not only facilitate the development of applications and requests for improvements, but also keep track of the way they are executed;
-Adapt and expose data processing tools in such a way that their parallel execution scales up appropriately as datasets grow.
O3: Carry out service activities based on an integrated e-infrastructure, where three data-intensive pilots from the Sustainable Development domain will validate the proposed solutions.
-The pilots will exploit the full potential of the proposed e-Infrastructure, covering several topics related to land monitoring and sustainable management, energy efficiency in farms and 3D farming for soil protection;
-A public instance of the e-Infrastructure will be available for some time, so that stakeholders can experiment with the provided features.
O4: Carry out substantial networking activity, especially in the domain of Sustainable Development, in order to motivate the adoption of the proposed tools among a wider European community.
-Build a network of cooperating organisations, scientists, industries and citizens, who will cooperate on JRAs, through P4A and CoO;
-Support the long-term sustainability of the platform by encouraging a critical mass of stakeholders to use the EUXDAT e-infrastructure through several events and the open instance, in collaboration with computation providers such as PRACE.
We have three pilots:
-Resources optimization: Decide the best combination of crops
-Energy efficiency: Save energy in the farming process
-3D Farming: 3D analysis for precision farming
We will have different data sources:
-Information from stations at the farms
-Information from satellites: 4 TB/day
-Information from UAVs (hyperspectral images) to monitor crop status and predict yield: 2.6 GB/m², so a 1 km² field produces 2.6 PB in 20 minutes
-Information from machinery tracking (position, fuel consumption, engine rotation, etc.)
-Public datasets (Open Land Use, Open Transport Net): 1 TB each
-Information from weather status and forecasts: 20 TB/day
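The UAV figure in the list above can be checked with a few lines of arithmetic. This is a quick sketch that assumes decimal (SI) units, which the notes do not state explicitly; the function names are illustrative only.

```python
# Assumption: decimal (SI) units, i.e. 1 GB = 1e9 bytes and 1 PB = 1e15 bytes.
GB = 10**9
TB = 10**12
PB = 10**15


def uav_volume_bytes(gb_per_m2: float, area_km2: float) -> float:
    """Total data volume for a field, given a per-square-metre density."""
    area_m2 = area_km2 * 1_000_000  # 1 km2 = 1,000,000 m2
    return gb_per_m2 * GB * area_m2


def sustained_rate_tb_per_s(total_bytes: float, minutes: float) -> float:
    """Average rate needed to absorb `total_bytes` within `minutes`."""
    return total_bytes / (minutes * 60) / TB


field_bytes = uav_volume_bytes(2.6, 1.0)
print(f"{field_bytes / PB:.1f} PB per flight")
print(f"{sustained_rate_tb_per_s(field_bytes, 20):.2f} TB/s sustained")
```

Under these assumptions, 2.6 GB/m² over 1 km² does give 2.6 PB, and absorbing it in 20 minutes means a sustained rate of a little over 2 TB/s, far above the satellite and weather feeds combined (24 TB/day).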
There is a set of components/aspects related to data:
-Data connectors
-Data catalogue
-Data movers
-Infrastructure adaptation (e.g. moving data in advance, keeping cached data, etc.)
Then, there is another part related to how the data is processed and the tools used for it:
-Tools catalogue and registry
-Orchestration for their execution
-Adaptation of tools (pCEP, Spark/Hadoop) and infrastructure (profiles)
The management of resources will rely on:
-A hybrid orchestration solution with Cloud and HPC resources
-Monitoring of resources and running applications
-Creation of profiles about the usage of resources
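To make the idea of profile-driven hybrid orchestration concrete, here is a minimal sketch of a placement rule. The `TaskProfile` fields, the 64-core threshold and the rule itself are assumptions chosen for illustration; they do not describe the actual EUXDAT orchestrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical usage profile gathered from monitoring earlier runs."""
    name: str
    cores: int
    tightly_coupled: bool  # e.g. parallel jobs that need a fast interconnect


def choose_target(profile: TaskProfile, hpc_core_threshold: int = 64) -> str:
    """Return "hpc" for large or tightly coupled tasks, otherwise "cloud"."""
    if profile.tightly_coupled or profile.cores >= hpc_core_threshold:
        return "hpc"
    return "cloud"


tasks = [
    TaskProfile("tile-preprocessing", cores=8, tightly_coupled=False),
    TaskProfile("yield-simulation", cores=512, tightly_coupled=True),
]
for task in tasks:
    print(task.name, "->", choose_target(task))
```

A real orchestrator would also weigh data locality, queue waiting times and cost, but the shape stays the same: monitoring feeds the profiles, and the profiles drive where each task runs.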
pCEP can be used at the Edge (for some real-time analytics and for filtering data streams that are too large) and at HPC (real-time analytics of streams in a parallel way).
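The kind of edge-side filtering described above can be illustrated in plain Python. This sketch does not use pCEP or its API; the function name, the window size and the limit are hypothetical, and the rule (forward an event only when a moving average exceeds a limit) is just one simple example of reducing a stream before it leaves the edge.

```python
from collections import deque
from typing import Iterable, Iterator


def threshold_events(readings: Iterable[float], window: int,
                     limit: float) -> Iterator[float]:
    """Yield the moving average whenever it exceeds `limit`.

    Only these derived events would be forwarded; raw readings stay local.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` readings
    for value in readings:
        recent.append(value)
        if len(recent) == window:
            mean = sum(recent) / window
            if mean > limit:
                yield mean


sensor = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
print(list(threshold_events(sensor, window=3, limit=5.0)))
```

Six raw readings are reduced to two forwarded events here, which is the point of filtering at the edge.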
Spark can be used mainly in the Cloud for large (non-real-time) analytics. We’ll try to use it in HPC as well. Hadoop can be used for the same purpose and on the same platforms. We know that HLRS has a special version customized by Cray for HPC.
Geotrellis and Geomesa are intended for raster and vector data analysis, respectively.
Jupyter has been proposed for visualization.