
Dockerizing a multi-component Open Data app



  1. Dockerizing a multi-component Open Data app
      Athens Docker Meetup, June 2016
      Dimitris Negkas, Stergios Tsiafoulis
      dimneg@gmail.com, s.tsiafoulis@gmail.com
  2. Description and Scope
      LinkedEconomy (http://linkedeconomy.org/)
      • is a publicly available web platform and linked data repository.
      • its scope is to transform, curate, aggregate, interlink and publish economic data in machine-readable format, to enable
          • citizen awareness
          • research with unprecedented data
          • evidence-based policy
  3. Data Sources
      Sources currently used:
      • Transparency - DIAVGEIA
      • Central Electronic Registry of Public Procurement - E-Procurement
      • National Strategic Reference Framework (NSRF)
      • Central Market of Thessaloniki (CMT)
      • e-Prices
      • Fuel Prices
      • Municipality of Athens, Municipality of Thessaloniki
      • Government of Australia
  4. Data growth
      • we use OpenLink Virtuoso for 15 different sources of nearly 1B triples
      • we host 27 datasets in CKAN from 15 organizations
      • the data grows correspondingly each month
  5. Data processing
      • Each data source is handled and processed separately, as its available data are not provided uniformly or in machine-readable format.
      • Diavgeia, “NSRF” and the Observatories for product and fuel prices provide a rich API that can easily be queried for machine-readable data in JSON format.
      • In the cases of E-Procurement, “CMT” and “Municipalities of Athens and Thessaloniki” there is no API available. We have therefore developed a software module that gathers online information automatically and stores it in a machine-readable format.
  6. General Architecture
      • Process model
      • Open economic data related to public budgeting, spending and prices are characterized by high volume, velocity, variety and veracity.
      • We have to build custom components under the common logic of transforming static data into linked open data streams.
  7. Process model: Nucleus
      • The nucleus of our approach is semantic modelling, data enrichment and interconnections.
      • Data are stored raw (as harvested from sources), in RDF and in JSON formats.
  8. Process model: Data distribution
      • Enriched data are distributed through five channels: 1. Data dumps (CKAN), 2. SPARQL queries, 3. Web, 4. Social media, 5. Structured inputs to Business Intelligence (BI) systems.
      • Additionally, data can be further analysed and exchanged with relevant platforms (e.g. SPARQL to R).
  9. Process model: Validation and messenger
      • The validation component runs throughout the whole process in order to safeguard high data quality by detecting errors.
      • The messaging component works as an internal messaging and alert system for all components.
  10. Process flow
  11. Infrastructure
                 Functionalities / Components      Services / Data sources
      VM1        linkedeconomy.org                 apache, php, mysql, drupal
      VM2        SPARQL endpoint, demo site        OLV, apache, php, mysql, drupal
      VM3        Harvester                         CouchDB, Lucene, apache, mysql / CKAN (Greek Datasets)
      VM4        Harvester, Messenger              mysql, LinkedEconomy dropbox
      VM5        Storage - Secondary triplestore   CouchDB, OLV, CouchDB-Lucene, docker
      VM6        Harvester                         apache, php, mysql, drupal / CKAN (Foreign Datasets)
      VM7        SPARQL endpoint                   OLV (Foreign graphs)
      VM8        Management                        JIRA, mysql, tomcat
      VM9        Dashboard                         front-end, CMS, INSPINIA
      VM10       System administration             VPN, firewalls, etc.
      Physical   Storage - Core triplestore        OLV (Greek graphs)
      As core infrastructure we use ~okeanos, an established cloud-based service provided for the Greek research and academic community.
  12. LinkedEconomy
  13. CKAN
  14. “Hottest” Prices per municipality
  15. Supermarkets Geoinformation
  16. Application System
      Diagram labels: Small Applications; Java, Php and UNIX Scripts; Di@vgeia; KHMDHS; Virtuoso; CouchDB; Drupal; MySql; ePrices; CKAN; fuelPrices; QGIS
  17. Dockerize the System
      Diagram labels: Di@vgeia; KHMDHS; ePrices; Virtuoso; Drupal; MySql; QGIS Desktop; CouchDB; QGIS Server; Small Applications; CKAN
  18. With Compose 2
  19. Docker MySQL
      version: '2'
      services:
        mysql:
          # Will build the image from your directory
          build: ./mysql-docker/5.6
          container_name: eLodDrupalmySQL
          # Save your data !!
          volumes:
            - /mysql_drupal:/var/lib/mysql
          environment:
            - MYSQL_DATABASE=drupalelod
            - MYSQL_ROOT_PASSWORD=eLodmysqlpass
          # Do not use flag “always” in your development environment!
          restart: on-failure
  20. Docker Drupal
        drupal:
          build: ./docker-drupal
          command:
            - /start.sh
          # Will start the service only after MySQL service
          depends_on:
            - mysql
          container_name: eLodDrupal
          #image: eLodDrupal
          ports:
            - "8081:80"
          volumes:
            - "/data_drupal:/var/www/html"
          # Will link the container with MySQL container
          links:
            - "mysql"
          environment:
            - MYSQL_DATABASE=drupalelod
            - MYSQL_USER=root
            - MYSQL_PASSWORD=eLodmysqlpass
            - DRUPAL_ADMIN_PW=eLODDR
            - DRUPAL_ADMIN=admin
            - MYSQL_HOST=eLodDrupalmySQL
            - DRUPAL_ADMIN_EMAIL=stetsiafoulis@gmail.com
          restart: on-failure
  21. Docker Virtuoso
        virtuoso:
          build: ./docker-virtuoso
          container_name: eLodVirtuoso
          ports:
            - "8890:8890"
          volumes:
            - /virtuoso/db:/var/lib/virtuoso/db
          environment:
            - DBA_PASSWORD=eLodVir
            - SPARQL_UPDATE=true
            - DEFAULT_GRAPH=http://localhost:8890/DAV
          restart: on-failure
  22. Docker QGIS
        qgisdesktop:
          #image: kartoza/qgis-desktop:2.14
          build: ./qgis-desktop/2.14
          hostname: qgis-server
          volumes:
            #Wherever you want to mount your data from
            - ./gis:/gis
            #Unix socket for X11
            - "/tmp/.X11-unix:/tmp/.X11-unix"
          links:
            - db:db
          environment:
            - DISPLAY=unix:1
          command: /usr/bin/qgis
  23. Build the system
      • Clone the repository from GitHub: https://github.com/stetsiafoulis/eLOD
      • Create the directories where you are going to link your data
      • Enter docker-compose up -d and that’s it!
  24. Why Docker?
      • Portable
      • Lightweight
      • Move to different cloud infrastructures and to physical servers
      • Run on virtual machines for development and testing
      • Easily scale
      • Easy delivery and deployment
      • Run anywhere (regardless of host distro, physical, cloud or not)
      • Run anything
  25. What’s Next?
  26. Scaling per Source
      Diagram labels: Di@vgeia and KHMDHS, each with its own set of Virtuoso, Drupal, MySql, CouchDB, QGIS Desktop, QGIS Server and Small Applications
  27. Run Small Apps through Docker API
      Diagram label: Small Applications
  28. Next Steps - Swarm
      Diagram labels: Virtuoso; Drupal; MySql; CouchDB; QGIS Server
      • Cluster management
      • Scaling
      • State reconciliation
      • Multi-host networking
      • Service discovery
      • Load balancing
  29. Next Steps - Consul
      • Health Checking
      • Service Discovery
      • Multi Datacenter support
  30. Any Questions?
  31. Appendix - Data Sources links
      LinkedEconomy (http://linkedeconomy.org/), linkedeconomy@gmail.com
      Sources currently used:
      • Transparency - DIAVGEIA: https://diavgeia.gov.gr
      • Central Electronic Registry of Public Procurement - E-Procurement (KHMDHS): http://www.eprocurement.gov.gr
      • National Strategic Reference Framework (NSRF): https://www.espa.gr/en
      • Central Market of Thessaloniki (CMT): http://www.kath.gr/
      • e-Prices: http://www.e-prices.gr/
      • Fuel Prices: http://www.fuelprices.gr/
      • Municipality of Athens: https://www.cityofathens.gr/khe/proypologismos
      • Municipality of Thessaloniki: http://www.thessaloniki.gr/portal/page/portal/DioikitikesYpiresies/GenDnsiDioikOikonYpiresion/DnsiDiafanEksipirDimoton/Tmima Diafaneias/AnoiktiDdiathesiDedomenon/DimosiefsiEktelesisProipologismou/ektelesi-proypologismou
      • Government of Australia: http://data.gov.au/
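
Slides 19 and 20 note that Compose starts the drupal service only after the mysql service. As a rough illustration of that idea, the Python sketch below derives a start order from `depends_on`-style declarations. The dictionary mirrors the services on the slides (virtuoso has no declared dependency on slide 21); `start_order` is a hypothetical helper written for this example and is not part of Docker Compose.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Service name -> services it depends on, as declared on the compose slides.
DEPENDS_ON = {
    "mysql": [],
    "drupal": ["mysql"],
    "virtuoso": [],
}


def start_order(depends_on):
    """Return service names so that every dependency precedes its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    # mysql always appears before drupal in the printed order.
    print(start_order(DEPENDS_ON))
```

Ordering of this kind only controls when containers are started; it does not wait for MySQL to be ready to accept connections, so a start script such as the slides' /start.sh may still need its own retry logic.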

Editor's Notes

  • Open economic data related to public budgeting, spending and prices are characterized by high volume, velocity, variety and veracity.
  • The infrastructure comprises 10 virtual machines with 2GB to 8GB of RAM and 20GB to 100GB of storage, as well as a non-commodity (physical) server with 12 CPUs, 64GB of RAM and more than 4TB of storage.
  • This map shows which municipalities are the most expensive for a specific product, e.g. milk, fruit or petrol.
    The color scale indicates the price of the product in each municipality: the redder, the more expensive.
  • We also use QGIS to display geoinformation about supermarkets and other POIs on the map.
  • The system consists of a CKAN data portal, Drupal, Virtuoso, MySQL instances, QGIS server, CouchDB and many scripts of different technologies and scope.
    We use this system of apps to process information from different data sources.

    As mentioned before, the system runs on the cloud-based infrastructure ~okeanos.
    In some cases we need to move the system to, or back it up on, different cloud or physical infrastructures.
    This is where Docker helped us: it let us do that easily and with little effort.
  • We started to dockerize the services one by one, until we decided to use the new Compose 2.
    Compose creates the entire system with a single command:
    docker-compose up -d

    It also creates an internal network and attaches the containers to it automatically.

  • Restart policies
    no: Do not automatically restart the container when it exits. This is the default.
    on-failure[:max-retries]: Restart only if the container exits with a non-zero exit status. Optionally, limit the number of restart retries the Docker daemon attempts.
    always: Always restart the container regardless of the exit status. When you specify always, the Docker daemon will try to restart the container indefinitely. The container will also always start on daemon startup, regardless of the current state of the container.
    unless-stopped: Always restart the container regardless of the exit status, but do not start it on daemon startup if the container has been put to a stopped state before.
    An ever-increasing delay (double the previous delay, starting at 100 milliseconds) is added before each restart to prevent flooding the server. This means the daemon will wait for 100 ms, then 200 ms, 400, 800, 1600, and so on until either the on-failure limit is hit, or you docker stop or docker rm -f the container.
    If a container is successfully restarted (the container is started and runs for at least 10 seconds), the delay is reset to its default value of 100 ms.
    You can specify the maximum number of times Docker will try to restart the container when using the on-failure policy. The default is that Docker will try forever to restart the container. The number of (attempted) restarts for a container can be obtained via docker inspect. For example, to get the number of restarts for container “my-container”: docker inspect -f "{{ .RestartCount }}" my-container
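
The back-off rule in the note above (start at 100 ms, double before each restart, reset after a run of at least 10 seconds) can be sketched in a few lines of Python. This only models the arithmetic described in the note; the function names are hypothetical and nothing here talks to the Docker daemon.

```python
def restart_delays(restarts, initial_ms=100):
    """Delay in milliseconds before each of the first `restarts` restarts.

    Models the rule from the note: 100 ms first, then double each time.
    """
    return [initial_ms * 2 ** attempt for attempt in range(restarts)]


def next_delay(previous_ms, ran_seconds, initial_ms=100, reset_after_s=10):
    """Delay before the next restart, given how long the container just ran.

    A run of at least `reset_after_s` seconds counts as a successful
    restart and resets the delay; otherwise the previous delay doubles.
    """
    if ran_seconds >= reset_after_s:
        return initial_ms
    return previous_ms * 2


if __name__ == "__main__":
    print(restart_delays(5))   # the sequence quoted in the note
    print(next_delay(800, 3))  # container died after 3 s: delay doubles
    print(next_delay(800, 12)) # container ran 12 s: delay resets
```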
  • Cluster management integrated with Docker Engine: Use the Docker Engine CLI to create a Swarm of Docker Engines where you can deploy application services. You don’t need additional orchestration software to create or manage a Swarm.

    Decentralized design: Instead of handling differentiation between node roles at deployment time, the Docker Engine handles any specialization at runtime. You can deploy both kinds of nodes, managers and workers, using the Docker Engine. This means you can build an entire Swarm from a single disk image.

    Declarative service model: Docker Engine uses a declarative approach to let you define the desired state of the various services in your application stack. For example, you might describe an application comprised of a web front end service with message queueing services and a database backend.

    Scaling: For each service, you can declare the number of tasks you want to run. When you scale up or down, the swarm manager automatically adapts by adding or removing tasks to maintain the desired state.

    Desired state reconciliation: The swarm manager node constantly monitors the cluster state and reconciles any differences between the actual state and your expressed desired state. For example, if you set up a service to run 10 replicas of a container, and a worker machine hosting two of those replicas crashes, the manager will create two new replicas to replace the ones that crashed. The swarm manager assigns the new replicas to workers that are running and available.

    Multi-host networking: You can specify an overlay network for your services. The swarm manager automatically assigns addresses to the containers on the overlay network when it initializes or updates the application.

    Service discovery: Swarm manager nodes assign each service in the swarm a unique DNS name and load balances running containers. You can query every container running in the swarm through a DNS server embedded in the swarm.

    Load balancing: You can expose the ports for services to an external load balancer. Internally, the swarm lets you specify how to distribute service containers between nodes.

    Secure by default: Each node in the swarm enforces TLS mutual authentication and encryption to secure communications between itself and all other nodes. You have the option to use self-signed root certificates or certificates from a custom root CA.

    Rolling updates: At rollout time you can apply service updates to nodes incrementally. The swarm manager lets you control the delay between service deployment to different sets of nodes. If anything goes wrong, you can roll-back a task to a previous version of the service.
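
The desired state reconciliation example in this note (10 replicas, a worker hosting two of them crashes) can be illustrated with a small Python sketch. It is a toy model, not Swarm's scheduler: `reconcile` is a hypothetical function, the worker names are made up, and placing each missing replica on the least-loaded healthy worker is an assumption chosen for simplicity. The sketch only adds missing replicas and does not scale down.

```python
def reconcile(desired_replicas, assignments, healthy_workers):
    """Return a worker -> replica count mapping that restores the desired count.

    assignments: dict mapping worker name to replicas currently placed there.
    healthy_workers: set of workers that are running and available.
    """
    if not healthy_workers:
        raise RuntimeError("no healthy workers available")
    # Replicas on crashed workers are considered lost.
    placement = {w: assignments.get(w, 0) for w in healthy_workers}
    missing = desired_replicas - sum(placement.values())
    # Assumed strategy: put each missing replica on the least-loaded worker.
    for _ in range(max(missing, 0)):
        target = min(sorted(placement), key=placement.get)
        placement[target] += 1
    return placement


if __name__ == "__main__":
    before = {"worker-1": 4, "worker-2": 4, "worker-3": 2}
    # worker-3 crashes, taking two replicas with it.
    print(reconcile(10, before, {"worker-1", "worker-2"}))
```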

  • What is Consul?
    Consul has multiple components, but as a whole, it is a tool for discovering and configuring services in your infrastructure.
    It provides several key features:
    Service Discovery: Clients of Consul can provide a service, such as api or mysql, and other clients can use Consul to discover providers of a given service. Using either DNS or HTTP, applications can easily find the services they depend upon.
    Health Checking: Consul clients can provide any number of health checks, either associated with a given service ("is the webserver returning 200 OK"), or with the local node ("is memory utilization below 90%"). This information can be used by an operator to monitor cluster health, and it is used by the service discovery components to route traffic away from unhealthy hosts.
    Key/Value Store: Applications can make use of Consul's hierarchical key/value store for any number of purposes, including dynamic configuration, feature flagging, coordination, leader election, and more. The simple HTTP API makes it easy to use.
    Multi Datacenter: Consul supports multiple datacenters out of the box. This means users of Consul do not have to worry about building additional layers of abstraction to grow to multiple regions.
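
To make the link between service discovery and health checking concrete, here is a minimal in-memory Python sketch of a catalog that only returns instances whose health check passes. It is not Consul's API: `ServiceCatalog`, its methods and the addresses are all hypothetical, and a real deployment would register services with Consul agents and query them over DNS or HTTP as the note describes.

```python
class ServiceCatalog:
    """Toy in-memory service catalog with health checks (not Consul's API)."""

    def __init__(self):
        # service name -> {address: zero-argument health check callable}
        self._instances = {}

    def register(self, service, address, check):
        """Register one instance of a service together with its health check."""
        self._instances.setdefault(service, {})[address] = check

    def discover(self, service):
        """Return sorted addresses of instances whose check currently passes."""
        return sorted(
            address
            for address, check in self._instances.get(service, {}).items()
            if check()
        )


if __name__ == "__main__":
    catalog = ServiceCatalog()
    catalog.register("mysql", "10.0.0.5:3306", lambda: True)   # healthy
    catalog.register("mysql", "10.0.0.6:3306", lambda: False)  # failing check
    # Only the healthy instance is returned, so traffic avoids the other one.
    print(catalog.discover("mysql"))
```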