2. It is a pleasure to be here with you
Thank you very much
Image credit: Turkish Airlines
3. tati_alchueyr.__doc__
● Brazilian living in London since 2014
● Senior Data Engineer at the BBC Datalab team
● Graduated in Computer Engineering at Unicamp
● Passionate software developer for 16 years
● Experience in the private and public sectors
● Developed software for Medicine, Media and Education
● Loves Open Source
● Loves Brazilian Jiu Jitsu
● Proud mother of Amanda
InVesalius 3 software
6. help(bbc)
● British Broadcasting Corporation
● Values
○ Independent, impartial and honest
○ Audiences are at the heart of everything we do
○ We take pride in delivering quality and value for money
○ Creativity is the lifeblood of our organisation
○ We respect each other and celebrate our diversity so
that everyone can give their best
● Purpose
○ Inform
○ Educate
○ Entertain
New Broadcasting House
London, UK
7. bbc.stats()
➢ BBC TV reaches 91% of the UK adult population
➢ BBC News reaches a global audience of 426 million people weekly
Reference 1: BBC
Reference 2: BBC
Image Credit: BBC
8. bbc.datalab.mission
“Bring the BBC’s data together
accessible through a common platform,
along with flexible and scalable tools to
support machine learning to enable
content enrichment and deeper
personalisation”
9. Some of the Datalab team members (15 August 2019)
bbc.datalab.mission
22. scheduling workflows cron jobs
Several cron jobs running...
It seems a critical job didn’t run
last night...
Didn’t it run? Did it fail?
Why could it have failed?
Original image credit: XKCD
27. airflow why
● Handle complex relationships between jobs
● Handle all the jobs centrally, with a well-defined user interface
● Error reporting and alerting
● Viewing and analyzing job run times
● Security (protecting credentials of databases)
28. airflow why not
● In many cases, cron jobs are the simplest and most
effective tool
● Airflow is a complex tool made of several components
○ Learning curve
○ Infrastructure management cost
29. airflow concepts (i) DAG
● All workflows are considered to be DAGs
○ DAG: Directed Acyclic Graph
Diagram: nodes connected by directed edges
35. airflow concepts (ii) DAG properties
● DAGs (usually) have:
○ schedule
○ start time
○ unique name (ID)
○ nodes: jobs (instances of Operators)
○ edges: dependencies between the nodes
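A minimal sketch of how these properties map onto the Airflow 1.x Python API (the DAG ID, schedule and start date below are made-up example values); nodes and edges are added by instantiating operators and chaining them, as in the next sketch:

    from datetime import datetime

    from airflow import DAG

    dag = DAG(
        dag_id="my_first_dag",            # unique name (ID)
        schedule_interval="@daily",       # schedule
        start_date=datetime(2019, 8, 1),  # start time
    )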
36. airflow concepts (iii) operators
● Operators define the task or job
○ BashOperator: execute shell commands/scripts
○ PythonOperator: execute Python code
○ BranchPythonOperator: choose which task to execute next based on a condition
○ SlackOperator
○ (...)
○ Custom operators
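A sketch showing two of these operators as nodes of a DAG, with a directed edge between them (task IDs, the shell command and the callable are illustrative; Airflow 1.x import paths):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("operators_example", start_date=datetime(2019, 8, 1),
              schedule_interval="@daily")

    def process():
        print("processing the downloaded data")

    download = BashOperator(
        task_id="download",
        bash_command="echo downloading...",
        dag=dag,
    )
    transform = PythonOperator(
        task_id="transform",
        python_callable=process,
        dag=dag,
    )

    download >> transform  # edge: transform depends on download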
48. scars of experience installing python packages
● When using a PythonOperator, the job is run within the worker
● Therefore, by default, Python dependencies are installed globally on the workers
● In other words, application deployments can break your
Airflow environment
49. scars of experience installing python packages
● Isolate the execution from the scheduling, when
reasonable
● Debugging native operators means debugging Airflow itself
● Alternatives for isolating execution (one is sketched below):
○ PythonVirtualenvOperator
○ DockerOperator
○ KubernetesPodOperator
○ GceInstanceStartOperator
Interesting reading: Medium
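As an illustration of the first alternative, a sketch using PythonVirtualenvOperator (Airflow 1.x), which runs the callable in a freshly created virtualenv so its dependencies never touch the worker's global environment; the task ID and the pandas pin are illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonVirtualenvOperator

    dag = DAG("isolation_example", start_date=datetime(2019, 8, 1),
              schedule_interval=None)

    def score():
        # Imports must live inside the callable: it is serialised and
        # executed in a separate virtualenv, not in the worker process
        import pandas as pd
        print(pd.__version__)

    score_task = PythonVirtualenvOperator(
        task_id="score",
        python_callable=score,
        requirements=["pandas==0.25.0"],  # installed only in the virtualenv
        system_site_packages=False,       # do not inherit worker packages
        dag=dag,
    )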
50. scars of experience debugging
There was a breaking change in an Airflow plugin, so the scheduler couldn't process the DAG
51. scars of experience debugging
The DAG had been deleted from the worker instances, but its metadata was no longer available in the scheduler
52. scars of experience debugging
● Error messages are not always obvious
○ Understand what is happening in the system
○ The webserver and scheduler are independent
processes
55. scars of experience versioning can be tricky
● Log the versions of the DAG, Operators and Plugins when they are run
● When catchup is enabled, newly added jobs will also be scheduled for previous execution dates
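Catchup is controlled per DAG; a minimal sketch (Airflow 1.x, made-up DAG ID) of turning it off so past schedule intervals are not backfilled:

    from datetime import datetime

    from airflow import DAG

    dag = DAG(
        dag_id="nightly_report",
        start_date=datetime(2019, 1, 1),
        schedule_interval="@daily",
        catchup=False,  # do not backfill intervals between start_date and now
    )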
58. scars of experience using xcom between jobs
● By default, the return value of the operator's execute method is stored in XCom
● XCom values are stored in the Airflow metadata DB
● Alternatives:
○ Avoid using XCom
○ Store the state in data stores (databases, object stores, etc.)
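A sketch of the default behaviour described above, assuming Airflow 1.x and made-up task names: the first task's return value is pushed to XCom automatically, and the second pulls it back through the metadata DB:

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    dag = DAG("xcom_example", start_date=datetime(2019, 8, 1),
              schedule_interval=None)

    def produce():
        return {"rows": 42}  # the return value is pushed to XCom automatically

    def consume(**context):
        # the value travels through the Airflow metadata DB
        value = context["ti"].xcom_pull(task_ids="produce")
        print(value)

    produce_task = PythonOperator(task_id="produce", python_callable=produce,
                                  dag=dag)
    consume_task = PythonOperator(
        task_id="consume",
        python_callable=consume,
        provide_context=True,  # Airflow 1.x: pass the context into the callable
        dag=dag,
    )

    produce_task >> consume_task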
60. scars of experience breaking changes
● Minor versions of Airflow can introduce breaking changes
○ Example: a named parameter of S3Hook was renamed (1.8 -> 1.9)
■ 1.8: s3_conn_id
■ 1.9: aws_conn_id
Reference: Airflow development mailing list
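The rename in code, assuming a connection named my_s3 (the connection ID is illustrative):

    from airflow.hooks.S3_hook import S3Hook

    # Airflow 1.8: the parameter was called s3_conn_id
    # hook = S3Hook(s3_conn_id="my_s3")

    # Airflow 1.9: the same parameter became aws_conn_id, breaking
    # callers that passed it by name
    hook = S3Hook(aws_conn_id="my_s3")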
61. where did all the magic of
machine learning workflows go?
Image credit: XKCD
64. airflow machine learning specifics
● Machine learning jobs are similar to usual jobs
● Factors which can affect the choice of operator:
○ is the model built using the same Python version?
○ how much CPU and memory does your model need?
○ how can you make Airflow use your existing infrastructure?
○ how many concurrent workers do you need?
■ Limitations on scaling the Celery executor
■ The Kubernetes executor is at an early stage
Interesting reading: Medium
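One way these factors play out in code: a sketch using KubernetesPodOperator (Airflow 1.10 contrib) to give a training job its own image, CPU and memory instead of running it on a shared worker; the image name, namespace and resource figures are all illustrative:

    from datetime import datetime

    from airflow import DAG
    from airflow.contrib.operators.kubernetes_pod_operator import (
        KubernetesPodOperator,
    )

    dag = DAG("ml_example", start_date=datetime(2019, 8, 1),
              schedule_interval=None)

    train = KubernetesPodOperator(
        task_id="train_model",
        name="train-model",
        namespace="default",
        image="gcr.io/my-project/trainer:latest",  # hypothetical image
        # Airflow 1.10 accepts a plain dict for the pod's resources
        resources={
            "request_cpu": "2",
            "request_memory": "8Gi",
            "limit_cpu": "4",
            "limit_memory": "16Gi",
        },
        dag=dag,
    )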