4. The Problems
Documentation and discoverability
1. Data scientists and analysts found it difficult to learn what a table is, or how and where it's used
2. Documentation often doesn’t exist!
Governance
3. We have no unified way to tag and identify sensitive data/PII
5. Data catalog to the rescue
Documentation and discoverability
1. Data scientists and analysts found it difficult to learn what a table is, or how and where it's used
a. Single searchable source of documentation
b. Data catalogs can display dependencies between tables/dashboards
2. Documentation often doesn’t exist!
a. The catalog’s existence provides motivation to write documentation
b. The catalog can be used to track down popular but undocumented tables
Governance
3. We have no unified way to tag and identify sensitive data/PII
a. Provides a unified tag database for tags from all integrated data sources
b. Tags can be auto-generated by PII-detection tools, as well as manually added
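As a rough illustration of tracking down popular but undocumented tables, the sketch below ranks tables that have no description by how often they are queried. The `TableRecord` shape and the `query_count` metric are hypothetical stand-ins for whatever usage and description metadata the catalog holds; this is not Amundsen's API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TableRecord:
    """Hypothetical catalog entry; field names are assumptions."""
    name: str
    query_count: int                    # assumed usage metric
    description: Optional[str] = None   # None or blank means undocumented


def undocumented_by_popularity(
    tables: List[TableRecord], min_queries: int = 1
) -> List[TableRecord]:
    """Return undocumented tables, most-queried first."""
    missing = [
        t for t in tables
        if not (t.description or "").strip() and t.query_count >= min_queries
    ]
    return sorted(missing, key=lambda t: t.query_count, reverse=True)


catalog = [
    TableRecord("analytics.orders", 10, "One row per order"),
    TableRecord("analytics.events", 50),
    TableRecord("analytics.sessions", 5, "   "),
    TableRecord("scratch.tmp", 0),
]
worklist = [t.name for t in undocumented_by_popularity(catalog)]
```

Here `worklist` puts the most-queried undocumented tables first, so documentation effort goes where it is most visible.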
6. We chose Amundsen because Amundsen is:
● 🙂 Simple - exactly the feature set we needed
● 🎉 Open source =
○ less expensive
○ highly customizable
○ can be integrated with anything
○ part of a community
❤ open source!
9. We decided not to let people edit descriptions (at least, for now)
We wanted documentation to be version controlled, and we wanted the code to be the single source of truth.
Since people work in code a lot of the time, we felt that it is important for Amundsen to work with code documentation, not against it.
We intend to revisit this though!
11. Package Requirements Issues
We fixed these!
Amundsen Databuilder had a pretty strict requirements.txt, making it difficult to:
1. Integrate it into our existing Airflow codebase
2. Combine it with other libraries
Eventually we unpinned most of the dependencies, and contributed this upstream!
12. Neo4j 4.x
We’d love to fix this upstream!
We decided to use Neo4j 4.x, because:
1. Backups are easier
2. It would give us the opportunity to potentially switch to a SaaS offering, e.g. Neo4j Aura
Amundsen does not support Neo4j 4.x, so we made a few tweaks!
As a result, we’re currently using some forks.
14. Looker Integration
[Diagram: Looker’s sources of truth and Amundsen]
● .LookML files in GitHub: Models and Views; SQL queries
● Dashboards via API: Dashboards reference Models and Views
● Amundsen: Git clone?
16. Looker Integration
1. Build Docker images which convert LookML files to JSON lineage data
2. Use the Docker images in the source repository’s build pipeline to create artifacts and upload them to S3
3. Consumers download the artifacts at run-time and combine them with dashboard data from the API
[Diagram]
● Looker data lineage extraction code: Docker image built in CI/CD; Docker image for Artifact Builder (in ECR)
● Models/Views in LookML repo: CI/CD uses the Docker image to build lineage.json (in S3)
● Amundsen Airflow DAG: combines LookML view data with API dashboard data
● Dashboards from Looker API
17. Looker Integration - Stage 1
[Diagram]
● Model/View files (in Looker GitHub repo)
● Parse LookML files and build lineage graph during CI
● lineage.json (in S3)
● Neo4j
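The CI step of parsing LookML and emitting lineage could look roughly like the sketch below. It is deliberately simplified: it only picks up views that declare `sql_table_name`, uses regular expressions instead of a real LookML parser, and the JSON layout (`{"views": {...}}`) is a made-up placeholder, since the slides do not show the actual lineage.json format.

```python
import json
import re
from typing import Dict

# Matches the start of a view block, e.g. "view: orders {"
VIEW_RE = re.compile(r"^\s*view:\s*(\w+)\s*\{", re.MULTILINE)
# Matches "sql_table_name: schema.table ;;"
TABLE_RE = re.compile(r"sql_table_name:\s*(.+?)\s*;;", re.DOTALL)


def view_to_table(lookml: str) -> Dict[str, str]:
    """Map each view name to the table named in its sql_table_name."""
    matches = list(VIEW_RE.finditer(lookml))
    lineage: Dict[str, str] = {}
    for i, match in enumerate(matches):
        # A view's body runs until the next view declaration (or end of file).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(lookml)
        table = TABLE_RE.search(lookml[match.end():end])
        if table:  # derived-table views are skipped in this sketch
            lineage[match.group(1)] = table.group(1).strip('`"')
    return lineage


def to_lineage_json(lineage: Dict[str, str]) -> str:
    """Serialize the mapping; the layout is an assumption."""
    return json.dumps({"views": lineage}, sort_keys=True)


SAMPLE = """
view: orders {
  sql_table_name: analytics.orders ;;
  dimension: id { sql: ${TABLE}.id ;; }
}
view: derived {
  derived_table: { sql: select 1 ;; }
}
"""
sample_lineage = view_to_table(SAMPLE)
sample_json = to_lineage_json(sample_lineage)
```

A production version would need to handle derived tables, `extends`, and explores, which is why a dedicated parser is likely the better choice there.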
18. Looker Integration - Stage 2
[Diagram]
● Model/View files (in Looker GitHub repo): parse LookML files and build lineage graph during CI
● lineage.json (in S3): download lineage data from S3
● Dashboards (from Looker API): query dashboards using Looker SDK
● Amundsen Airflow Pipeline: combines LookML view data with API dashboard data
● Neo4j
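The combining step in the Airflow pipeline can be pictured as a join between the view lineage and the views each dashboard references. The sketch below assumes both inputs are already in memory (downloading from S3 and calling the Looker SDK are left out) and uses hypothetical shapes: a `view -> table` dict and dashboards as dicts with `id` and `views` keys.

```python
from typing import Dict, Iterable, List, Tuple


def dashboard_table_edges(
    view_tables: Dict[str, str],
    dashboards: Iterable[dict],
) -> List[Tuple[str, str]]:
    """Return (dashboard_id, table) pairs for loading into a graph store."""
    edges: List[Tuple[str, str]] = []
    for dashboard in dashboards:
        # Views missing from the lineage file are ignored in this sketch.
        tables = sorted(
            {view_tables[v] for v in dashboard["views"] if v in view_tables}
        )
        edges.extend((dashboard["id"], table) for table in tables)
    return edges


view_tables = {"orders": "analytics.orders", "users": "analytics.users"}
dashboards = [
    {"id": "42", "views": ["orders", "users", "unknown_view"]},
    {"id": "7", "views": ["orders"]},
]
edges = dashboard_table_edges(view_tables, dashboards)
```

Each resulting pair corresponds to one dashboard-to-table relationship that could then be written to Neo4j.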
23. Customized Snowflake Integration
One query can pull all the metadata we need from many tables across multiple databases:
● Columns
● Last update time
● Documentation
● Table usage
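A sketch of what such a query and its post-processing might look like is below. It assumes the account-wide `SNOWFLAKE.ACCOUNT_USAGE` views are the source, which the slides do not confirm, and it covers only columns, last update time, and documentation; table usage is left out. The row and output shapes are hypothetical.

```python
from typing import Dict, Iterable

# Assumption: metadata comes from SNOWFLAKE.ACCOUNT_USAGE, which spans databases.
METADATA_QUERY = """
SELECT c.table_catalog, c.table_schema, c.table_name,
       c.column_name, c.data_type, c.comment AS column_comment,
       t.last_altered, t.comment AS table_comment
FROM snowflake.account_usage.columns AS c
JOIN snowflake.account_usage.tables AS t ON c.table_id = t.table_id
WHERE c.deleted IS NULL AND t.deleted IS NULL
"""


def group_rows(rows: Iterable[dict]) -> Dict[str, dict]:
    """Fold flat per-column rows into one metadata record per table."""
    tables: Dict[str, dict] = {}
    for row in rows:
        key = f'{row["table_catalog"]}.{row["table_schema"]}.{row["table_name"]}'
        entry = tables.setdefault(key, {
            "last_altered": row["last_altered"],
            "description": row["table_comment"],
            "columns": [],
        })
        entry["columns"].append({
            "name": row["column_name"],
            "type": row["data_type"],
            "description": row["column_comment"],
        })
    return tables


# Hand-written rows standing in for query results.
rows = [
    {"table_catalog": "PROD", "table_schema": "SALES", "table_name": "ORDERS",
     "column_name": "ID", "data_type": "NUMBER", "column_comment": "Order id",
     "last_altered": "2021-01-01", "table_comment": "One row per order"},
    {"table_catalog": "PROD", "table_schema": "SALES", "table_name": "ORDERS",
     "column_name": "TOTAL", "data_type": "NUMBER", "column_comment": None,
     "last_altered": "2021-01-01", "table_comment": "One row per order"},
]
metadata = group_rows(rows)
```

Pulling everything in one pass keeps the extraction to a single round trip, with the per-table grouping done in the pipeline.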