Canadian Experts Discuss Modern Data Stacks and Cloud Computing for 5 Years of Serverless Toronto

  1. Canadian Experts Discuss Modern Data Stacks and Cloud Computing. Ugo Udokporo of GCP: Building Secure Serverless Delivery Pipelines on GCP. Nadji Bessa of Infostrux Solutions: Trends in the Data Engineering Consulting Landscape. Jacob Frackson of Montreal Analytics: From Data-driven Business to Business-driven Data (Hands-on Data Modelling exercise).
  2. From Data-driven Business to Business-driven Data
  3. Jacob Frackson (he/him) Practice Lead, Analytics Engineering
  4. ● Data is being generated in many different ways across the business, and it’s very source-centric ● Stakeholders are thinking about business problems, and in a business-centric way Business Context
  5. ● Translating Business Questions into Data Questions – but what if we can help bridge the gap? ● Data models are the abstraction layer, the API that gives your stakeholders rich access to data without needing to know its nuances (Data Sources → Data Model → Business Users) Why a data model?
  6. ● Kimball Dimensional Modelling ● Inmon Enterprise Data Warehousing ● Data Vault (2.0) ● One Big Table Which methodology?
  7. ● You have business questions about the checkout flow on your website: ● The flow: ○ User visits a product page ○ User clicks on a product ○ User adds the item to their cart ○ User checks out the cart Example: Checkout Flow
  8. ● You have business questions about the checkout flow on your website: ○ [Finance] How much revenue is coming in online and from what products? ○ [Marketing] Which channels and platforms are converting and which aren’t? ○ [Product] How many pages does the average customer look at before buying? ○ [Operations] When are orders coming in and for what geos? Check out this book for a more detailed explanation: Example: Checkout Flow
  9. BEAM Canvas
  10. Choosing a Fact Type
  11. Example: Checkout Flow
  12. Which fact types are most appropriate for each question? ● [Finance] How much revenue is coming in online and from what products? ○ TF, ASF, or PSF ● [Marketing] Which channels and platforms are converting and which aren’t? ○ ASF or PSF ● [Product] How many pages does the average customer look at before buying? ○ CF, ASF, or PSF ● [Operations] When are orders coming in and for what geos? ○ TF, ASF, PSF We’ll start with an ASF, and then potentially a CF or PSF Example: Checkout Flow
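(Editor's note, not from the talk: a hedged sketch of what the chosen ASF could look like, assuming ASF means a Kimball-style accumulating snapshot fact with one row per cart that is updated as it moves through the funnel. Table and column names are hypothetical.)
    -- Hypothetical accumulating snapshot fact for the checkout funnel (illustrative sketch only)
    create table fct_checkout_funnel (
        cart_id              varchar primary key,
        customer_key         integer,
        product_key          integer,
        channel_key          integer,        -- supports the marketing conversion question
        geo_key              integer,        -- supports the operations geo question
        product_viewed_at    timestamp,
        added_to_cart_at     timestamp,
        checked_out_at       timestamp,      -- null until the order is placed
        order_revenue        numeric(12, 2), -- null until checkout completes
        pages_viewed         integer         -- supports the "pages before buying" question
    );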
  13. Conclusion ● Prioritize the implementation of your data model ● Build on top of it: ○ Business Intelligence ○ Machine Learning ○ Reverse ETL ○ And beyond! ● Other skills to learn: ○ Analytics Engineering and dbt ○ RBAC and Access Control Models ○ Database or data warehouse optimization
  14. Thank you! Questions?
  15. Trends in the Data Engineering Consulting Landscape Nadji Bessa, Engineering Manager
  16. Agenda Explore common trends in data engineering across: ● Projects ● Data Engineering (practice) ● dbt ● Tooling
  17. Projects
  18. What are customers asking for? Some of the markets we have worked with are: ● Financial institutions ● Pharmaceuticals ● Retailers ● Wholesalers ● Etc… Overwhelmingly, data engineering projects are driven by Business Analysis/Business Intelligence enablement objectives. We do, however, also see a small percentage of Data Science work.
  19. What are our clients’ needs? All types of companies are attempting to become more data-driven. Although some domain-specific expertise is needed to successfully complete a project, fundamentally, once we get to the level of the data, we observe similar patterns repeating across all business verticals. Their data needs are essentially the same.
  20. What data visualization platforms are the most prominent? ● Tableau ● Power BI ● Sigma ● Looker
  21. What are the biggest strategic challenges in tackling data engineering projects? From a strategy standpoint, it is hard to do good data cloud projects without first having a good cloud infrastructure (or at least a good* IT infrastructure) - cloud enablement must precede data cloud enablement
  22. What are the biggest operational challenges in tackling data engineering projects? Having a consultative engagement with all stakeholders early on in the lifecycle of a project* . Having an effective collaboration with our customers while delivering a solution**.
  23. What are the tactical challenges in tackling data engineering projects? Not having access to the environment. Working with a disparity of data stack tools - it is often imperative to standardize on a tool stack before being able to collaborate effectively. The rapid pace of change in tooling, as well as its impact on training and keeping technical resources’ skills relevant.
  24. Data Engineering
  25. How should you classify your data? There are no noticeable patterns, and as an organization, we tend to recommend the following. Classify your data by: ● Environment ● Processing State ● Non-functional Aspects of Architecture ● Data usage pattern ● Business Domain or Area ● Project ● Product ● Tenant or Customer ● Organization Structure
  26. Do you implement the same data structures across different projects? For example, we subscribe to favouring ELT over ETL as a model for ingesting data into our data warehousing platform. And we subscribe to a clearly delineated data architecture with ingest, clean, normalize, integrate, analyze and egress layers… but these design principles are strong beliefs, loosely held… It is important to do what is right for the customer, and that means simplifying or eliminating certain steps if they are not necessary.
  27. Which aspect of a data engineering project is the most difficult? Based on what I have seen so far… The most important item would be documentation - without it, it is impossible to start any data engineering project… A close second would be Data Quality: with most other broad aspects of data management, if the technology is not mature enough, processes can be put in place to compensate for that… Data Quality is the single most difficult item to get right the first time around and to keep in a good state moving forward.
  28. dbt An excerpt from content published in: https://medium.com/infostrux-solutions/crafting-better-dbt-projects-aa5c48aebfc9
  29. Data Staging Layers There are six sub-directories under the dbt models directory, representing the previously mentioned layers, i.e. ingest, clean, normalize, integrate, analyze, and egest. Note that ingest, clean, and normalize are organized by data source.
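(Editor's note: a hedged sketch of what that layout could look like; the source names below are hypothetical, not taken from the talk.)
    models/
      ingest/
        source_a/
        source_b/
      clean/
        source_a/
        source_b/
      normalize/
        source_a/
        source_b/
      integrate/
      analyze/
      egest/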
  30. Model Configs We recommend defining model configs in the dbt_project.yml file (not in each model header or in a .yml file under the models’ sub-directories); this helps to avoid code redundancy. (to be continued)
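(Editor's note: the slide's code is not reproduced in this transcript; a minimal sketch of such a dbt_project.yml section, assuming a project named my_project and illustrative materializations, could look like this.)
    # dbt_project.yml (illustrative sketch; project name and materializations are assumptions)
    models:
      my_project:
        ingest:
          +materialized: view
        clean:
          +materialized: view
        normalize:
          +materialized: view
        integrate:
          +materialized: table
        analyze:
          +materialized: table
        egest:
          +materialized: view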
  31. Model Configs (continuation) If we need to provide special configs for specific models in the directory, we can provide them in models’ headers which will override the configs in the dbt_project.yml file: (to be continued)
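(Editor's note: again an illustrative sketch only; the model name is hypothetical. An override in a model header would look roughly like this, taking precedence over the dbt_project.yml setting.)
    -- models/analyze/some_model.sql (hypothetical)
    {{ config(materialized='table') }}

    select * from {{ ref('some_upstream_model') }}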
  32. Model Configs (continuation) For each model, we recommend having a .yml file (model_name.yml) with the descriptions under that model’s directory:
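(Editor's note: a hedged example of such a per-model .yml file; names and descriptions are made up for illustration.)
    # models/analyze/some_model.yml (hypothetical)
    version: 2
    models:
      - name: some_model
        description: "One row per order, prepared for reporting."
        columns:
          - name: order_id
            description: "Primary key of the order."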
  33. Sources Only the ingest layer should contain information about sources (sources’ descriptions in .yml files). Different subcategories of sources should be stored separately. Therefore, different subfolders under the ingest folder should be created for different sources. We recommend creating a separate .yml file per source table (source_table_name.yml) under the corresponding directory.
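(Editor's note: a hedged sketch of a per-source-table .yml under the ingest layer; the source, schema and table names are assumptions, while PROD_INGEST follows the database naming described later in the deck.)
    # models/ingest/source_a/orders.yml (hypothetical names)
    version: 2
    sources:
      - name: source_a
        database: PROD_INGEST
        schema: raw
        tables:
          - name: orders
            description: "Raw orders table as delivered by the source system."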
  34. Style Guide Poorly written code is nothing other than technical debt as it increases implementation time and costs! We would recommend that you develop a custom SQL Style Guide to develop models. This guide should be adapted from the dbt Style Guide and a few others with the goal of maximizing code maintainability.
  35. Automation Automating checks for adherence to code style guides is probably the only sane way to enforce them. Linters exist for exactly that purpose. They should be part of any project’s CI pipeline to ensure code merged to all repos follows the same standard. Of particular interest are SQLFluff (https://github.com/sqlfluff/sqlfluff) and the SQLFluff extension for Visual Studio Code (https://marketplace.visualstudio.com/items?itemName=dorzey.vscode-sqlfluff), which help developers ensure code is style-conformant before they submit it to the CI pipeline.
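(Editor's note: for example, the linter can be run locally or in CI with the commands below; this is a sketch only, and the Snowflake dialect is an assumption.)
    # lint all models (dialect assumed to be Snowflake)
    sqlfluff lint models --dialect snowflake
    # optionally auto-fix style violations before committing
    sqlfluff fix models --dialect snowflake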
  36. dbt Tests dbt tests are used to check data transformations and to validate the values of the source data. We will be digging into this more in a future article.
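(Editor's note: as a hedged illustration, built-in generic tests are typically declared in the model's .yml file; the model and column names here are hypothetical.)
    # illustrative only
    version: 2
    models:
      - name: some_model
        columns:
          - name: order_id
            tests:
              - unique
              - not_null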
  37. Source Freshness dbt provides source freshness check functionality right out of the box, and as we know, data providers can fail to deliver a source file. Automated ingestion of source data files can fail as well. Both scenarios can result in stale/inaccurate data. Setting up source data freshness checks to ensure that dbt models work with the current data is advisable.
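(Editor's note: a hedged sketch of a freshness configuration; the thresholds, source name and loaded_at_field are assumptions.)
    # illustrative source freshness config
    version: 2
    sources:
      - name: source_a
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
        tables:
          - name: orders
The checks are then executed with dbt source freshness.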
  38. Version Control All dbt projects should be managed in a version control system such as git. As a team, we advise that you pick a branching strategy that works for you; common options are Git Flow, GitHub Flow, or trunk-based development.
  39. CI/CD for dbt To ensure code and implementation quality, CI/CD pipelines should run linting and unit tests before any branch is allowed to be merged into development, both to enforce coding standards and to validate the integrity of the implementation.
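(Editor's note: a minimal sketch of what such a CI job could run, assuming SQLFluff for linting and a dedicated "ci" target defined in profiles.yml; both names are assumptions.)
    # hypothetical CI steps
    sqlfluff lint models
    dbt deps
    dbt build --target ci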
  40. Environments For production and development purposes, we use two different environments: PROD and DEV. We support all six layers of our data staging model in the DEV environment. Environments are defined by providing a single env_name variable instead of using the dbt standard approach (such as the target.name and target.database internal variables). This makes the configuration more flexible when we switch environments or add a new environment.
  41. Environment Variables When generating database object names, providing environment-related values as dbt variables, rather than referring to dbt internal environment variables (such as target.name, target.database, etc.), can sometimes be a more effective solution. For instance, in the sample project below, database names are generated using the env_name variable and are fully independent of the dbt environment settings. (to be continued)
  42. Environment Variables (continuation) In the dbt_project.yml file:
      # Define variables here.
      # DEV or PROD. It is used to generate the environment name for the source database.
      # DEV by default. If it is not provided, then DEV_<DB_NAME> (DEV_INGEST for example); if provided, <env_name>_<DB_NAME> (PROD_INGEST).
      vars:
        env_name: 'DEV'
      (to be continued)
  43. Environment Variables (continuation) Database name generation macro:
      -- e.g. dev_clean or prod_ingest, where clean and ingest are the 'stage_name'
      --#> MACRO
      {% macro generate_database_name(stage_name, node) %}
      {% set default_database = target.database %}
      {% if stage_name is none %}
          {{ default_database }}
      {% else %}
          {{ var("env_name") }}_{{ stage_name | trim }}
      {% endif %}
      {% endmacro %}
      --#< MACRO
      (to be continued)
  44. Environment Variables (continuation) The variable is provided to the dbt command when we need to use a value other than the default. For example:
      dbt run --vars 'env_name: "PROD"'
There is no need to provide anything for DEV, as it uses the default value:
      dbt run
When switching between environments, this solution is helpful as there is no need to update environment settings.
  45. Data Load Data from the sources is loaded only into the PROD_INGEST database. All layers above it are deployed by dbt models. Moreover, models of each layer refer only to models from previous layers or the same layer. To deploy the DEV environment, the DEV_INGEST database is cloned from the PROD_INGEST database (unless there is a requirement to move DEV data separately) and all remaining layers of the DEV environment are created by dbt models. Seeds can be loaded in different layers depending on their usage.
  46. Dev Environments We can generate dev environments by cloning the ingest layer of the PROD environment. Typically we try to have all six layers of our architecture in dev as well; this is achieved by creating the ingest layer for DEV as a clone of the prod ingest layer, after which all other layers are created by dbt models on top of it. The cloning can be defined in a macro (a simple cloning macro below):
      {% macro clone_database(source_database_name, target_database_name) %}
      {% set sql %}
          CREATE OR REPLACE DATABASE {{target_database_name}} CLONE {{source_database_name}};
      {% endset %}
      {% do run_query(sql) %}
      {% endmacro %}
Then, cloning can be run as a dbt operation by a job:
      dbt run-operation clone_database --args '{source_database_name: PROD_INGEST, target_database_name: DEV_INGEST}'
Please note that the user running the job should have OWNERSHIP permission on the target database, as the job replaces the existing database.
  47. Tooling
  48. What data ingestion tools/platforms are the most popular? This is what our clients have used or are using so far: ● Fivetran ● Airbyte ● Matillion ● Snaplogic ● Supermetrics ● Talend ● AWS Glue, to name a few…
  49. What are the most popular source systems? The source systems are: ● Mostly structured data (SQL) hosted on MS-SQL/MySQL servers on-premise or in the Cloud ● Occasionally semi-structured data (JSON) and very little unstructured data - mostly as individual files in some data lake (S3 on AWS is by far the favourite)
  50. Building a software delivery pipeline using Google Cloud Build & Cloud Deploy, by Ugo Udokporo (Medium, Jan 2023): https://medium.com/@ugochukwu007/building-a-software-delivery-pipeline-using-google-cloud-build-cloud-deploy-9b8574a863a4 Hey folks! In an earlier post we went through a step-by-step guide on building Google Kubernetes Engine clusters using the GitOps methodology. In this blog we will attempt to build an end-to-end nginx service delivery pipeline on the pre-built clusters (dev, uat & prod), leveraging Google Cloud Build and Google Cloud Deploy. Let’s get started! The Architecture Priyanka Vergadia created a great architecture that helps us understand the pipeline flow. This architecture can also be used to implement a phased production rollout that can span multiple GKE regional clusters (e.g. prod-us-east1, prod-asia-east1, etc.).
  51. Google Cloud Deploy is a managed service that automates delivery of your applications to a series of target environments in a defined promotion sequence. When you want to deploy your updated application, you create a release, whose lifecycle is managed by a delivery pipeline. (Architecture diagram by Priyanka Vergadia.) Our implementation will be based on this git repo, so let’s do a quick walkthrough of its contents. Cloudbuild.yaml
  52. The cloudbuild.yaml consists of four steps: a docker build & tag step, a docker push to Google Container Registry (GCR) step, a Cloud Deploy pipeline registration step, and a release creation step. More info on Cloud Build can be found here.
      steps:
      - id: 'build nginx image'
        name: 'gcr.io/cloud-builders/docker'
        args: ['build', '-t', 'gcr.io/$DELIVERY-PROJECT_ID/nginx:1.1.0', 'nginx/' ]
      # Push to GCR
      - name: 'gcr.io/cloud-builders/docker'
        id: 'Pushing nginx to GCR'
        args: ['push', 'gcr.io/$DELIVERY-PROJECT_ID/nginx:1.1.0']
      - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
        id: 'Registering nginx pipeline'
        entrypoint: 'bash'
        args:
        - '-c'
        - gcloud deploy apply --file=clouddeploy.yaml --region=us-central1 --project=$
      - name: 'gcr.io/google.com/cloudsdktool/cloud-sdk'
        entrypoint: 'bash'
        args:
        - '-c'
        - >
          gcloud deploy releases create release-$BUILD_ID
          --delivery-pipeline=nginx-pipeline
          --region=us-central1
          --images=userservice=gcr.io/$DELIVERY-PROJECT_ID/nginx:1.1.0
  53. Clouddeploy.yaml The Google Cloud Deploy configuration file or files define the delivery pipeline, the targets to deploy to, and the progression of those targets. The delivery pipeline configuration file can include target definitions, or those can be in a separate file or files. By convention, a file containing both the delivery pipeline config and the target configs is called clouddeploy.yaml, and a pipeline config without targets is called delivery-pipeline.yaml. But you can give these files any name you want. Our configuration defines three GKE targets (dev, uat & prod) built across two regions (us-central1 & us-west1).
      apiVersion: deploy.cloud.google.com/v1beta1
      kind: DeliveryPipeline
      metadata:
        name: nginx-pipeline
      description: Nginx Deployment Pipeline
      serialPipeline:
        stages:
        - targetId: dev
        - targetId: uat
        - targetId: prod
  54.   ---
      apiVersion: deploy.cloud.google.com/v1beta1
      kind: Target
      metadata:
        name: dev
      description: dev Environment
      gke:
        cluster: projects/$DEV-PROJECT_ID/locations/us-west1/clusters/dev-cluster
      ---
      apiVersion: deploy.cloud.google.com/v1beta1
      kind: Target
      metadata:
        name: uat
      description: UAT Environment
      gke:
        cluster: projects/$UAT-PROJECT_ID/locations/us-central1/clusters/uat-cluster
      ---
      apiVersion: deploy.cloud.google.com/v1beta1
      kind: Target
      metadata:
        name: prod
      description: prod Environment
      gke:
        cluster: projects/$PROD-PROJECT_ID/locations/us-west1/clusters/prod-cluster
Nginx folder This consists of the nginx Dockerfile and its build dependencies.
  55. Skaffold.yaml Skaffold is a command line tool that facilitates continuous development for container-based & Kubernetes applications. Skaffold handles the workflow for building, pushing, and deploying your application, and provides building blocks for creating CI/CD pipelines. This enables you to focus on iterating on your application locally while Skaffold continuously deploys to your local or remote Kubernetes cluster, local Docker environment or Cloud Run project.
      apiVersion: skaffold/v2beta16
      kind: Config
      deploy:
        kubectl:
          manifests: ["app-manifest/nginx.yaml"]
app/manifest/nginx.yaml
  56.   apiVersion: apps/v1 # for versions before 1.9.0 use apps/v1beta2
      kind: Deployment
      metadata:
        name: nginx
      spec:
        strategy:
          type: Recreate
        selector:
          matchLabels:
            app: nginx
        replicas: 3 # tells deployment to run 3 pods matching the template
        template: # create pods using pod definition in this template
          metadata:
            labels:
              app: nginx
          spec:
            containers:
            - name: nginx
              image: gcr.io/$DELIVERY-PROJECT_ID/nginx:1.1.0
              ports:
              - containerPort: 80
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: nginx
        namespace: default
        labels:
          app: nginx
      spec:
        externalTrafficPolicy: Local
        ports:
        - name: http
          port: 80
          protocol: TCP
          targetPort: 80
        selector:
          app: nginx
        type: LoadBalancer
  57. Build time!
      Step 1: Clone and recreate the git repo.
      Step 2: Grant the N-computer@developer.gserviceaccount.com service accounts in dev, uat & prod permission to the container registry in the delivery-pipeline project.
      Step 3: Grant the N-computer@developer.gserviceaccount.com service account from the delivery-pipeline project the Kubernetes Engine Developer role in the dev, uat & prod projects.
      Step 4: Create and run a cicd-nginx pipeline build trigger in Cloud Build. This can also be done using Terraform as part of IaC.
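(Editor's note: a hedged sketch of Step 3, not from the article; the project ID and service-account address below are placeholders. Step 2 would be granted analogously.)
    # Step 3 sketch: grant the delivery-pipeline project's compute service account
    # the Kubernetes Engine Developer role in the dev project (repeat for uat & prod)
    gcloud projects add-iam-policy-binding DEV_PROJECT_ID \
      --member="serviceAccount:N-compute@developer.gserviceaccount.com" \
      --role="roles/container.developer"
    # Step 2 would similarly grant the dev/uat/prod compute service accounts read access
    # to the registry in the delivery-pipeline project (e.g. roles/storage.objectViewer for GCR)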
  71. (Screenshot: nginx-pipeline Cloud Build trigger.)
  59. (Screenshots: successful nginx-pipeline build history; nginx Cloud Deploy pipeline.) Step 5: Promote the build from dev to uat to prod. This is done by clicking promote and deploy.
  60. This is the process of advancing a release from one target to another, according to the progression defined in the delivery pipeline. When your release is deployed into a target defined in your delivery pipeline, you can promote it to the next target.
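(Editor's note: besides the console button, promotion can also be scripted; a hedged sketch follows, where the release and pipeline names echo the earlier cloudbuild step and are otherwise assumptions.)
    # promote the current release to the next target in the pipeline's progression
    gcloud deploy releases promote \
      --release=release-$BUILD_ID \
      --delivery-pipeline=nginx-pipeline \
      --region=us-central1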
  61. (Screenshots.)
  62. You can require approval for any target, and you can approve or reject releases into that target. Approvals can be managed programmatically by integrating your workflow management system (such as ServiceNow), or another system, with Google Cloud Deploy using Pub/Sub and the Google Cloud Deploy API. To require approval on any target, set requireApproval to true in the target configuration:
      apiVersion: deploy.cloud.google.com/v1beta1
      kind: Target
      metadata:
        name: prod
      description: prod Environment
      requireApproval: true
      gke:
        cluster: projects/$PROD-PROJECT_ID/locations/us-west1/clusters/prod-cluster
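(Editor's note: for completeness, a hedged sketch of approving a pending rollout from the CLI; the rollout name is hypothetical.)
    # approve a rollout that is waiting on the prod target's requireApproval gate
    gcloud deploy rollouts approve ROLLOUT_NAME \
      --release=release-$BUILD_ID \
      --delivery-pipeline=nginx-pipeline \
      --region=us-central1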
  63. Congratulations! You made it. Now, changes made to the nginx git repo are automatically built and deployed to dev, with a promotion/rollback option to/from the higher environments. Official product links:
      Google Cloud Deploy: https://cloud.google.com/deploy
      Google Cloud Deploy Terminology: https://cloud.google.com/deploy/docs/terminology
      Creating Delivery Pipelines and Targets: https://cloud.google.com/deploy/docs/create-pipeline-targets
  64. www.ServerlessToronto.org Reducing the gap between IT and Business needs