The data science team at Zymergen is applying machine learning techniques to identify genetic targets, work that is supported by extensive analytical automation that systematically identifies outliers, removes process-related bias, and quantifies performance improvements. We’re using Apache Airflow to construct robust data pipelines that allow us to produce clean, reliable inputs to our predictive models. In this talk, I’ll discuss the unique data processing challenges we face in working with high-throughput biological data and provide an overview of how we’re using Apache Airflow to meet those challenges.
6. Learning how to efficiently navigate the genome is the mission of data science at Zymergen.
7. Blocker: process failure
Orchestrating complex experiments with robots is hard, and there are process failures. These failures often cause sporadic, extreme measurement values.
9. Blocker: different interpretations of results
We’re building a platform that can support any microbe and any molecule. Sometimes that results in a proliferation of solutions with disagreement on which is best.
10. Processing pipeline
1. Identify process failures (outlier detection)
2. Quantify and remove process-related bias (normalization)
3. Identify strains that show improvement using consistent criteria (hit detection)
The output: clean model inputs.
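To make the three stages concrete, here is a stdlib-only toy sketch of each one. The thresholds, function names, and methods (MAD scores, per-batch median centering, a fixed improvement margin) are illustrative assumptions, not Zymergen’s actual implementation.

```python
# Toy sketch of the three pipeline stages: outlier detection, normalization,
# and hit detection. All names and thresholds are illustrative.
from statistics import median

def detect_outliers(values, cutoff=3.5):
    """Flag sporadic, extreme measurements via a median-absolute-deviation score."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9
    return [abs(v - med) / mad > cutoff for v in values]

def normalize(values, batch_ids):
    """Remove process-related bias by centering each batch on its own median."""
    batch_medians = {
        b: median(v for v, bid in zip(values, batch_ids) if bid == b)
        for b in set(batch_ids)
    }
    return [v - batch_medians[b] for v, b in zip(values, batch_ids)]

def detect_hits(normalized, control, min_improvement=0.1):
    """Flag strains whose normalized signal beats the control by a fixed margin."""
    return [v - control > min_improvement for v in normalized]
```

Applying a consistent rule in code at each stage is what makes the hit criteria reproducible across experiments.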
11. Rolling our own ETL pipeline
There are many ways to measure the concentration of a molecule. Any microbe, any molecule… any experiment, many data formats.
13. Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
Airflow gives us the flexibility to apply a common set of processing steps to variable data inputs and to schedule complex processing workflows; it has also become a delivery mechanism for our products.
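For readers new to Airflow, a DAG definition is a small, declarative Python file that the scheduler executes. Below is a minimal sketch using the Airflow 1.x-era API (matching the incubator URL above); the DAG and task names are illustrative, not from the talk.

```python
# Minimal sketch of an Airflow DAG definition (Airflow 1.x-era API).
# dag_id and task names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def ingest():
    ...  # pull raw measurements from a source system

def process():
    ...  # outlier detection, normalization, hit detection

with DAG(
    dag_id="strain_measurements",
    start_date=datetime(2017, 1, 1),
    schedule_interval="@daily",
) as dag:
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    process_task = PythonOperator(task_id="process", python_callable=process)
    ingest_task >> process_task  # process runs only after ingest succeeds
```

Because the DAG is ordinary Python, the same processing steps can be wired up for variable inputs programmatically rather than copied per experiment.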
18. Airflow + PyStan
With Bayesian hierarchical models, we estimate (and monitor) the distribution of batch effects (experimental bias).
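The PyStan model itself isn’t reproduced here. As a stdlib-only illustration of the core idea behind a hierarchical batch-effect model, the sketch below uses partial pooling: each batch’s estimated effect is shrunk toward the grand mean, with noisier or smaller batches pulled harder. This is a simplified empirical stand-in for the Bayesian fit, and all names are hypothetical.

```python
# Stdlib-only sketch of partial pooling, the idea behind a Bayesian
# hierarchical batch-effect model (the real pipeline uses PyStan/Stan).
from statistics import mean, pvariance

def pooled_batch_effects(batches):
    """Estimate per-batch effects, shrunk toward the grand mean.

    batches: dict mapping batch id -> list of measurements.
    """
    grand_mean = mean(v for vals in batches.values() for v in vals)
    batch_means = {b: mean(vals) for b, vals in batches.items()}
    # Between-batch spread: how much the batch means themselves vary.
    between_var = pvariance(list(batch_means.values())) or 1e-9
    effects = {}
    for b, vals in batches.items():
        # Squared standard error of this batch's mean; small or noisy
        # batches get pulled harder toward the grand mean.
        se2 = (pvariance(vals) or 1e-9) / len(vals)
        w = between_var / (between_var + se2)
        effects[b] = w * batch_means[b] + (1 - w) * grand_mean
    return effects
```

Monitoring these per-batch estimates over time is what lets the team spot drifting experimental bias.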
19. DropBox
• Scientists at Zymergen work with data using many different tools, including JMP, SQL, and Excel.
• We use a custom DropBox hook to make quick data ingestion pipelines.
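Zymergen’s actual hook isn’t shown in the slides; the sketch below is a hypothetical minimal version built on the public Dropbox v2 content API, using only the standard library. The class and method names are illustrative.

```python
# Hypothetical sketch of a minimal Dropbox "hook" (illustrative names,
# not Zymergen's implementation), using only the stdlib HTTP client.
import json
import urllib.request

class DropboxHook:
    """Fetch files from Dropbox via its v2 HTTP content API."""

    DOWNLOAD_URL = "https://content.dropboxapi.com/2/files/download"

    def __init__(self, token):
        self.token = token

    def _request(self, path):
        # The Dropbox content API passes the file path in a JSON header,
        # not in the request body.
        return urllib.request.Request(
            self.DOWNLOAD_URL,
            headers={
                "Authorization": f"Bearer {self.token}",
                "Dropbox-API-Arg": json.dumps({"path": path}),
            },
            method="POST",
        )

    def download(self, path):
        """Return the raw bytes of the file at the given Dropbox path."""
        with urllib.request.urlopen(self._request(path)) as resp:
            return resp.read()
```

A hook like this lets scientists keep dropping files from JMP or Excel into a shared folder while the pipeline picks them up automatically.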
23. Pairs well with Superset!
“Apache Superset is a modern, enterprise-ready business intelligence web application.”
https://github.com/apache/incubator-superset
25. Fairflow: Functional Airflow
• The core of Fairflow is an abstract base class, foperator, that takes care of instantiating your Airflow operators and setting their dependencies.
• In Fairflow, DAGs are constructed from foperators that create the upstream operators when the final foperator is called.
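Fairflow’s real API isn’t reproduced here; the following plain-Python toy (all names illustrative) shows the pattern described above: each foperator is a factory that, when called, first realizes its upstream foperators and then wires itself downstream of them, so invoking only the final foperator materializes the whole chain.

```python
# Toy sketch of the "functional operator" pattern (illustrative names only;
# see Zymergen's Fairflow for the real abstraction over Airflow operators).
from abc import ABC, abstractmethod

class Task:
    """Stand-in for an Airflow operator: a named node with upstream edges."""
    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

class FOperator(ABC):
    """A task factory: calling it builds its upstream chain, then itself."""
    def __init__(self, task_id, *upstream_foperators):
        self.task_id = task_id
        self.upstream_foperators = upstream_foperators

    @abstractmethod
    def build(self):
        """Instantiate the underlying operator."""

    def __call__(self):
        task = self.build()
        # Recursively realize upstream foperators and wire dependencies.
        for fop in self.upstream_foperators:
            task.upstream.append(fop())
        return task

class SimpleFOperator(FOperator):
    def build(self):
        return Task(self.task_id)

# Calling only the final foperator materializes the whole chain:
extract = SimpleFOperator("extract")
transform = SimpleFOperator("transform", extract)
load = SimpleFOperator("load", transform)
```

The payoff is composability: DAG fragments become values you can pass around and reuse, rather than wiring every operator by hand.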
27. Defining ML workflows
In the DAG definition, create an instance of the task. Then, instantiate a DAG like usual and call the compare task on the DAG.
28. Defining ML workflows
The design allows for simple creation of complicated experimental workflows with arbitrary sets of models, parameters, and evaluation metrics.
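One way to picture “arbitrary sets of models, parameters, and evaluation metrics” is as a task grid: every combination becomes one evaluation task. The sketch below is a hypothetical illustration (names and spec format are assumptions, not Fairflow’s API).

```python
# Illustrative sketch: enumerating an experimental grid of models,
# parameter sets, and metrics as individual task specs.
from itertools import product

def comparison_tasks(models, params, metrics):
    """One task spec per (model, parameter set, metric) combination."""
    return [
        {"task_id": f"eval_{m}_{p}_{metric}", "model": m, "params": p, "metric": metric}
        for m, p, metric in product(models, params, metrics)
    ]

tasks = comparison_tasks(
    models=["ridge", "gp"],
    params=["default", "tuned"],
    metrics=["rmse", "r2"],
)
```

Feeding such a grid into a DAG builder yields one comparison workflow per experiment without hand-writing each task.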
29. Is Airflow for you?
Do you have heterogeneous data sources? Do you have complex dependencies between processing tasks? Do you have data with different velocities? Do you have constraints on your time?
Probably!