Data security is rapidly gaining importance as the volume of data companies collect, analyze, and monetize grows exponentially. New data processing tools and platforms are emerging at an increasing rate, as are the ways in which organizations consume data. In this presentation, Mukund Sarma and Feni Chawla discuss the unique technical and cultural challenges of running a data security program and share practical solutions that have worked well at their company.
These slides were presented at the BSides Seattle 2024 conference.
Speaker: Venkatesh Umaashankar
LinkedIn: https://www.linkedin.com/in/venkateshumaashankar/
What will be discussed?
What is Data Science?
Types of data scientists
What makes a Data Science Team? Who are its members?
Why does a DS team need a Full Stack Developer?
Who should lead the DS team?
Building a Data Science team in a Startup vs. an Enterprise
Case studies on:
Evolution of Airbnb’s DS Team
How Facebook onboards and trains its DS team
Apple’s Acqui-hiring Strategy to build a DS team
Spotify’s ‘Center of Excellence’ Model
Who should attend?
Managers
Technical Leaders who want to get started with Data Science
User Management - the Next-Gen of Authentication (meetup, 27 January 2022) by Lior Mazor
Authentication is evolving. Customers are expecting much more from the user management experience in applications they are using today. Join us virtually for our upcoming "User Management - the next-gen of Authentication" meetup to learn about the secrets of building user management the right way, the secure way.
The 3 Key Barriers Keeping Companies from Deploying Data Products (Dataiku)
Getting from raw data to deployed data-driven solutions requires technology, data, and people. All of these exist. So why aren’t we seeing more truly data-driven companies? What’s missing, and why? During Strata Hadoop World Singapore 2015, Pauline Brown, Director of Marketing at Dataiku, explains how a lack of collaboration is keeping companies from building and deploying data products effectively. Learn more about Dataiku and Data Science Studio: www.dataiku.com
Transform Banking with Big Data and Automated Machine Learning, 9.12.17 (Cloudera, Inc.)
Banks are rich in valuable data and can build and maintain a competitive advantage by identifying and executing on high-value machine learning projects that leverage that data. This webinar will describe use cases fit for big data and machine learning in the banking sector (commercial, consumer, regulatory, and markets) and the impact they can have on your organization.
3 things to learn:
* How to create a next generation data platform and why it is important
* How to monetize big data using predictive modeling and machine learning
* What is needed for automated machine learning as a sustainable, cost-effective, and efficient solution
apidays LIVE Paris 2021 - Data privacy in the era of cloud native app by Guil... (apidays)
This document discusses data privacy in cloud-native applications. It defines key concepts like data privacy, data security, and privacy engineering. It outlines the 7 principles of GDPR including lawfulness, fairness, transparency, and others. It notes that data privacy laws are expanding globally and becoming crucial for business as more developers build features using microservices and third parties. Engineering teams are increasingly complex and privacy teams cannot keep up with mapping data flows, documenting privacy controls, and identifying risks. The document introduces Bearer, a tool that can catalog engineering components, map personal data flows, and automatically trigger risk assessments to help privacy engineers monitor privacy risks continuously without code changes.
The document discusses cyber security and information systems. It covers topics like the types of information systems, components of an information system, development of information systems, introduction to information security and the CIA triad, and the need for information security. The presenter, Mrs. Nidhi Rastogi, discusses these topics in detail over several slides.
Unlocking AI Potential: Leveraging PIA Processes for Comprehensive Impact Ass... (TrustArc)
Artificial Intelligence (AI) has emerged as a transformative force in various industries, from healthcare to finance and beyond. While AI offers incredible opportunities, it also raises ethical, legal, and social challenges that must be addressed. To navigate this complex landscape in the world of privacy, it is crucial to conduct comprehensive Privacy Impact Assessments (PIAs).
Conducting PIAs in this dynamic and evolving world of AI has brought new challenges to the privacy world. With AI increasingly being integrated into different areas of our lives, understanding the intersection between AI and PIAs is essential for any organization to ensure they are privacy forward.
Take advantage of this opportunity to gain a comprehensive understanding of AI impact assessments and their role in shaping the future of AI. In this insightful webinar, our experts will explore the power of Privacy Impact Assessments (PIAs) in ensuring responsible AI development and deployment.
In this webinar, some key topics that will be covered include:
- Introduction to AI PIAs
- PIAs demystified (why they are essential in the context of AI)
- Explore the evolving legal and regulatory landscape governing AI and privacy, including GDPR, CCPA, and other international standards
- Best practices for conducting effective PIAs in AI projects
- Future outlooks for AI and PIAs
Data science involves extracting meaningful insights from raw data through scientific methods and algorithms. It is an interdisciplinary field that focuses on analyzing large datasets using skills from computer science, mathematics, and statistics. Python is a commonly used programming language for data science due to its powerful libraries for tasks like data analysis, machine learning, and visualization. Key Python libraries include NumPy, Pandas, Matplotlib, Scikit-learn, and SciPy. The document then discusses tools, applications, and basic concepts in data science and Python.
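As a minimal sketch of the kind of workflow these libraries enable (the data and figures below are hypothetical, and NumPy alone is used for brevity; Pandas, Scikit-learn, and Matplotlib extend the same pattern to richer analysis, modeling, and visualization):

```python
import numpy as np

# Hypothetical daily ad-spend vs. revenue figures, for illustration only.
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
revenue = np.array([25.0, 44.0, 66.0, 85.0, 105.0])

# Descriptive statistics: the kind of summary Pandas exposes via describe().
print("mean revenue:", revenue.mean())

# A least-squares linear fit (slope, intercept) is the simplest "modeling" step;
# Scikit-learn's LinearRegression generalizes this to many features.
slope, intercept = np.polyfit(spend, revenue, 1)
predicted = slope * 60.0 + intercept  # extrapolate to a new spend level
print("predicted revenue at spend=60:", round(predicted, 1))
```

The same three moves, summarize, fit, predict, underlie most of the data science process the document goes on to describe.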
Presented at SplunkLive! Paris 2018: Get More From Your Machine Data With Splunk AI
- Why AI & Machine Learning?
- What is Machine Learning?
- Splunk's Machine Learning Tour
- Use Cases & Customer Stories
Protecting endpoints from targeted attacks (AppSense)
This document discusses strategies for protecting endpoints from targeted attacks. It begins with an overview of the increasing threats facing organizations from malware and cyber attacks. It then outlines five principles for an effective endpoint security strategy: 1) get organizational endpoints in order through vulnerability management and application control, 2) focus on protecting data rather than infrastructure on unmanaged devices, 3) utilize thin clients and cloud-based solutions, 4) implement a zero-trust approach to authentication, and 5) maintain visibility into endpoint activity. The document recommends implementing application control, patching vulnerabilities, deploying recommended security practices, improving authentication, and integrating network and endpoint security controls. It emphasizes continuing to shift focus to securing unmanaged devices by decoupling protection from infrastructure.
Top learnings from evaluating and implementing a DLP Solution (Priyanka Aash)
This document provides an executive summary of Escorts IT's Data Loss Prevention (DLP) project review. It begins with background on Escorts, a 65-year-old Indian engineering company with four divisions and a combined turnover of Rs. 5000 crores. It then outlines three key data security challenges Escorts faces around data location, movement, and policy enforcement. The DLP project aims to address these by securing data across devices and networks. Implementation involved evaluating vendors, piloting solutions, establishing governance, training users on classification, and integrating with existing systems. Key learnings emphasized treating DLP as a business rather than IT project and properly managing change.
SplunkLive! Paris 2018: Legacy SIEM to Splunk (Splunk)
Presented at SplunkLive! Paris 2018: Legacy SIEM to Splunk, How to Conquer Migration and Not Die Trying:
- Why?
- SIEM Replacement
- Use Cases
- Data Sources & Data Onboarding
- Architecture
- Third Party Integrations
- You Got This
Data Science involves extracting insights from vast amounts of data using scientific methods and algorithms. It includes concepts like Statistics, Visualization, Machine Learning, and Deep Learning. The Data Science process goes through steps like Discovery, Preparation, Modeling, and Communication. Important roles include Data Scientist, Engineer, Analyst, and Statistician. Tools include R, SQL, Python, and SAS. Applications are in search, recommendations, recognition, gaming, and pricing. The main challenge is the variety of information and data required.
This document discusses fundamentals of IoT data analytics. It defines IoT analytics and explains challenges including dealing with large amounts of data, security issues, and misbehaving devices. It categorizes IoT data as either structured or unstructured, and as data in motion or at rest. Structured data fits a predefined model while unstructured data lacks structure. Data in motion passes through networks while data at rest is stored. Both predictive and prescriptive analytics provide more value but are more complex than descriptive or diagnostic analysis. Class activities involve capturing IoT data examples and presenting categorization and challenges.
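The two-axis categorization described above (structured vs. unstructured, in motion vs. at rest) can be sketched as a small classifier over record metadata; the field names and sample records below are assumptions for illustration, not taken from the source:

```python
# Categorize IoT records along the two axes described above:
# structure (structured vs. unstructured) and state (in motion vs. at rest).
def categorize(record):
    structure = "structured" if record.get("schema") else "unstructured"
    state = "in motion" if record.get("in_transit") else "at rest"
    return structure, state

# Hypothetical examples of each kind.
sensor_reading = {"schema": ["device_id", "temp_c", "ts"],
                  "in_transit": True}    # telemetry crossing the network
archived_video = {"schema": None,
                  "in_transit": False}   # raw footage sitting in storage

print(categorize(sensor_reading))
print(categorize(archived_video))
```

A reading with a predefined schema that is crossing the network lands in "structured, in motion"; archived raw footage lands in "unstructured, at rest", matching the document's distinction between data passing through networks and data in storage.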
This document discusses protecting clients' data and brand reputation by tackling data security issues. It identifies top worries like social media, ineffective patching, and email. It asks key questions about data management and notes that data loss prevention alone cannot fully protect data both on and off premises from human and physical factors. It emphasizes that the largest challenge is fixing the human element through extensive user training and fostering a security-focused culture. It also stresses the need for a holistic approach combining technical controls and user awareness training to securely protect data at all stages.
TechWise with Eric Kavanagh, Dr. Robin Bloor and Dr. Kirk Borne
Live Webcast on July 23, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=59d50a520542ee7ed00a0c38e8319b54
Analytical applications are everywhere these days, and for good reason. Organizations large and small are using analytics to better understand every aspect of their business: customers, processes, behaviors, even competitors. There are several critical success factors for using analytics effectively: 1) know which kind of apps make sense for your company; 2) figure out which data sets you can use, both internal and external; 3) determine optimal roles and responsibilities for your team; 4) identify where you need help, either by hiring new employees or using consultants; 5) manage your program effectively over time.
Register for this episode of TechWise to learn from two of the most experienced analysts in the business: Dr. Robin Bloor, Chief Analyst of The Bloor Group, and Dr. Kirk Borne, Data Scientist, George Mason University. Each will provide their perspective on how companies can address each of the key success factors in building, refining and using analytics to improve their business. There will then be an extensive Q&A session in which attendees can ask detailed questions of our experts and get answers in real time. Registrants will also receive a consolidated deck of slides, not just from the main presenters, but also from a variety of software vendors who provide targeted solutions.
Visit InsideAnalysis.com for more information.
Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. It allows organizations to collect massive amounts of data and ensure the data is highly usable by data scientists and analysts. As data volumes continue to grow exponentially, data engineers are needed to process and channel data to enable fields like machine learning and deep learning.
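A toy version of such a system, collect, transform, then store in a queryable form, can be sketched with the standard library alone; the event fields, table name, and units below are illustrative assumptions, not a real schema:

```python
import sqlite3

# Toy extract-transform-load pipeline: the shape of what data engineers build,
# scaled down to a Python list and an in-memory SQLite database.
raw_events = [
    {"user": "alice", "ms": 1200},
    {"user": "bob", "ms": 3400},
    {"user": "alice", "ms": 900},
]

# Transform: normalize units so downstream consumers see consistent data.
cleaned = [(e["user"], e["ms"] / 1000.0) for e in raw_events]

# Load: store in a queryable form for data scientists and analysts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, seconds REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)", cleaned)

# Analysts can now aggregate without ever touching the raw feed.
totals = conn.execute(
    "SELECT user, SUM(seconds) FROM events GROUP BY user ORDER BY user"
).fetchall()
print(totals)
```

The separation matters: once the pipeline owns collection and normalization, every downstream model or dashboard queries one consistent store instead of re-parsing the raw feed.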
Regulatory compliance mandates have historically focused on IT & endpoint security as the primary means to protect data. However, as our digital economy has increasingly become software dependent, standards bodies have dutifully added requirements as they relate to development and deployment practices. Enterprise applications and cloud-based services constantly store and transmit data; yet, they are often difficult to understand and assess for compliance.
This webcast will present a practical approach towards mapping application security practices to common compliance frameworks. It will discuss how to define and enact a secure, repeatable software development lifecycle (SDLC) and highlight activities that can be leveraged across multiple compliance controls. Topics include:
* Consolidating security and compliance controls
* Creating application security standards for development and operations teams
* Identifying and remediating gaps between current practices and industry-accepted "best practices"
SplunkLive! Zurich 2018: Get More From Your Machine Data with Splunk & AI (Splunk)
This presentation discusses how Splunk and machine learning can help organizations get more value from their machine data. It describes how machine learning can improve decision making, uncover hidden trends, alert on deviations, and forecast incidents. The presentation provides an overview of Splunk's machine learning capabilities, including search, packaged solutions, and the machine learning toolkit. It also showcases several customer use cases that have benefited from Splunk's machine learning offerings, such as network incident detection, security/fraud prevention, and optimizing operations.
The document provides legal disclaimers and information about sustainable cybersecurity practices. It discusses starting cybersecurity at the administration level by making it cultural rather than technical, based on needs rather than vendor features, and iterative and continuous. It also discusses establishing a data protection steering committee and reducing reliance on individuals by ensuring responsibilities are understood and policies and processes are documented. The document provides recommendations on cybersecurity frameworks, controls, and best practices.
SplunkLive! Munich 2018: Get More From Your Machine Data with Splunk & AI (Splunk)
Presented at SplunkLive! Munich 2018:
- Why AI & Machine Learning?
- What is Machine Learning?
- Splunk's Machine Learning Tour
- Use Cases & Customer Stories
Agile Data Science is a lean methodology adopted from Agile Software Development. At its core, it centers on people, interactions, and building minimum viable products that ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past, with examples. Get started today with our help by visiting http://www.alpinenow.com
Data science involves extracting insights from vast amounts of data using scientific methods and algorithms. It includes concepts like statistics, visualization, machine learning, and deep learning. The data science process includes steps like data discovery, preparation, modeling, and operationalizing results. Important roles include data scientist, engineer, analyst, and statistician. Tools include R, SQL, Python, and SAS. Applications are in internet search, recommendations, image recognition, gaming, and price comparison. The main challenge is obtaining a high variety of information and data for accurate analysis.
Data Analytics Today - Data, Tech, and Regulation (Hendri Karisma)
This document discusses analytics, data, technology, and regulation. It begins with an introduction to Hendri Karisma and his role in data and analytics. It then defines data analytics and describes the main types: descriptive, diagnostic, predictive, and prescriptive analytics. The document outlines different data roles including data scientist, data analyst, data engineer, and AI/ML engineer. It emphasizes that building data and AI solutions requires expertise not just in science but also engineering and an understanding of relevant regulations to ensure systems are secure, trusted and reliable.
Data - Science and Engineering, slides from the Bandungpy Sharing Session (Hendri Karisma)
This document discusses data science and engineering roles. It defines data scientist and data engineer roles. Data scientists analyze large amounts of data to answer questions and drive organizational strategy, while data engineers build systems to collect, manage and transform raw data for analysis. The document also discusses the role of AI engineers, who develop complex algorithms and infrastructure for AI systems. It provides examples of responsibilities for each role and the data science experiment process.
The document summarizes key points from a presentation on privacy for tech startups. It discusses why privacy is important for startups to consider, providing practical information security controls startups can implement, and new privacy principles from the GDPR that startups should be aware of. Some highlights include:
- Privacy should be a priority from the start and can help startups win trust among users and investors.
- Practical security controls include encrypting data, patching systems, training employees, and monitoring for vulnerabilities.
- The GDPR introduces new principles like data protection by design, security of processing, breach notification requirements, data protection impact assessments, and data protection officers.
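Of the practical controls listed above, one can be made concrete with the standard library alone: never store raw passwords, store a salted, slow hash instead. The sketch below uses PBKDF2 from Python's `hashlib`; the iteration count and function names are illustrative assumptions, not a recommendation from the deck:

```python
import hashlib
import hmac
import os

# Iteration count is illustrative; higher is slower for attackers and for you.
ITERATIONS = 600_000

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # a unique salt per user defeats rainbow tables
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, digest)  # constant-time comparison

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("wrong guess", salt, digest))                   # False
```

For a startup, controls like this cost little to adopt early and directly support the GDPR "security of processing" principle mentioned above.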
- Privacy should be a priority from the start and can help startups win trust among users and investors.
- Practical security controls include encrypting data, patching systems, training employees, and monitoring for vulnerabilities.
- The GDPR introduces new principles like data protection by design, security of processing, breach notification requirements, data protection impact assessments, and data protection officers.
2. Who are we
Mukund Sarma
Senior Director of Product Security, Chime
● Security Engineer turned
manager
● Chime ← Credit Karma ←
Synopsys
● There are no Security problems
- They are all engineering and
culture problems!
Feni Chawla
Senior Security Engineer, Chime
● Data engineer turned security
engineer
● Chime ← Rally Health ←
Microsoft ← Teradata
● Passionate about keeping data
safe and user information
private
3. <¡Spoiler Alert!> How Ethan Hunt Steals Data from CIA
Gets past the following controls to get to the database:
I. Retinal scan
II. Double key card
III. Thermal and pressure sensors
IV. Laser rays
🗣 Feni
4. <¡Spoiler Alert!> How Ethan Hunt Steals Data from CIA
When he gets to the database:
🗣 Feni
11. Agenda
● Defining Data Security
● Unraveling the Roles and Responsibilities within Data team
● Why working with Data teams is different for Security teams
● Practical challenges of running a Data Security program
● How we approached building a pragmatic Data Security program
● How does “STRIDE” look for Data Security
● Closing thoughts
● Questions
🗣 Feni
12. Things That Could Be Their Own Talks
(What’s not in scope for this talk)
● Privacy engineering
● AI and the implications of AI in engineering and security
● “Data Perimeters” and related concepts in the world of public cloud
🗣 Feni
13. Definitions (or rather a very brief outline)
● Data warehouse - a very large database that stores integrated data from multiple sources
● Snowflake - a popular, SaaS data warehouse
● ETL - a process of extracting data from various sources, transforming it into a format suitable for
analysis, and loading it somewhere, typically to a data warehouse
● Looker - a reporting & visualization tool that can connect to any database or data warehouse
● Data Lake - a large centralized repository that stores vast amounts of data, typically in their native format
🗣 Feni
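The ETL definition above can be illustrated with a minimal Python sketch (the source records, field names, and the in-memory "warehouse" are all invented for illustration, not any real pipeline):

```python
# Minimal ETL sketch: extract raw records, transform them into a
# consistent shape, and load them into a toy "warehouse" table.

def extract():
    # Extract: pull raw rows from two imaginary sources that use
    # inconsistent field names and value types.
    return [
        {"user": "alice", "amount_usd": "10.50"},
        {"username": "bob", "amount": 7.25},
    ]

def transform(rows):
    # Transform: normalize field names and types for analysis.
    out = []
    for row in rows:
        out.append({
            "user": row.get("user") or row.get("username"),
            "amount": float(row.get("amount_usd") or row.get("amount")),
        })
    return out

def load(rows, warehouse):
    # Load: append the cleaned rows to a warehouse "table".
    warehouse.setdefault("payments", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
```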
14. Defining Data Security
Data security is the practice of protecting data from unauthorized access, corruption, or theft throughout its lifecycle.
🗣 Mukund
16. What Makes Data Security Challenging
Data is often intangible
● It can mutate and be derived
● It can easily flow across boundaries
● No intrinsic constraints on data handling
Example support chat:
Agent: Hi, how can I help you?
Customer: I need help canceling my last order
Agent: Can you provide your order number?
Customer: Sure, my SSN is 123-45-6789
🗣 Feni
17. What Makes Data Security Challenging
Scope is wide, and growing
● Data is everywhere
● Ownership spans multiple teams
● Data team often has multiple functions & goals
🗣 Feni
18. What Makes Data Security Challenging
There aren’t many precedents one can learn from
● Traditional Security teams don’t understand the data domain
● More often it’s seen as a Compliance function
● Not enough “security”-focused tutorials / documentation
(Venn diagram: Data Security overlapping with Infra Security and App Security)
🗣 Mukund
20. Unraveling Roles and Responsibilities within Data Team
Functional Roles:
● Engineers - Processing, Data Platform, Ops & SRE
● Scientists - Modeling, ML & AI
● Analysts - Business Reporting
Responsibilities:
● Goal: Gather insights from datasets
● Datasets must be: Acquired, Transformed, Maintained
● Insights must be: Reliable, Timely, Easy to consume, Granular
🗣 Feni
21. Unraveling Roles and Responsibilities within Data Team
Using data to improve marketing results:
● Engineers - Build real-time view into performance of digital ads by demographic
● Scientists - Improve performance amongst the 21-25 year olds in metropolitans
● Analysts - Provide executive reporting on Q1 results
🗣 Feni
22. Unraveling Roles and Responsibilities within Data Team
Engineers - Build real-time view into performance of digital ads by demographics:
● Scrub, normalize, and ingest campaign data and proprietary data
● Data Lifecycle Management
● Indexing & Cataloging
● APIs & Schedulers
🗣 Feni
23. Unraveling Roles and Responsibilities within Data Team
Scientists - Improve performance amongst the 21-25 year olds in metropolitans:
● Testing & Sampling
● Modeling
● Analysis
● Data Lake
🗣 Feni
24. Unraveling Roles and Responsibilities within Data Team
Analysts - Provide executive reporting for Q1 results:
● Data Lake
● Curated Data
🗣 Feni
25. Bringing It All Together From a Tooling Perspective
Functional Roles: Engineers (Processing, Data Platform, Ops & SRE), Scientists (Modeling, ML & AI), Analysts (Business Reporting)
🗣 Feni
26. Working with data engineers is different for security teams
🗣 Mukund
27. Working with Data Engineering Teams is Different from Infra and Software Engineering Teams
🗣 Mukund
Raise your hands if your company has:
- A dedicated security onboarding program
- A security training program that covers OWASP Top 10 or similar for your engineers
- Built specific guardrails or tools for your developers
- A DevEx or DevRel team
28. Working with Data Engineering Teams is Different
from Infra and Software Engineering Teams
🗣 Mukund
Continue to keep your hands raised if:
- any of those were built keeping your data teams/data engineers in
mind
29. Working with Data Engineering Teams is Different from Infra and Software Engineering Teams
● Ownership: Software & infra teams own the software, products and infra systems they build and maintain; data teams typically do not own the data itself, just process it or manage the underlying platform
● Culture: Software & infra teams generally start with high specificity about what they want to build or design; data teams generally start with low specificity and need to explore the data to solidify requirements
● Testing: Software & infra teams rarely need access to prod data, except for specific troubleshooting purposes; data teams almost always need access to prod data for modeling and validation
🗣 Mukund
30. Tooling - Common Security Tools Don’t Apply
● Tools the team uses: Product engineering - most tools in use are mature and established, especially in production (e.g. Argo for K8s deployment, Rails, etc.); Data teams - no easy way to regulate tools operating on datasets
● Security Design Reviews: Product engineering - clear, established practice coupled with availability of frameworks like STRIDE; Data teams - no established processes & frameworks
● Vulnerability Detection: Product engineering - code scanners, SAST/DAST, pentests, bug bounty, etc.; Data teams - no way to detect problems in tools and datasets
● Asset Inventory: Product engineering - generally easy to itemize services, compute, infra, etc.; Data teams - no easy way to itemize datasets and models
🗣 Mukund
31. Working with Data Engineering Teams is Different from Infra and Software Engineering Teams
🗣 Mukund
We ought to come to terms with the fact that most companies and security teams have not treated their data scientists and data engineers as first-class citizens. This must change!
32. Practical challenges of running a data security program
IAM, Data Inventory (or lack thereof), SDLC, Culture
🗣 Feni
33. Practical Challenges of Running a Data Security Program
#1: IAM
● Most databases and platforms use their own RBAC
● Securing service accounts is complex
● Limiting user permissions hinders productivity
(Diagram: individual roles - Role 1, Role 2, Role 3, Role 4 - alongside a shared Service Role)
🗣 Feni
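The shared-role problem above can be sketched in a few lines of Python (the role names and grants here are hypothetical, not any real platform's RBAC):

```python
# Toy RBAC sketch: access is checked against a role, not a person.
# All role and table names below are invented for illustration.
ROLE_GRANTS = {
    "analyst": {"sales.summary"},                   # least-privilege human role
    "etl_service": {"sales.raw", "sales.summary"},  # shared service role
}

def can_read(role, table):
    # The check only knows the role, not which human or job assumed it -
    # anyone holding the "etl_service" credentials gets its full access,
    # which is why securing service accounts is complex.
    return table in ROLE_GRANTS.get(role, set())
```

This is the tension the slide describes: tightening the human role hinders productivity, while the broad service role is an attractive target.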
34. Practical Challenges of Running a Data Security Program
#2: Data Inventory & Data Discovery
● Finding sensitive data is not easy
● Hard to get classification right
🗣 Feni
35. Practical Challenges of Running a Data Security Program
#2: Data Inventory & Data Discovery
● Finding sensitive data is not easy
● Hard to get classification right
🗣 Feni
In case you were wondering, Waldo is here!
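As a toy illustration of what data discovery attempts, a naive scanner might flag columns whose values look like US SSNs; real classification is far harder than this hypothetical sketch suggests:

```python
import re

# Naive sensitive-data scanner: flags columns containing values that
# look like US SSNs (NNN-NN-NNNN). Real discovery must handle many
# formats, free text, and context, which is why it is "not easy".
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_ssn_columns(rows):
    # Return the set of column names whose string values match the pattern.
    flagged = set()
    for row in rows:
        for col, val in row.items():
            if isinstance(val, str) and SSN_RE.search(val):
                flagged.add(col)
    return flagged
```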
36. Practical Challenges of Running a Data Security Program
#3: Redefining your SDLC to include the Data team
● Existing frameworks and processes don’t work well
○ Lack of an OWASP Top 10 equivalent
○ What do design reviews look like for a researcher?
● Traditional security tooling is not built for identifying data security issues
● Lack of security training programs catered to data teams
○ Pragmatic data governance guidelines
○ Stop using catch-all words like “PII” - complement with the right tools
🗣 Mukund
37. Practical Challenges of Running a Data Security Program
#4: Culture
● Data team’s culture of exploration runs counter to security team’s culture of
enforcing least privilege
● Data teams aren’t used to Security being involved in their development lifecycle
● Security has focused a lot on Shift left and guardrails for our application and
infrastructure engineers to do things right. What about data?
● Is your team grounded in pragmatism?
🗣 Mukund
39. What Should One Do? The technical stuff…
(Approaches listed in order of increasing complexity)
Inventory:
● Data Discovery - Understand where all sensitive data is
● Role Discovery - Understand who has what access privileges
Access Control:
● Isolate Individual Services - Ensure all ETL jobs use unique credentials
● JIT Access Approvals - Extend JIT access controls to all data users
● Least Privileged Access - Require users to assume least privilege role
Segmentation:
● Data Segregation - Restrict data to specific locations based on risk
● Client Segmentation - Limit specific clients to specific data based on risk
🗣 Feni
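The JIT access approval row can be sketched as a time-boxed grant table (a hypothetical toy, not a real access-management API):

```python
import time

# Sketch of just-in-time (JIT) access: grants are time-boxed and expire
# rather than standing forever. Users, roles, and TTLs are illustrative.
GRANTS = {}  # (user, role) -> expiry timestamp

def grant(user, role, ttl_seconds, now=None):
    # Record an approved, time-limited grant for this user/role pair.
    now = time.time() if now is None else now
    GRANTS[(user, role)] = now + ttl_seconds

def has_access(user, role, now=None):
    # Access exists only while an unexpired grant is on record.
    now = time.time() if now is None else now
    expiry = GRANTS.get((user, role))
    return expiry is not None and now < expiry
```

The point of the sketch is the shape of the control: access defaults to "no", and every "yes" carries an expiry.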
43. What Should One Do? The process stuff…
(Approaches listed in order of increasing complexity)
Data Environment:
● Data Minimization - Only collect data that is absolutely needed
● Terraform Modules with Secure Defaults - Shift-left, emulate what worked in AppSec and Cloud Security
Tooling:
● IAM Helper Tooling - Provide tools that enable least privilege role selection
Documentation:
● Tooling Documentation - Ask vendors for best practices for using their tools securely
● Tutorials - Ask vendors to provide tutorials on security configurations
● Runbooks - Provide practical operational guidance through runbooks
🗣 Feni
45. What Should One Do? The process stuff…
(Approaches listed in order of increasing complexity)
Model Builder Environment:
● Data Minimization - Only collect the data absolutely needed for model building
● Terraform Modules with Secure Defaults - Shift-left, emulate what worked in AppSec and Cloud Security
Tooling:
● IAM Helper Tooling - Provide tools that enable least privilege role selection
Documentation:
● Tooling Documentation - Ask vendors for best practices for using their tools securely
● Tutorials - Ask vendors to provide tutorials on security configurations
● Runbooks - Provide practical operational guidance through runbooks
🗣 Feni
47. What Should One Do? The people stuff…
Approach:
● Invest in building bridges
● Collaboration > Security
● Transparency - one can’t fix the issues they can’t see
● Help do the job - builds empathy and confidence
● Understand that we as an industry have neglected data teams all along
● Plan in the open
● Teach them to fish - don’t serve them fish
🗣 Mukund
49. But why run collaborative threat models at all in the first place?
Data teams haven’t had to work with the Security teams on an ongoing basis
We’re all trying to figure out how we work - Having this as an activity opens up
opportunities for collaboration
🗣 Mukund
50. Collaborative Threat Modeling with STRIDE
Threats: Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privileges
● Goal:
○ Apply STRIDE-like model to data, in addition to apps, services and infrastructure
○ Strategically focus on threat patterns that are not covered in application or infra security (e.g., SQL injection is already accounted for in appsec)
🗣 Feni
51. Collaborative Threat Modeling with STRIDE
Threat: Spoofing
Applied to Data Security:
● Using service accounts to access data directly
Examples:
● Looker user runs query on Snowflake using Looker role
● Looker admin logs into Snowflake using Looker service account credentials
🗣 Feni
52. Collaborative Threat Modeling with STRIDE
Threat: Tampering
Applied to Data Security:
● DB admin deleting or modifying data for personal gain
● Application bug resulting in data being corrupted or deleted
● Creating false data for hijacking ML model
Examples:
● IT admin deleting or overwriting audit logs
● DBA at a bank modifies balance information
● Complex ETL job deletes certain records while loading them to destination
● DBA creates false data for training ML model that causes fraudulent transactions to be undetected
🗣 Feni
53. Collaborative Threat Modeling with STRIDE
Threat: Repudiation
Applied to Data Security:
● All activity managed and monitored using ROLES within databases, instead of users
Examples:
● User could successively assume different shared roles, performing operations piecemeal with each role, making repudiation very challenging
🗣 Feni
54. Collaborative Threat Modeling with STRIDE
Threat: Information Disclosure
Applied to Data Security:
● Sensitive data flows from production to analytics dashboard
● Sensitive production data used in dev for testing or troubleshooting
● Sensitive data copied to an accessible location using service account
Examples:
● ETL pipeline scrubs PII from SSN column in PG, but not from JSON fields, which results in SSN being present in dashboard
● Developers creating copy of sensitive data in development for testing
● Developer running ETL job to copy restricted sensitive data into an S3 bucket that they have access to
🗣 Feni
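The first example (PII scrubbed from the SSN column but left inside JSON fields) is a common gap; one mitigation is a scrubber that walks nested structures recursively instead of targeting one known column. A minimal sketch, with made-up record contents:

```python
import re

# Redact SSN-shaped strings anywhere in a nested JSON-like structure,
# not just in a known top-level column. Field names are illustrative.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scrub(value):
    if isinstance(value, str):
        return SSN_RE.sub("[REDACTED]", value)
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(v) for v in value]
    return value  # numbers, booleans, None pass through unchanged
```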
55. Collaborative Threat Modeling with STRIDE
Threat: Denial of Service
Applied to Data Security:
● Processing job blows up due to complex computation on large dataset
Examples:
● A single Spark job overwhelms AWS resources, resulting in cascading failures across other jobs
🗣 Feni
56. Collaborative Threat Modeling with STRIDE
Threat: Elevation of Privileges
Applied to Data Security:
● Admin creating backdoors on databases
● Using service accounts to access data
Examples:
● DBA or readwrite user GRANTing increased privileges to an alternate role
● Looker admin logs into Snowflake using Looker service account credentials, and can read data that they do not have access to directly
🗣 Feni
57. Putting It All Together…
● STRIDE is a valuable framework for security and data teams to work together to conceptualize and subsequently secure against threats
● There are other frameworks one could use, e.g.:
○ DREAD - developed by the company that is graciously hosting us today
○ PASTA - Italian food will not taste the same now that you have heard about this
○ VAST - why did the pond break up with the vast ocean? Because it was too shallow
🗣 Feni
58. Parting Thoughts
● Data tools are neither infra tools nor app tools - they need to be approached differently
● Concepts from infra, cloud, and app security help, but need to be properly thought through when applied to data
● Security is important to everyone, but often not at the cost of productivity
🗣 Feni
59. Parting Thoughts
● Start small - You will have to do a lot of heavy
lifting yourselves initially
● Share/include the data team in everything you do
● Build developer-focused tools that will help them do what they need to do securely
● Be pragmatic
● Kudos go a long way!
● Ask for help - Builds empathy
● Share what worked/didn’t work with the
industry - we’re all in this together!
🗣 Mukund
When Mission Impossible 1 came out in 1996, I saw the whole movie as a kid with my jaw stuck to the floor. And one of the most memorable scenes was how Tom Cruise, playing the role of Ethan Hunt, steals data from the CIA.
He is able to circumvent lots of security checks like retina scans, thermal sensors and laser rays,
then finally gets in front of the computer and enters stolen credentials,
gets access to the data, which is very neatly organized for him,
opens the file to look at the data in plain text,
and finally copies it to a floppy disk and runs away with it.
Now, almost 30 years later, the imaginary risk of Ethan Hunt has been replaced by hacker farms across the world
and instead of an IBM PC, the data for most organizations is sprawled across many different cloud services - like Snowflake, Databricks, S3 - accessible through sophisticated tools like Hex, Jupyter, Looker, etc.
And this list of services is increasing rapidly with the advent of AI.
All of this context brings us to the topic at hand - how do you stop Ethan Hunt, who has compromised all other security controls and is now at the doorstep of your data store, from stealing your most sensitive data?
Thinking of it from first principles
CIA Triad as applied to Data:
Confidentiality – prevent unauthorized access; keep sensitive data confidential
Integrity – prevent tampering, corruption and loss of data
Availability – ensure data is available to all authorized users as and when they need it
Must be applied to the entire data lifecycle
Must balance security with productivity and agility
Unlike apps, infra and services, it can be mutated. You can derive more data from what you already have on hand, or duplicate it across boundaries
Sensitive data also has a knack for showing up in unexpected places. For example if you transcribe and store customer conversations, all sorts of nasty sensitive things can show up. Identifying this sensitive data and securing it can be a real challenge
Also, data itself provides no intrinsic constraints on how it can be handled. Apps have APIs, infra has protocols, but data has nothing - you can do whatever you want with it, e.g. query it, dump it to your laptop, transfer it to portable disks, duplicate it in the cloud, bulk-read, trickle-read, etc
Every company has data sprawling everywhere from Emails to laptops, fileservers, engineering services, data products, etc
The ownership model is often fuzzy at best. For example - when party A and party B share data with a third party C, and C then joins and aggregates that data - whose data is it?
Also, data team itself is a constellation of multiple types of sub-teams whose roles span infra, services, apps, tools, SRE, etc
While the exact organizational structure will obviously differ, most data teams can be functionally categorized into three groups: data engineers, data scientists and analysts
Together these teams are responsible for gathering insights from datasets. This involves acquiring, processing and managing the data, and delivering on SLAs spanning reliability, time-to-insights, etc.
The data engineers are typically the ones responsible for maintaining the data platform, processing all the data in it, and ensuring its availability
Data scientists are the ones who analyze the data, building models and interpreting results to generate insights or better models
Finally, analysts are the folks who interface with the various business teams to help deliver on their reporting requirements
Let’s explain all this through a concrete example
Imagine there is a company that wants to better market itself to 21-25 year-olds.
A typical way to accomplish this would be:
1. Get a real-time view into how their ads are performing across various demographics and see what’s working and what’s not - this is what data engineers would help them with
2. Once they have a proper baseline, focus on improving performance within the target age group - which is what data scientists would help with
3. Ultimately, all these results need to be reported to the executive team and board - that’s what the analysts will help with
Let’s dive a bit deeper into what all this looks like
Let’s see how the data engineers get their real-time dashboards into ads performance by demographics
They will ingest campaign data from the various advertising platforms - Facebook, YouTube, Twitter, etc.
They will additionally combine data from internal systems - CRM, lead databases, production applications, etc
All these places store data in different formats and have their own errors and missing fields. Data engineers need to scrub that data, normalize across formats and ingest it into a central platform
This platform will be responsible for things like cataloging, data retention and purging, APIs and schedulers for all the various services, etc
They will then build dashboards on top of it using existing tools like Looker, Superset, etc.
As all this happens, all sorts of outages, errors, and infra issues will crop up, and the data ops team will have to keep everything running despite them
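The scrub-and-normalize step above can be sketched as follows. The source names, field names, and unit conversions here are assumptions for illustration; real ad-platform APIs each have their own schemas, and production pipelines would do this inside an ingestion framework rather than ad hoc functions.

```python
# Hypothetical sketch: mapping campaign records from two ad platforms,
# which use different field names and formats, into one common schema.
def normalize(record, source):
    if source == "facebook":
        return {"campaign": record["campaign_name"],
                "impressions": int(record["impr"]),
                "spend_usd": float(record["spend"])}
    if source == "youtube":
        # Assume this platform reports cost in micros (millionths of a dollar)
        return {"campaign": record["title"],
                "impressions": int(record["views"]),
                "spend_usd": float(record["cost_micros"]) / 1_000_000}
    raise ValueError(f"unknown source: {source}")

rows = [
    normalize({"campaign_name": "spring", "impr": "1200", "spend": "45.5"}, "facebook"),
    normalize({"title": "spring", "views": "800", "cost_micros": "12000000"}, "youtube"),
]
print(rows)  # both rows now share the same keys and units
```

Once every source lands in this one schema, the central platform can catalog it and the dashboards can query it uniformly.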
The next step is to figure out how to improve this performance
Data scientists do that by determining the factors that influence performance, running Python jobs in Jupyter. For example, what are the common factors across users skipping marketing videos or disliking them?
Then they map those factors to dataset features and run experiments to optimize them for future campaigns. They achieve this by creating models, training them, and analyzing them using tools like SageMaker.
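A first pass at the factor analysis described above might look like this. The events and the "age band" factor are hypothetical; the point is only the shape of the exploration, computing a per-group skip rate before deciding whether to promote the factor to a model feature.

```python
from collections import defaultdict

# Hypothetical event log: did a viewer in a given age band skip the ad?
events = [
    {"age_band": "21-25", "skipped": True},
    {"age_band": "21-25", "skipped": False},
    {"age_band": "26-30", "skipped": True},
    {"age_band": "21-25", "skipped": True},
]

totals, skips = defaultdict(int), defaultdict(int)
for e in events:
    totals[e["age_band"]] += 1
    skips[e["age_band"]] += e["skipped"]

# Skip rate per age band: a candidate feature for the model
skip_rate = {band: skips[band] / totals[band] for band in totals}
print(skip_rate)
```

Factors whose skip rates differ meaningfully across groups are the ones worth encoding as features and testing in controlled experiments.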
Finally, analysts analyze all the results that the rest of the team stored in the data lake and build executive dashboards with KPIs, e.g. revenue generated and predictions for the next 90 days. Analysts typically use spreadsheets, presentation tools, and visualization tools to do this.