MySQL talk at J and Beyond 2014 about nested data sets, how to use them most efficiently, and other performance tips such as string lookup and indexing options.
Normalization is a logical database design method that minimizes data redundancy and reduces design flaws. It involves applying normal forms like 1NF, 2NF, and 3NF to break large tables into smaller subsets. The normal forms improve data integrity by preventing anomalies like insertion, update, and deletion anomalies. Applying the normal forms can result in relations that are in first, second, and third normal form, but additional steps may be needed to attain Boyce-Codd normal form, which further reduces anomalies from overlapping candidate keys.
The document discusses building a real-time search engine for log data. It describes using Flume to collect streaming log data and write it to HDFS files. Fastcatsearch indexes the HDFS files in real-time by creating index segments, merging segments, and removing outdated segments to make data searchable in real-time. The system aims to provide fast indexing and querying of large and continuous log data streams like Splunk.
Rakuten Technology Conference 2017 A Distributed SQL Database For Data Analy...Rakuten Group, Inc.
Astra is a distributed SQL database for data analysis and prediction. We're aiming to achieve near real-time data analysis and to deliver the components of a Data Lake as a Service that contains it. Another feature of Astra is integration with machine learning to support many kinds of data analysis.
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1nyhaC6.
Nathan Marz discusses building NoSQL-based data systems that are scalable and easy to reason about. Filmed at qconlondon.com.
Nathan Marz is the creator of many open source projects which are relied upon by over 50 companies around the world, including Cascalog and Storm. Nathan is also working on a book for Manning publications entitled "Big Data: principles and best practices of scalable realtime data systems". Nathan was previously the lead engineer at BackType before being acquired by Twitter in 2011.
NoSQL Couchbase Lite & BigData HPCC SystemsFujio Turner
Mobile data is becoming the new source for data. Managing data in the mobile devices has become easier with NoSQL Couchbase Lite mobile database. Making sense, analyzing, scaling to exabytes has also become easier with LexisNexis Big Data platform HPCC Systems.
Sasi, cassandra on the full text search ride At Voxxed Day Belgrade 2016Duyhai Doan
The document discusses Apache Cassandra's SASI (SSTable Attached Secondary Index). It provides a 5 minute introduction to Cassandra, introduces SASI and how it follows the SSTable lifecycle, describes how SASI works at the cluster level for distributed queries and indexing, and details the local read/write process including data structures and query planning. Some benchmarks are shown for full table scans on a large dataset using SASI with Spark. The key advantages and use cases for SASI are discussed along with its limitations compared to dedicated search engines.
Simple Nested Sets and some other DB optimizationsEli Aschkenasy
The document discusses nested sets and database design principles for hierarchical data structures. It covers topics such as calculating the total number of nodes, determining if a node is a leaf node, optimizing queries for leaf nodes and subtrees, and finding the path to a node. Examples of SQL queries are provided to demonstrate naive and optimized implementations.
This document discusses using databases in Android apps. It provides an overview of SQLite, the database used for Android, which is a stripped-down version of SQL databases. It describes the basic components of databases, including tables, fields, records, keys and relationships. It also gives examples of how to design a database using data dictionaries, normalization, and data flow diagrams. The next steps involve using SQL to interact with and query the database in an Android app.
This document discusses the concepts of locality of reference and anti-locality of reference in computing. It begins by describing the physical architecture of processors, cores, and caches. It then discusses the logical architecture of processes, threads, and virtual memory spaces. It explains how data moves between the physical and logical layers, and how threads accessing common data can cause cache coherency penalties if not designed carefully. The document emphasizes that locality of reference, where data is accessed from the same cache repeatedly, improves performance, while anti-locality of reference, where data is accessed from different caches, can hurt performance due to cache misses and coherency issues. Careful design is needed to minimize anti-locality and its penalties.
This document discusses index tuning in Microsoft SQL Server. It provides an overview of index types including clustered and nonclustered indexes. It also discusses concepts like covering indexes, unique indexes, filtered indexes and query execution plans. The document aims to help users understand how to think about performance tuning from an index perspective and demystify common index tuning myths. It provides best practices for index tuning and maintaining performance in SQL Server.
This document provides troubleshooting tips for Ex Libris' Primo discovery and delivery system. It begins with general tips such as checking that changes have been saved and deployed correctly. It then addresses specific cases involving issues like records not appearing properly, incorrect labels, availability mismatches, and search/sort problems. The document emphasizes examining the PNX record format, understanding de-duplication and FRBR rules, and checking for potential problems in related systems like the ILS, SFX, or metadata. It concludes by noting the importance of timing with publishing pipes and system performance.
OAP: Optimized Analytics Package for Spark Platform with Daoyuan Wang and Yua...Databricks
Spark SQL is one of the most popular components in big data warehouse for SQL queries in batch mode, and it allows user to process data from various data sources in a highly efficient way. However, Spark SQL is a general purpose SQL engine and not well designed for ad hoc queries. Intel invented an Apache Spark data source plugin called Spinach for fulfilling such requirements, by leveraging user-customized indices and fine-grained data cache mechanisms.
To be more specific, Spinach defines a new Parquet-like data storage format, offering a fine-grained hierarchical cache mechanism in the unit of “Fiber” in memory. Even existing Parquet or ORC data files can be loaded using corresponding adaptors. Data can be cached in off-heap memory to boost data loading. What’s more, Spinach has extended the Spark SQL DDL, to allow users to define the customized indices based on relation. Currently, B+ tree and bloom filter are the first two types of indices supported. Last but not least, since Spinach resides in the process of Spark executor, there’s no extra effort in deployment. All you need to do is to pick Spinach from Spark packages when launching the Spark SQL.
Spinach has been imported in Baidu’s production environment since Q4 2016. It helps several teams migrating their regular data analysis tasks from Hive or MR jobs to ad-hoc queries. In Baidu search ads system FengChao, data engineers analyze advertising effectiveness based on several TBs data of display and click logs every day. Spinach brings a 5x boost compared to original Spark SQL (version 2.1), especially in the scenario of complex search and large data volume. It optimizes the average search cost from minutes to seconds, while brings only 3% data size increase for adding a single index.
Nearly every application uses some sort of data storage. Proper data structure can lead to increased performance, reduced application complexity, and ensure data integrity. Foreign keys, indexes, and correct data types truly are your best friends when you respect them and use them for the correct purposes. Structuring data to be normalized and with the correct data types can lead to significant performance increases. Learn how to structure your tables to achieve normalization, performance, and integrity, by building a database from the ground up during this tutorial.
This document provides an introduction to Elasticsearch, covering the basics, concepts, data structure, inverted index, REST API, bulk API, percolator, Java integration, and topics not covered. It discusses how Elasticsearch is a document-oriented search engine that allows indexing and searching of JSON documents without schemas. Documents are distributed across shards and replicas for horizontal scaling and high availability. The REST API and query DSL allow full-text search and filtering of documents.
Cassandra introduction apache con 2014 budapestDuyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
Scaling Machine Learning Feature Engineering in Apache Spark at FacebookDatabricks
Machine Learning feature engineering is one of the most critical workloads on Spark at Facebook and serves as a means of improving the quality of each of the prediction models we have in production. Over the last year, we’ve added several features in Spark core/SQL to add first class support for Feature Injection and Feature Reaping in Spark. Feature Injection is an important prerequisite to (offline) ML training where the base features are injected/aligned with new/experimental features, with the goal to improve model performance over time. From a query engine’s perspective, this can be thought of as a LEFT OUTER join between the base training table and the feature table which, if implemented naively, could get extremely expensive. As part of this work, we added native support for writing indexed/aligned tables in Spark, wherein IF the data in the base table and the injected feature can be aligned during writes, the join itself can be performed inexpensively.
The document discusses normalization in databases. It defines normalization as removing redundant data from tables to improve storage efficiency, data integrity, and scalability. The document outlines the various normal forms including 1NF, 2NF, 3NF, BCNF, and higher normal forms. It provides examples to illustrate different normal forms. Advantages of normalization include reduced database size and better performance, while disadvantages are more tables to join and codes instead of real data.
Everyone dreams of being ‘Web Scale’, but we start out small. We — most of us — don’t launch a service and expect it to serve millions of requests from Day 1. This means that we don’t think about the ways in which our stack will blow up when the number of requests does start climbing. This talk lists simple patterns and checks that Development and Operations teams should implement from Day 1 in order to ensure a robust distributed system.
What You Need To Know About The Top Database TrendsDell World
The last 5 years have seen transformative changes in both personal and enterprise technologies. Many of these changes have been driven by or are driving paradigm shifts in database technologies and information systems. These include trends such as engineered systems including Exadata, "Big Data" technologies such as Hadoop ,"NoSQL" databases, SSDs, in-memory and columnar technologies. In this presentation we’ll review these big trends and describe how they are changing the database landscape and influencing the career prospects for database professionals.
External Master Data in Alfresco: Integrating and Keeping Metadata Consistent...ITD Systems
Real life content is always tightly integrated with master data. Reference data to be used for the content is usually stored in a third-party enterprise system (or even several different systems) and should be consumed by Alfresco.
This document discusses how Splunk can be used to analyze organizational data and provide value to IT teams. Some key points made include:
- Splunk allows users to search and analyze large amounts of machine data in real-time to gain insights from audit trails, errors, user behavior, and other sources.
- It provides alerting capabilities to monitor for critical failures and configuration issues.
- As organizational data volumes grow, Splunk is well-suited to handle "big data" using its distributed, scalable architecture based on map-reduce techniques.
Normalization is the process of removing redundant data from your tables to improve storage efficiency, data integrity, and scalability.
Normalization generally involves splitting existing tables into multiple ones, which must be re-joined or linked each time a query is issued.
Why normalization?
The relation derived from the user view or data store will most likely be unnormalized.
The problem usually happens when an existing system uses unstructured files, e.g. in MS Excel.
The document discusses the design of an analytics database to aggregate data from multiple sources, move the aggregated data to frontend databases, and serve queries efficiently through portals. It addresses partitioning the backend warehouse for writes, replicating data to secondaries, moving data to partitioned frontend databases while serving queries, and optimizing queries and indexes at the frontend. Table partitioning is recommended for the frontend to allow efficient data insertion, removal and query serving while mitigating the impact of conflicting I/O operations.
MySQL 8 introduces support for ANSI SQL recursive queries with common table expressions, a powerful method for working with recursive data references. Until now, MySQL application developers have had to use workarounds for hierarchical data relationships. It's time to write SQL queries in a more standardized way, and be compatible with other brands of SQL implementations. But as always, the bottom line is: how does it perform? This presentation will briefly describe how to use recursive queries, and then test the performance and scalability of those queries against other solutions for hierarchical queries.
Getting Started with Test-Driven Development at Longhorn PHP 2023Scott Keck-Warren
Test-driven development (TDD) is a software development process where test cases are written before code to validate requirements. The TDD process involves short cycles of adding a test, making the test fail, writing code to pass the test, and refactoring code. Automated tests provide confidence to refactor and change code without breaking functionality. Unit tests isolate and test individual code units while feature tests simulate how a user interacts with the application. Code coverage metrics help ensure tests cover enough of the codebase, with higher coverage percentages generally indicating better test quality.
Getting Started with Test-Driven Development at Longhorn PHP 2023Scott Keck-Warren
Test-driven development (TDD) is a software development process where test cases are written before code to validate requirements. The TDD process involves short cycles of adding a test, making it fail, making it pass, and refactoring code. Using TDD generates an automated test suite that gives developers confidence to refactor and change code quickly. Unit tests validate individual code units in isolation while feature tests validate code as a user would interact with it. Code coverage metrics help ensure tests cover enough of the codebase.
Getting Started with Test-Driven Development at PHPtek 2023Scott Keck-Warren
Scott Keck-Warren gives a presentation on getting started with test-driven development (TDD). He discusses what TDD is, the five phases of the TDD process, and why it is beneficial. He also covers how to use a testing framework like PHPUnit, what code coverage is, and some common pitfalls to avoid like neglecting to run tests or creating tests that are too large or trivial. The presentation aims to provide developers with the essential information needed to understand and implement TDD.
Getting Started with Test-Driven Development at Midwest PHP 2021Scott Keck-Warren
In this presentation, we discussed what Test-Driven Development(TDD) is, how to get started with TDD, work through an example, and discuss how to get started in your application.
Developing a Culture of Quality Code (Midwest PHP 2020)Scott Keck-Warren
This document discusses developing a culture of quality code. It defines quality code as code that is purposeful, maintainable, reliable, efficient, secure and optimized for size. It recommends that individuals focus on techniques like writing clean code, using automated testing and code reviews. It also recommends teams implement processes like requiring testing, conducting code reviews and adopting coding standards. The goal is to improve code quality and maintainability over time by altering both individual and team practices.
18. Scott’s Rules For Database Design
1. Normalize Your Database For Data Deduplication
2. Use The Database Engine to Keep Data Clean
3. Proactively Add Indexes to Keep Queries Performant
20. Users Table
• Email address
• Password
• Active state
• Hire Date
• Listing of previous passwords
• Office Name
• Office City
• Office Zip
21. Users Table
• Email address (string)
• Password (string)
• Active state (string)
• Hire Date (string)
• Listing of previous passwords (string)
• Office Name (string)
• Office City (string)
• Office Zip (string)
24. Normalize Your Database For Data Deduplication
“[T]he process of structuring a relational database in accordance with a series of so-called normal forms in order to reduce data redundancy and improve data integrity.”
-“Database normalization” on Wikipedia
25. Normalize Your Database For Data Deduplication
• UNF: Unnormalized form
• 1NF: First normal form
• 2NF: Second normal form
• 3NF: Third normal form
• EKNF: Elementary key normal form
• BCNF: Boyce–Codd normal form
• 4NF: Fourth normal form
• ETNF: Essential tuple normal form
• 5NF: Fifth normal form
• DKNF: Domain-key normal form
• 6NF: Sixth normal form
26. Normalize Your Database For Data Deduplication
• UNF: Unnormalized form
• 1NF: First normal form
• 2NF: Second normal form
• 3NF: Third normal form
• EKNF: Elementary key normal form
• BCNF: Boyce–Codd normal form
• 4NF: Fourth normal form
• ETNF: Essential tuple normal form
• 5NF: Fifth normal form
• DKNF: Domain-key normal form
• 6NF: Sixth normal form
27. Normalize Your Database For Data Deduplication
• Boyce–Codd Normal Form:
• X should be a superkey for every functional dependency (FD) X -> Y in a given relation.
31. First Normal Form (1NF)
1. The table contains a unique identifier, also called the primary key, that is used to identify the row.
2. Each column contains atomic values (values that can not be broken down)
32. 1NF - users
email | password | active | hire_date | previous_password | office_name | office_phone | office_city | office_zip
alice@example.com | hash1 | 1 | 1/1/2024 | hash1, hash5, hash6 | Main Office | 555-555-5555 | Saginaw | 48609
avery@example.com | NULL | 1 | 8/11/2024 | hash2, hash7, Hash8 | main office | 5555555555 | Saginaw | 48609
scott@example.com | hash3 | 1 | May 11th, 23 | hash3 | Man office | (555)555-5555 | Saginaw | 48609
scott@example.com | hash4 | 1 | Tuesday | hash4 | Main | 555/555/5555 | Saginaw | 48609
33. 1NF - users
• A unique identifier should be:
• Auto-incrementing int
• UUID
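In MySQL, a minimal sketch of that step might look like the following, assuming the users table from the slides (the exact column definition is illustrative; a CHAR(36) UUID column filled by the application is the other common option):
mysql>
alter table users
-- add a surrogate key so every row can be uniquely identified
add column id int unsigned not null auto_increment primary key first;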
34. 1NF - users
id | email | password | active | hire_date | previous_password | office_name | office_phone | office_city | office_zip
1 | alice@example.com | hash1 | 1 | 1/1/2024 | hash1, hash5, hash6 | Main Office | 555-555-5555 | Saginaw | 48609
2 | avery@example.com | NULL | 1 | 8/11/2024 | hash2, hash7, Hash8 | main office | 5555555555 | Saginaw | 48609
3 | scott@example.com | hash3 | 1 | May 11th, 23 | hash3 | Man office | (555)555-5555 | Saginaw | 48609
4 | scott@example.com | hash4 | 1 | Tuesday | hash4 | Main | 555/555/5555 | Saginaw | 48609
35. 1NF - users
id | email | password | active | hire_date | previous_password | office_name | office_phone | office_city | office_zip
1 | alice@example.com | hash1 | 1 | 1/1/2024 | hash1, hash5, hash6 | Main Office | 555-555-5555 | Saginaw | 48609
2 | avery@example.com | NULL | 1 | 8/11/2024 | hash2, hash7, hash8 | main office | 5555555555 | Saginaw | 48609
3 | scott@example.com | hash3 | 1 | May 11th, 23 | hash3 | Man office | (555)555-5555 | Saginaw | 48609
4 | scott@example.com | hash4 | 1 | Tuesday | hash4 | Main | 555/555/5555 | Saginaw | 48609
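The multi-valued previous_password column is what still breaks atomicity, and per the speaker notes the fix is another table. A sketch of what that child table could look like; the table and column names here are assumptions, not taken from the deck:
mysql>
create table previous_passwords (
  -- one row per old password, keyed back to the owning user
  id int unsigned not null auto_increment primary key,
  user_id int unsigned not null,
  password_hash varchar(255) not null
);
Each old multi-value then becomes its own row, linked by user_id.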
42. Second Normal Form (2NF)
1. Is already in 1NF
2. All the non-key columns are dependent on the primary key of the table
43. Second Normal Form (2NF)
id | email | password | active | hire_date | office_name | office_phone | office_city | office_zip
1 | alice@example.com | hash1 | 1 | 1/1/2024 | Main Office | 555-555-5555 | Saginaw | 48609
2 | avery@example.com | NULL | 1 | 8/11/2024 | main office | 5555555555 | Saginaw | 48609
3 | scott@example.com | hash3 | 1 | May 11th, 23 | Man office | (555)555-5555 | Saginaw | 48609
4 | scott@example.com | hash4 | 1 | Tuesday | Main | 555/555/5555 | Saginaw | 48609
44. 2nd - offices
id | name | phone | city | zip
1 | Main Office | 555-555-5555 | Saginaw | 48609
2 | main office | 5555555555 | Saginaw | 48609
3 | Man office | (555)555-5555 | Saginaw | 48609
4 | Main | 555/555/5555 | Saginaw | 48609
45. 2NF - users
id | email | password | active | hire_date | office_name | office_phone | office_city | office_zip
1 | alice@example.com | hash1 | 1 | 1/1/2024 | Main Office | 555-555-5555 | Saginaw | 48609
2 | avery@example.com | NULL | 1 | 8/11/2024 | main office | 5555555555 | Saginaw | 48609
3 | scott@example.com | hash3 | 1 | May 11th, 23 | Man office | (555)555-5555 | Saginaw | 48609
4 | scott@example.com | hash4 | 1 | Tuesday | Main | 555/555/5555 | Saginaw | 48609
46. 2NF - users
id | email | password | active | hire_date | office_name | office_phone | office_city | office_zip | office_id
1 | alice@example.com | hash1 | 1 | 1/1/2024 | Main Office | 555-555-5555 | Saginaw | 48609 | 1
2 | avery@example.com | NULL | 1 | 8/11/2024 | main office | 5555555555 | Saginaw | 48609 | 2
3 | scott@example.com | hash3 | 1 | May 11th, 23 | Man office | (555)555-5555 | Saginaw | 48609 | 3
4 | scott@example.com | hash4 | 1 | Tuesday | Main | 555/555/5555 | Saginaw | 48609 | 4
47. 2NF - users
id | email | password | active | hire_date | office_id
1 | alice@example.com | hash1 | 1 | 1/1/2024 | 1
2 | avery@example.com | NULL | 1 | 8/11/2024 | 2
3 | scott@example.com | hash3 | 1 | May 11th, 23 | 3
4 | scott@example.com | hash4 | 1 | Tuesday | 4
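A hedged sketch of the DDL behind this 2NF split, using the table and column names from the slides (the column types are guesses):
mysql>
create table offices (
  id int unsigned not null auto_increment primary key,
  name varchar(100) not null,
  phone varchar(20) not null,
  city varchar(100) not null,
  zip varchar(10) not null
);
mysql>
alter table users
  -- point each user at an offices row, then drop the duplicated office columns
  add column office_id int unsigned not null,
  drop column office_name,
  drop column office_phone,
  drop column office_city,
  drop column office_zip;
In practice office_id would be backfilled from the old office columns before those columns are dropped.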
49. Third Normal Form (3NF)
1. Is already in 2NF
2. It contains columns that are non-transitively dependent on the primary key
50. 3NF - offices
id | name | phone | city | zip
1 | Main Office | 555-555-5555 | Saginaw | 48609
2 | main office | 5555555555 | Saginaw | 48609
3 | Man office | (555)555-5555 | Saginaw | 48609
4 | Main | 555/555/5555 | Saginaw | 48609
51. 3NF - zips
id | city
48609 | Saginaw
48640 | Midland
48642 | Midland
48901 | Lansing
52. 3NF - zips
id | city | state
48609 | Saginaw | MI
48640 | Midland | MI
48642 | Midland | MI
48901 | Lansing | MI
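A sketch of the same move in SQL, assuming the zip code itself serves as the key of the new table, as the slides imply (types are illustrative):
mysql>
create table zips (
  id varchar(10) not null primary key,
  city varchar(100) not null,
  state char(2) not null
);
mysql>
alter table offices
  -- the old zip column now holds a reference into zips
  change column zip zip_id varchar(10) not null;
Once zips exists, the duplicated city value on offices becomes redundant and could eventually be dropped as well.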
66. Use Correct Column Types
id | email | password | active | hire_date | office_id
1 | alice@example.com | hash1 | 1 | 1/1/2024 | 1
2 | avery@example.com | NULL | 1 | 8/11/2024 | 2
3 | scott@example.com | hash3 | 1 | May 11th, 23 | 3
4 | scott@example.com | hash4 | 1 | Tuesday | 4
5 |  | Hash12 | 2 | 2024-04-01 | 1000
67. Use Correct Column Types
• Numeric: INT, TINYINT, BIGINT, FLOAT, REAL, etc.
• Date/Time: DATE, TIME, DATETIME, etc.
• String: CHAR, VARCHAR, TEXT, etc.
• Binary data types such as: BLOB, etc.
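A sketch of tightening those types on the users table (the definitions are assumed, and values like "Tuesday" and "May 11th, 23" would have to be cleaned up before the conversion will succeed):
mysql>
alter table users
  -- store real dates and a small integer flag instead of free-form strings
  modify column hire_date date,
  modify column active tinyint(1);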
75. Use NOT NULL for Required Fields
mysql>
insert into users
(password)
values
("just a password?");
ERROR 1364 (HY000): Field 'email' doesn't have a default value
76. Use NOT NULL for Required Fields
mysql>
insert into users
(password)
values
("just a password?");
ERROR 1364 (HY000): Field 'email' doesn't have a default value
77. Use NOT NULL for Required Fields
mysql>
insert into users
(email, password)
values
("s@s", "just a password?");
ERROR 1364 (HY000): Field 'active' doesn't have a default value
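Those errors appear because the columns are declared NOT NULL without a default. A sketch of how such constraints could be declared, with assumed column definitions:
mysql>
alter table users
  -- required fields: the engine now rejects rows that omit them
  modify column email varchar(255) not null,
  modify column active tinyint(1) not null;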
85. Use Foreign Keys For References To Other Tables
id | name | phone | city | zip_id
1 | Main Office | 555-555-5555 | Saginaw | 48609
2 | main office | 5555555555 | Saginaw | 48609
3 | Man office | (555)555-5555 | Saginaw | 48609
4 | Main | 555/555/5555 | Saginaw | 48609
86. Use Foreign Keys For References To Other Tables
id | name | phone | city | zip_id
1 | Main Office | 555-555-5555 | Saginaw | 48609
2 | main office | 5555555555 | Saginaw | 48609
3 | Man office | (555)555-5555 | Saginaw | 48609
87. Use Foreign Keys For References To Other Tables
id | name | phone | city | zip_id
1 | Main Office | 555-555-5555 | Saginaw | 48609
2 | main office | 5555555555 | Saginaw | 48609
88. Use Foreign Keys For References To Other Tables
id | name | phone | city | zip_id
1 | Main Office | 555-555-5555 | Saginaw | 48609
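A sketch of the constraint that would have prevented those orphaned references, assuming the users.office_id and offices.id columns shown earlier:
mysql>
alter table users
  add constraint fk_users_office
  foreign key (office_id) references offices (id);
-- MySQL refuses the ALTER while orphaned office_id values exist, and afterwards
-- refuses to delete an office that users still reference (unless an ON DELETE
-- rule such as CASCADE or SET NULL is chosen).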
143. What You Need to Know
1. Normalize Your Database For Data Deduplication
2. Use The Database Engine to Keep Data Clean
3. Proactively Add Indexes to Keep Queries Performant
144. What You Need to Know
1. The table contains a unique identifier, also called the primary key, that is used to identify the row.
2. Each column contains atomic values (values that can not be broken down)
3. All the non-key columns are dependent on the primary key of the table
4. It contains columns that are non-transitively dependent on the primary key
145. What You Need to Know
• Make the DB Work With You
• Correct Column Types
• NOT NULL for Required Fields
• UNIQUE for Unique Values
• Foreign Keys For References To Other Tables
• Triggers For Complex Requirements
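Two short sketches for the last two bullets. The unique key mirrors the email example from the talk; the trigger name and rule are purely illustrative, showing the kind of cross-column requirement a plain column constraint cannot express:
mysql>
alter table users
  add unique key uq_users_email (email);
mysql>
delimiter //
create trigger users_require_hire_date
before insert on users
for each row
begin
  -- illustrative rule: an active user must have a hire_date
  if new.active = 1 and new.hire_date is null then
    signal sqlstate '45000' set message_text = 'Active users need a hire_date';
  end if;
end //
delimiter ;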
146. What You Need to Know
• Use indexes on commonly searched columns
• Start simple
• See recorded talks about how to add
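For example, if email and active are the columns commonly searched together, a composite index is a minimal sketch of the idea, and EXPLAIN shows whether the optimizer actually uses it:
mysql>
alter table users
  add index idx_users_email_active (email, active);
mysql>
explain select * from users where email = 'alice@example.com' and active = 1;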
Ask people for photos
Good morning all!
2 Shocking facts
Anyone else in this boat
Like to think I’m good at working with DBs
Know that there’s always something to learn
Early in my journey: just threw data into it and it spit it back
Might have been magic
Core piece of technology that I don’t understand
<slide>
Didn’t have a more senior level developer who could mentor
So I had to figure it out
Not necessarily a bad thing because that’s how I work best
Push a new feature
users are initially happy
But as usage grows we start finding problems
Angry people
Customers
My boss
Not ideal
Results in me fixing things under distress
Night/weekends
Once in a bathroom at a holiday inn
My goal is to have you learn from my trauma
For those of you who haven’t met me my name is …
Professional PHP Developer for 16 years
// team lead/CTO role for 11 of those 16
Currently Director of Technology at WeCare Connect
Use PHP and mysql for our backend
Also …
That being said My goal for today give you <slide>
These are the rules I give new hires so they can understand our team’s design
So we have to figure it out ourselves
All of these rules exist to prevent bugs or performance problems
Like examples
Today’s example is <slide> from a project
Initial version of this database as it existed when we took over project
Track everything using a string
Only going to talk about the first four forms today as the others are hard to understand and demo
1 and 2 give us a huge bang for our buck, and we start seeing diminishing returns around 3
A table doesn’t meet any of the conditions of normalization
Essentially a spreadsheet
The table contains a unique identifier, also called the primary key, that is used to identify the row.
Make it an auto-incrementing primary key so the database knows how to handle it
Each column contains atomic values (values that can not be broken down)
To solve this we need to create another table
A lot of normalization is fixed with more tables
Now in 1NF
Still a lot of duplication and mismatched data
Review users table
Three sections
Primary key
Second section - all related to that
Third section - not related
First X columns are dependent
Fix? It’s a new table
Create offices
Link our offices table to the users table
Drop all the office columns
2nf
When columns are transitively dependent, one column's data relies on another column through a third column. For example, our offices' city column is dependent on the zip column, which is dependent on the office's id.
To fix this we'll split out the zip in a new table.
As many validation rules as possible
<slide has a bunch>
Not going to prevent lazy me
Right to the DB
This is just hiding future bugs; we want to prevent that
<slide>
Not the other way around
Let’s start with one of the most basic constraints
Looking back at our users table
Still issues: Blank emails, Date problem, Duplicate emails, Deleted Sites Problem
We can and should enforce rules at application level but …
Next thing: weird dates
Want dates in the “correct” format
Right now if someone asks for all the employees hired in 2023 getting that information will be a challenge
Especially the person who starts on Tuesday
List of all the types in mysql
SQL has a ton of types to best fit our needs
Switch this column to a date
Reformat a little and we get consistent values
Now easy to find everyone who started in 2023
Might have marked the field required in the app, but
Show insert missing email
Embrace NOT NULL for required columns
Example insert 2 users with same email and password
Embrace unique constraints
Allows us to specify this column is unique
Good for things we never ever want to see two of; email is the best option
Can specify multiple columns for uniqueness
Example: multi-tenant database could support email address uniqueness per office
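A sketch of that per-office variant, as an alternative to a single unique key on email (the key name is illustrative):
mysql>
alter table users
  -- the same email may appear once per office, but never twice within one office
  add unique key uq_users_office_email (office_id, email);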
Gave users access to clean up offices
So they started deleting the duplicates
Deleted locations so the values don’t match
This table is using a join which is breaking the results
Users are assigned to locations that no longer exist
Users that belong to non-existent offices
Need some way to say what’s valid
Allow us to define the relationship of one column to another table
Performance
Not enough that I ever worry about, but each FK requires lookups
“Magic” according to some developers
Active column can accept any integer value
I also like this for complex requirement that a standard column doesn’t cover
Ex: if a row is one type, different fields must not be null
Indexes In Life
I love to cook
Love to try new recipes
Leftover food from recipe
Now get a neural network to figure out
But could use cookbooks
Option 1
Go through every page looking for matches
Slow as most don’t meet our criteria
Option 2
Go to back of book to the index and look up ingredients
Use that to look up recipes
Much faster
Same
Database is going to look at every row
Fine when you have 100 users
Slow when you have 10 million
We’re going to use indexes to tell the database common things we’re going to query on
<click>
For example, I’m going to search commonly on email and active so that’s a prime candidate
All of these rules exist to prevent bugs or performance problems