The document provides an overview and agenda for a presentation on the BlackRay database. It summarizes BlackRay's history and capabilities, positions it relative to other projects, and introduces the team. The presentation covers BlackRay's architecture, APIs, management features, clustering support, and the roadmap for future improvements.
This presentation was first held at the OpenSQL Camp 2009, part of the FrOSCon conference in St. Augustin, Germany. It gives a nice overview of the project, its technology, and how it will progress. Find more information at http://www.blackray.org
Experiences with Evangelizing Java Within the Database (Marcelo Ochoa)
The document discusses experiences with evangelizing the use of Java within Oracle databases. It provides a timeline of Java support in Oracle databases from 8i to 12c. It describes developing, testing, and deploying database-resident Java applications. Examples discussed include a content management system and RESTful web services implemented as stored procedures, as well as the Scotas OLS product for embedded Solr search. The conclusion covers challenges with open source projects, impedance mismatch between databases and Java, and lack of overlap between skillsets.
1) Spark 1.0 was released in 2014 as the first production-ready version containing Spark batch, streaming, Shark, and machine learning libraries.
2) By 2014, most big data processing used higher-level tools like Hive and Pig on structured data rather than the original MapReduce assumption of only unstructured data.
3) Spark evolved to support structured data through the DataFrame API in versions 1.2-1.3, providing a unified way to read from structured sources.
Interactive Data Analysis in Spark Streaming (datamantra)
This document discusses strategies for building interactive streaming applications in Spark Streaming. It describes using Zookeeper as a dynamic configuration source to allow modifying a Spark Streaming application's behavior at runtime. The key points are:
- Zookeeper can be used to track configuration changes and trigger Spark Streaming context restarts through its watch mechanism and Curator library.
- This allows building interactive applications that can adapt to configuration updates without needing to restart the whole streaming job.
- Examples are provided of using Curator caches like node and path caches to monitor Zookeeper for changes and restart Spark Streaming contexts in response.
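A minimal sketch of that restart loop, assuming a hypothetical config znode path (/app/config) and an illustrative batch interval; the Curator and Spark Streaming calls are real APIs, but the streaming pipeline itself is elided:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.framework.recipes.cache.{NodeCache, NodeCacheListener}
import org.apache.curator.retry.ExponentialBackoffRetry
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DynamicConfigApp {
  @volatile private var ssc: StreamingContext = _

  // Rebuild the streaming pipeline from the current configuration string.
  def startContext(sc: SparkContext, config: String): Unit = {
    ssc = new StreamingContext(sc, Seconds(10))
    // ... define the DStream pipeline according to `config` ...
    ssc.start()
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dynamic-config-demo"))
    val client = CuratorFrameworkFactory.newClient(
      "localhost:2181", new ExponentialBackoffRetry(1000, 3))
    client.start()

    // NodeCache keeps a local copy of the znode and fires on every change.
    val cache = new NodeCache(client, "/app/config") // hypothetical config path
    cache.getListenable.addListener(new NodeCacheListener {
      override def nodeChanged(): Unit = {
        val newConfig = new String(cache.getCurrentData.getData, "UTF-8")
        // Stop only the streaming context; keep the shared SparkContext alive.
        if (ssc != null) ssc.stop(stopSparkContext = false, stopGracefully = true)
        startContext(sc, newConfig)
      }
    })
    cache.start()

    startContext(sc, "initial")
    ssc.awaitTermination()
  }
}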
Building end to end streaming application on Spark (datamantra)
This document discusses building a real-time streaming application on Spark to analyze sensor data. It describes collecting data from servers through Flume into Kafka and processing it using Spark Streaming to generate analytics stored in Cassandra. The stages involve using files, then Kafka, and finally Cassandra for input/output. Testing streaming applications and redesigning for testability is also covered.
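A condensed sketch of the Kafka-to-Cassandra stage of such a pipeline, using the spark-streaming-kafka-0-10 and spark-cassandra-connector APIs; the topic, keyspace, table, and CSV event format are assumptions for illustration, not from the talk:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import com.datastax.spark.connector.streaming._ // adds saveToCassandra on DStreams

object SensorPipeline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sensor-analytics")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val ssc = new StreamingContext(conf, Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "sensor-analytics")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("sensor-events"), kafkaParams))

    // Count events per sensor id in each batch and persist the counts to the
    // (assumed) table analytics.sensor_counts(sensor_id text, count bigint).
    stream.map(record => (record.value.split(",")(0), 1L))
      .reduceByKey(_ + _)
      .saveToCassandra("analytics", "sensor_counts")

    ssc.start()
    ssc.awaitTermination()
  }
}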
This document provides an introduction to Apache Flink and its streaming capabilities. It discusses stream processing as an abstraction and how Flink uses streams as a core abstraction. It compares Flink streaming to Spark streaming and outlines some of Flink's streaming features like windows, triggers, and how it handles event time.
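As a taste of those windowing features, a classic keyed tumbling-window count in Flink's Scala API; the socket source and window size are illustrative, and the snippet uses processing time, whereas the talk also covers event time and triggers:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env.socketTextStream("localhost", 9999) // toy source for illustration
      .flatMap(_.toLowerCase.split("\\W+"))
      .map(w => (w, 1))
      .keyBy(_._1)                          // partition the stream by word
      .timeWindow(Time.seconds(10))         // 10-second tumbling window
      .sum(1)                               // per-word count within each window
      .print()

    env.execute("windowed word count")
  }
}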
This document introduces Infinispan, an open source distributed in-memory data grid. It provides high availability and elasticity through data replication across nodes. The API allows developers to treat Infinispan as a distributed concurrent hash map. Key features include transactions, persistence, querying, and map/reduce capabilities. Infinispan can be used as a local cache, cluster of caches, or autonomous data store accessed via protocols like REST and Memcached. Future plans include improvements to transactions, non-blocking state transfers, and support for cross-datacenter replication and eventual consistency.
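A minimal embedded-mode sketch of that "distributed concurrent hash map" view, using Infinispan's Java API from Scala; the cache name and entries are illustrative, and in a cluster the same calls replicate or distribute entries transparently:

import org.infinispan.configuration.cache.ConfigurationBuilder
import org.infinispan.manager.DefaultCacheManager

object InfinispanDemo {
  def main(args: Array[String]): Unit = {
    val manager = new DefaultCacheManager()
    manager.defineConfiguration("users", new ConfigurationBuilder().build())

    // The cache behaves like a ConcurrentMap; clustering changes where the
    // entries live, not how the code reads.
    val cache = manager.getCache[String, String]("users")
    cache.put("42", "Ada Lovelace")
    println(cache.get("42"))

    manager.stop()
  }
}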
This document outlines the configuration and management of Postgres-XC clusters using the pgxc_ctl tool. It provides details on:
- Configuring Postgres-XC components like GTM, coordinators, and datanodes in a configuration file for pgxc_ctl.
- Using pgxc_ctl commands to initialize, start, stop, failover, and dynamically add or remove nodes from the cluster.
- Key Postgres-XC components that must be configured including the GTM master, coordinator masters and slaves, and datanode masters.
- Configuration parameters that can be specified for each component like the server, port, directories and extra configuration files.
This document summarizes a workshop on migrating from Oracle to PostgreSQL. It discusses migrating the database, including getting Oracle and PostgreSQL instances, understanding how applications interact with databases, and using the ora2pg tool to migrate the database schema and data from Oracle to PostgreSQL.
State management in Structured Streaming (datamantra)
This document discusses state management in Apache Spark Structured Streaming. It begins by introducing Structured Streaming and differentiating between stateless and stateful stream processing. It then explains the need for state stores to manage intermediate data in stateful processing. It describes how state was managed inefficiently in old Spark Streaming using RDDs and snapshots, and how Structured Streaming improved on this with its decoupled, asynchronous, and incremental state persistence approach. The document outlines Apache Spark's implementation of storing state to HDFS and the involved code entities. It closes by discussing potential issues with this approach and how embedded stores like RocksDB may help address them in production stream processing systems.
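A compact sketch of the kind of stateful query that exercises that state store: a per-key running count kept with mapGroupsWithState. The socket source and counting logic are illustrative, not from the talk:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object StatefulCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stateful-counts").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost").option("port", 9999)
      .load()
      .as[String]

    // Running count per key; the counts live in the state store between batches.
    def updateCount(key: String, values: Iterator[String],
                    state: GroupState[Long]): (String, Long) = {
      val newCount = state.getOption.getOrElse(0L) + values.size
      state.update(newCount) // persisted by the (HDFS-backed by default) state store
      (key, newCount)
    }

    lines.groupByKey(identity)
      .mapGroupsWithState(GroupStateTimeout.NoTimeout)(updateCount _)
      .writeStream
      .outputMode(OutputMode.Update())
      .format("console")
      .start()
      .awaitTermination()
  }
}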
This document provides a summary of Oracle OpenWorld 2014 discussions on database cloud, in-memory database, native JSON support, big data, and Internet of Things (IoT) technologies. Key points include:
- Database Cloud on Oracle offers pay-as-you-go pricing and self-service provisioning similar to on-premise databases.
- Oracle Database 12c includes an in-memory option that can provide up to 100x faster analytics queries and 2-4x faster transaction processing.
- Native JSON support in 12c allows storing and querying JSON documents within the database.
- Big data technologies like Oracle Big Data SQL and Oracle Big Data Discovery help analyze large and diverse data sets from sources like
1. The document discusses the process of productionalizing a financial analytics application built on Spark over multiple iterations. It started with data scientists using Python and data engineers porting code to Scala RDDs. They then moved to using DataFrames and deployed on EMR.
2. Issues with code quality and testing led to adding ScalaTest, PR reviews, and daily Jenkins builds. Architectural challenges were addressed by moving to Databricks Cloud which provided notebooks, jobs, and throwaway clusters.
3. Future work includes using Spark SQL windows and Dataset API for stronger typing and schema support. The iterations improved the code, testing, deployment, and use of latest Spark features.
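For point 3, a small illustration of the two features in question: a Spark SQL window function over a typed Dataset. The Trade schema and values are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

case class Trade(account: String, ts: Long, amount: Double)

object WindowedAnalytics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("windowed-analytics").getOrCreate()
    import spark.implicits._

    // Dataset[Trade]: field names and types are checked at compile time.
    val trades = Seq(
      Trade("a1", 1L, 100.0), Trade("a1", 2L, 250.0), Trade("b7", 1L, 80.0)
    ).toDS()

    // Running total per account, ordered by timestamp.
    val w = Window.partitionBy($"account").orderBy($"ts")
    trades.withColumn("running_total", sum($"amount").over(w)).show()
  }
}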
Postgres Vision 2018: WAL: Everything You Want to Know (EDB)
The document is a presentation about PostgreSQL's Write-Ahead Log (WAL) system. It discusses what the WAL is, how it works, and how it is used for tasks like replication, backup and point-in-time recovery. The WAL logs all transactions to prevent data loss during crashes and ensures data integrity. It is critical for high availability and disaster recovery capabilities in PostgreSQL.
PelletServer wraps a range of semantic technologies -- query, reasoning, machine learning, planning, and constraint solving -- in a RESTful interface and sensible set of defaults & conventions. Even the wily shell programmer can build semweb apps with wget!
Anatomy of Data Frame API: A deep dive into Spark Data Frame API (datamantra)
In this presentation, we discuss the internals of the Spark DataFrame API. All the code discussed in this presentation is available at https://github.com/phatak-dev/anatomy_of_spark_dataframe_api
This document provides an introduction to Structured Streaming in Apache Spark. It discusses the evolution of stream processing, drawbacks of the DStream API, and advantages of Structured Streaming. Key points include: Structured Streaming models streams as infinite tables/datasets, allowing stream transformations to be expressed using SQL and Dataset APIs; it supports features like event time processing, state management, and checkpointing for fault tolerance; and it allows stream processing to be combined more easily with batch processing using the common Dataset abstraction. The document also provides examples of reading from and writing to different streaming sources and sinks using Structured Streaming.
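In the spirit of those examples, a minimal source-to-sink pipeline; the broker address, topic, and checkpoint path are placeholders:

import org.apache.spark.sql.SparkSession

object KafkaToConsole {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("kafka-to-console").getOrCreate()

    // A stream is just an unbounded DataFrame: read from a Kafka source...
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // ...transform with ordinary DataFrame operations, then write to a sink.
    df.selectExpr("CAST(value AS STRING) AS payload")
      .writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-console")
      .start()
      .awaitTermination()
  }
}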
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex (Apache Apex)
Stream processing applications built on Apache Apex run on Hadoop clusters and typically power analytics use cases where availability, flexible scaling, high throughput, low latency and correctness are essential. These applications consume data from a variety of sources, including streaming sources like Apache Kafka, Kinesis or JMS, file based sources or databases. Processing results often need to be stored in external systems (sinks) for downstream consumers (pub-sub messaging, real-time visualization, Hive and other SQL databases etc.). Apex has the Malhar library with a wide range of connectors and other operators that are readily available to build applications. We will cover key characteristics like partitioning and processing guarantees, generic building blocks for new operators (write-ahead-log, incremental state saving, windowing etc.) and APIs for application specification.
Kendall Clark, CEO of Clark & Parsia, LLC, presented an overview of their new RDF database called Stardog. Key points include that Stardog is fast, lightweight, supports rich APIs, logical and statistical inference, and full-text search. It aims to be the fastest RDF database and supports OWL 2 reasoning and SPARQL queries. Stardog is currently in alpha testing and plans to launch a private beta in early April ahead of its 1.0 release in mid-summer.
From Batch to Streaming with Apache Apex, Dataworks Summit 2017 (Apache Apex)
This document discusses transitioning from batch to streaming data processing using Apache Apex. It provides an overview of Apex and how it can be used to build real-time streaming applications. Examples are given of how to build an application that processes Twitter data streams and visualizes results. The document also outlines Apex's capabilities for scalable stream processing, queryable state, and its growing library of connectors and transformations.
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platform and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc, where he built products in core networking space and was granted patents in peer-to-peer VPNs.
Introduction to Structured Data Processing with Spark SQL (datamantra)
An introduction to structured data processing using the Data Source and DataFrame APIs of Spark. Presented at the Bangalore Apache Spark Meetup by Madhukara Phatak on 31/05/2015.
This document discusses exploratory data analysis (EDA) techniques that can be performed on large datasets using Spark and notebooks. It covers generating a five number summary, detecting outliers, creating histograms, and visualizing EDA results. EDA is an interactive process for understanding data distributions and relationships before modeling. Spark enables interactive EDA on large datasets using notebooks for visualizations and Pandas for local analysis.
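A sketch of those EDA steps on a Spark DataFrame, using approxQuantile for the five number summary, the usual 1.5 x IQR rule for outliers, and the RDD histogram helper; the toy column and data are illustrative:

import org.apache.spark.sql.SparkSession

object EdaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("eda").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 2.5, 3.0, 4.5, 9.0, 42.0).toDF("value")

    // Five number summary: min, Q1, median, Q3, max (relative error 0.0 = exact).
    val Array(min, q1, median, q3, max) =
      df.stat.approxQuantile("value", Array(0.0, 0.25, 0.5, 0.75, 1.0), 0.0)

    // Outliers by the 1.5 * IQR rule.
    val iqr = q3 - q1
    val outliers = df.filter($"value" < q1 - 1.5 * iqr || $"value" > q3 + 1.5 * iqr)
    println(s"summary: $min $q1 $median $q3 $max; outliers: ${outliers.count()}")

    // Coarse histogram: bucket boundaries and counts via the RDD API.
    val (buckets, counts) = df.select("value").as[Double].rdd.histogram(5)
    println(buckets.mkString(",") + " -> " + counts.mkString(","))
  }
}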
PyCon 2012 presented information about Grid job management using Python. The document discussed the Large Hadron Collider (LHC) experiment, the Worldwide LHC Computing Grid (WLCG) which provides computing infrastructure for LHC experiments, and grid computing in general. It then summarized how the Academia Sinica Grid Computing Center (ASGC) uses Python for various tasks like job management, monitoring, and integration of grid and cloud computing as part of its role as a WLCG Tier 1 center and for regional e-science collaborations. Specific Python-based systems discussed included PanDA, GRAB, DIRAC, AliEn, and DIANE.
In this talk, we'll walk through RocksDB technology and look into areas where MyRocks is a good fit compared to other engines such as InnoDB. We will go over the internals, benchmarks, and tuning of the MyRocks engine, and explore the benefits of using MyRocks within the MySQL ecosystem. Attendees will leave with an overview of the latest development of tools and integration within MySQL.
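To make the storage layer concrete, a tiny sketch of the RocksDB key-value engine underneath MyRocks, via the RocksJava binding; the path and keys are placeholders, and MyRocks itself is used through MySQL rather than this API:

import org.rocksdb.{Options, RocksDB}

object RocksDbDemo {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val options = new Options().setCreateIfMissing(true)
    val db = RocksDB.open(options, "/tmp/rocksdb-demo")

    // Writes land in a memtable plus WAL and are later compacted into SST
    // files -- the LSM design behind MyRocks' space and write efficiency.
    db.put("user:42".getBytes, "Ada".getBytes)
    println(new String(db.get("user:42".getBytes)))

    db.close()
    options.close()
  }
}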
This document provides an overview of autonomous vehicles from legal perspectives. It discusses the international legal framework under the Vienna Convention and emerging regulations. It also examines the legal situation in Switzerland and China. Key issues that laws may need to address for autonomous vehicles are liability, privacy of data collected, and how vehicles should be programmed to handle unavoidable accidents. Regulations will need to adapt to technological advances in autonomous vehicles.
This document discusses strategies for supporting students with emotional and behavioral disabilities as they transition from an alternative school placement back to their home schools. It provides background on common characteristics and perceptions of these students. It also outlines four major categories of support: social skills instruction, classroom management techniques, cooperative learning, and promoting positive self-image. Specific strategies are proposed under each category, such as explicitly teaching social skills, using individualized reward systems, assigning student roles in cooperative groups, and providing more praise to boost self-esteem. The goal is to help these students successfully transition back to their home schools.
- Splunk DB Connect enables reliable, scalable, real-time integration between Splunk Enterprise and relational databases.
- It allows users to enrich Splunk search results with structured data from databases through lookup functions.
- Key features include exploring database schemas, importing and indexing data from databases into Splunk for analysis and visualization, and connecting to new databases quickly and at scale.
The document summarizes various programs and resources for military families, including: 1) the new Veterans Retraining Assistance Program, which provides 12 months of job training for unemployed veterans; 2) a site that keeps Sailors and Marines up to date on the latest initiatives and programs supporting their professional and personal development, with information on education, career and technical training opportunities, health and wellness, and quality of life issues, plus resources for career exploration, financial planning, VA benefits information and more (log on to get started); and Military OneSource, which provides free counseling and resources to help with the transition process (call 1-800-342-9647 for assistance).
The document is an issue of Design Engineering magazine from January/February 2014. It includes articles on various engineering topics such as additive manufacturing, motion control, automation, and Canadian engineering startups. It also contains advertisements and information on upcoming events. The cover story discusses how Canadian university research facilities partner with small and medium enterprises for commercial research and development projects.
This document provides a summary of Hsinchun Chen's education and professional experience. It outlines his PhD in Information Systems from NYU in 1989, as well as his roles as director of several research centers and labs at the University of Arizona focusing on areas such as e-commerce, public safety, infectious disease informatics, and border security. It also lists over 15 major research projects he has led with total funding over $30 million related to knowledge management, digital libraries, bioinformatics, and national security.
1. Nagoya University released its 2008 Environmental and Energy Report.
2. The report details Nagoya University's efforts to reduce CO2 emissions and energy consumption in 2008.
3. Key accomplishments included reducing total CO2 emissions by 11.0% compared to 2006 levels through energy conservation activities and use of renewable energy sources.
The document summarizes the emerging 3D printing industrial sector. It discusses key companies in the space like Ex One, Arcam, and ProtoLabs. It also addresses fundamental questions around the opportunities and challenges around industrial 3D printing of metals. Key points discussed include Arcam's leadership in 3D printed titanium hip implants, various countries' policy support for additive manufacturing research through organizations like America Makes, and predictions that the aerospace, automotive, and medical sectors could represent an $8.4 billion 3D printing market by 2025.
LCU14 310- Cisco ODP
---------------------------------------------------
Speaker: Robbie King
Date: September 17, 2014
---------------------------------------------------
★ Session Summary ★
Cisco to present their experience using ODP to provide portable accelerated access to crypto functions on various SoCs.
---------------------------------------------------
★ Resources ★
Zerista: http://lcu14.zerista.com/event/member/137757
Google Event: https://plus.google.com/u/0/events/ckmld1hll5jjijq11frbqmptet8
Video: https://www.youtube.com/watch?v=eFlTmslVK-Y&list=UUIVqQKxCyQLJS6xvSmfndLA
Etherpad: http://pad.linaro.org/p/lcu14-310
---------------------------------------------------
★ Event Details ★
Linaro Connect USA - #LCU14
September 15-19th, 2014
Hyatt Regency San Francisco Airport
---------------------------------------------------
http://www.linaro.org
http://connect.linaro.org
Netflix Open Source Meetup Season 4 Episode 2 (aspyker)
In this episode, we will take a close look at 2 different approaches to high-throughput/low-latency data stores, developed by Netflix.
The first, EVCache, is a battle-tested distributed memcached-backed data store, optimized for the cloud. You will also hear about the road ahead for EVCache as it evolves into an L1/L2 cache over RAM and SSDs.
The second, Dynomite, is a framework to make any non-distributed data-store, distributed. Netflix's first implementation of Dynomite is based on Redis.
Come learn about the products' features and hear from Thomson Reuters, Diego Pacheco from Ilegra, and other third-party speakers, internal and external to Netflix, on how these products fit in their stack and roadmap.
Apache Spark is a fast, general engine for large-scale data processing. It provides a unified analytics engine for batch, interactive, and stream processing using an in-memory abstraction called resilient distributed datasets (RDDs). Spark's speed comes from its ability to run computations directly on data stored in cluster memory and to optimize performance through caching. It also integrates well with other big data technologies like HDFS, Hive, and HBase. Many large companies use Spark for its speed, ease of use, and support for multiple workloads and languages.
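A classic illustration of that in-memory model (the HDFS path is a placeholder): the first action materializes and caches the filtered RDD, and later actions reuse the cached partitions:

import org.apache.spark.{SparkConf, SparkContext}

object RddCaching {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-caching"))

    val errors = sc.textFile("hdfs:///logs/app.log")  // placeholder path
      .filter(_.contains("ERROR"))
      .cache()                                        // keep partitions in cluster memory

    println(errors.count())                                // computes and fills the cache
    println(errors.filter(_.contains("timeout")).count())  // served from memory

    sc.stop()
  }
}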
MySQL X protocol - Talking to MySQL Directly over the Wire (Simon J Mudd)
The document discusses the MySQL X Protocol, which introduces a new way for clients to communicate directly with MySQL servers over TCP/IP. It provides an overview of how the protocol works, including capabilities exchange, authentication, querying the server for both SQL and NoSQL data, and pipelining requests. Building client drivers requires understanding the protocol by reading documentation, source code, and examples, since the documentation is still incomplete. Pipelining requests can improve performance over high-latency connections, and a formal protocol specification would help driver development and ensure compatibility as the protocol evolves.
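For a feel of the developer-facing side, a hedged sketch using Connector/J's X DevAPI, which speaks the X Protocol on port 33060; the connection details and collection are placeholders:

import com.mysql.cj.xdevapi.SessionFactory

object XProtocolDemo {
  def main(args: Array[String]): Unit = {
    // 33060 is the X Protocol port, distinct from the classic 3306.
    val session = new SessionFactory()
      .getSession("mysqlx://app:secret@localhost:33060/test")

    // Document-store calls and classic SQL share the same wire protocol.
    val people = session.getDefaultSchema.createCollection("people", true)
    people.add("""{"name": "Ada", "born": 1815}""").execute()

    val result = people.find("name = :n").bind("n", "Ada").execute()
    result.fetchAll().forEach(doc => println(doc))

    session.close()
  }
}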
The Adventure: BlackRay as a Storage Engine (fschupp)
- BlackRay is an in-memory relational database and search engine that supports SQL and fulltext search. It was originally developed in 2005 for a phone directory with over 80 million records.
- The presentation discusses BlackRay's architecture, data loading process, indexing performance, and support for transactions, clustering, and APIs. It also describes efforts to implement BlackRay as a MySQL storage engine.
- Going forward, the team aims to improve SQL support, add security features, and explore using BlackRay as a backend for other applications like LDAP directories.
AWS Big Data Demystified #1: Big data architecture lessons learned (Omid Vahdaty)
AWS Big Data Demystified #1: Big data architecture lessons learned. A quick overview of the big data technologies that were selected or disregarded in our company.
The video: https://youtu.be/l5KmaZNQxaU
Don't forget to subscribe to the YouTube channel.
The website: https://amazon-aws-big-data-demystified.ninja/
The meetup: https://www.meetup.com/AWS-Big-Data-Demystified/
The Facebook group: https://www.facebook.com/Amazon-AWS-Big-Data-Demystified-1832900280345700/
RedisConf17 - Dynomite - Making Non-distributed Databases Distributed (Redis Labs)
Dynomite is a framework that makes non-distributed databases distributed by adding a proxy layer, auto-sharding, replication across datacenters, and more. It is used at Netflix to power various services by sitting on top of Redis and providing high availability, scalability, and tunable consistency. Conductor is a workflow orchestration engine used by Netflix that stores workflow definitions and state in Dynomite to allow reusable and controllable workflow processes.
USENIX LISA15: How TubeMogul Handles over One Trillion HTTP Requests a Month (Nicolas Brousse)
TubeMogul grew from a few servers to over two thousand servers, handling over one trillion HTTP requests a month, each processed in less than 50 ms. To keep up with this fast growth, the SRE team had to implement an efficient continuous delivery infrastructure that allowed over 10,000 Puppet deployments and 8,500 application deployments in 2014. In this presentation, we cover the nuts and bolts of the TubeMogul operations engineering team and how they overcame challenges.
This document compares the two major open source databases: MySQL and PostgreSQL. It provides a brief history of each database's development. MySQL prioritized ease-of-use and performance early on, while PostgreSQL focused on features, security, and standards compliance. More recently, both databases have expanded their feature sets. The document discusses the most common uses, features, and performance of each database. It concludes that for simple queries on 2-core machines, MySQL may perform better, while PostgreSQL tends to perform better for complex queries that can leverage multiple CPU cores.
- Dynomite is a framework that makes non-distributed databases distributed by adding a proxy layer, auto-sharding, replication across datacenters, and more.
- Netflix uses Dynomite to provide high availability and scalability for several internal microservices and tools like Conductor, an orchestration engine that uses Redis.
- Conductor allows defining workflows as code and executing them in a distributed and scalable way using Dynomite and Dyno Queues.
This document discusses the Infinispan Spark connector, which provides integration between JBoss Data Grid 7 (JDG 7) and Apache Spark. It introduces JDG 7 and Apache Spark and their features. The Infinispan Spark connector allows users to create Spark RDDs and DStreams from JDG cache data, write RDDs and DStreams to JDG caches, and perform real-time stream processing with JDG as the data source for Spark. The connector supports various configurations and provides seamless functional programming with Spark. A demo of examples is referenced.
This document discusses PostgreSQL database architecture patterns for running PostgreSQL at scale when a relational database-as-a-service like Amazon RDS won't meet needs. It describes challenges faced with MySQL, Redshift, and Vertica, and how PostgreSQL was better suited thanks to techniques like partitioning by date, TOAST compression, foreign data wrappers, and poor man's parallel processing. Key takeaways are that PostgreSQL supported scaling to petabytes of data, sub-second queries across large date ranges, and the custom extensions that were needed, while avoiding the limitations and expense of other database options.
How we built QuestDB Cloud, a Kubernetes-based SaaS around QuestDB... (javier ramirez)
QuestDB is a high-performance open source database. Many people told us they would like to use it as a service, without having to manage the machines themselves. So we set to work on a solution that would let us launch QuestDB instances with fully managed provisioning, monitoring, security, and upgrades.
Quite a few Kubernetes clusters later, we launched our QuestDB Cloud offering. This talk is the story of how we got there. I will talk about tools like Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki, and Grafana, but also about challenges such as authentication, billing, and multi-cloud, and about what you have to say no to in order to survive in the cloud.
The document outlines the agenda for Season 3 Episode 1 of the Netflix OSS meetup, which includes lightning talks on 8 new projects: Atlas, Prana, Raigad, Genie 2, Inviso, Dynomite, Nicobar, and MSL. Representatives from Netflix, IBM Watson, Nike Digital, and Pivotal then each give a 3-5 minute presentation on their featured project. The presentations describe the motivation, features, and benefits of each project for observability, integration with the Netflix ecosystem, automation of Elasticsearch deployments, job scheduling, dynamic scripting for Java, message security, and developing microservices.
Openstack is open source software that allows users to create an Infrastructure as a Service (IaaS) cloud by pooling physical compute, storage, and network resources. It provides on-demand, scalable computing and storage through components like Nova (compute), Swift (object storage), Glance (images), Keystone (identity), and Quantum (networking). The presentation covers the architecture and components of Openstack, how it works from a user perspective, its history and motivation, partners, open development model, and the Openstack community in India.
This document discusses using Ansible for infrastructure automation. It provides examples of how Ansible can be used for provisioning infrastructure, configuring servers, patching, backups, cluster deployment, and scaling. It also gives three use cases: creating a platform for a client, integrating Ansible with other tools like vRA and CyberArk, and automating a two-year project involving Red Hat and Windows systems. It concludes by discussing common problems in providing "DevOps as a service" and introducing Crevise PowerOps to address them.
This document summarizes a presentation about designing systems to handle high loads when Chuck Norris is your customer. It discusses scaling architectures vertically and horizontally, RESTful principles, using NoSQL databases like MongoDB, caching with Memcached, search engines like Sphinx, video/image storage, and bandwidth management. It emphasizes that the right technology depends on business needs, and high-load systems require robust architectures, qualified developers, and avoiding single points of failure.
OpenStack Days East -- MySQL Options in OpenStack (Matt Lord)
In most production OpenStack installations, you want the backing metadata store to be highly available. For this, the de facto standard has become MySQL+Galera. In order to help you meet this basic use case even better, I will introduce you to the brand new native MySQL HA solution called MySQL Group Replication. This allows you to easily go from a single instance of MySQL to a MySQL service that's natively distributed and highly available, while eliminating the need for any third party library and implementations.
If you have an extremely large OpenStack installation in production, then you are likely to eventually run into write scaling issues and the metadata store itself can become a bottleneck. For this use case, MySQL NDB Cluster can allow you to linearly scale the metadata store as your needs grow. I will introduce you to the core features of MySQL NDB Cluster--which include in-memory OLTP, transparent sharding, and support for active/active multi-datacenter clusters--that will allow you to meet even the most demanding of use cases with ease.
MySQL is commonly used as the default database in OpenStack. It provides high availability through options like Galera and MySQL Group Replication. Galera is a third party active/active cluster that provides synchronous replication, while Group Replication is a native MySQL plugin that also enables active/active clusters with built-in conflict detection. MySQL NDB Cluster is an alternative that provides in-memory data storage with automatic sharding and strong consistency across shards. Both Galera/Group Replication and NDB Cluster can be used to implement highly available MySQL services in OpenStack environments.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/building-and-scaling-ai-applications-with-the-nx-ai-manager-a-presentation-from-network-optix/
Robin van Emden, Senior Director of Data Science at Network Optix, presents the “Building and Scaling AI Applications with the Nx AI Manager,” tutorial at the May 2024 Embedded Vision Summit.
In this presentation, van Emden covers the basics of scaling edge AI solutions using the Nx tool kit. He emphasizes the process of developing AI models and deploying them globally. He also showcases the conversion of AI models and the creation of effective edge AI pipelines, with a focus on pre-processing, model conversion, selecting the appropriate inference engine for the target hardware and post-processing.
van Emden shows how Nx can simplify the developer's life and facilitate a rapid transition from concept to production-ready applications. He provides valuable insights into developing scalable and efficient edge AI solutions, with a strong focus on practical implementation.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdf (Chart Kalyan)
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
Project Management Semester Long Project - Acuity (jpupo2018)
Acuity is an innovative learning app designed to transform the way you engage with knowledge. Powered by AI technology, Acuity takes complex topics and distills them into concise, interactive summaries that are easy to read & understand. Whether you're exploring the depths of quantum mechanics or seeking insight into historical events, Acuity provides the key information you need without the burden of lengthy texts.
Building Production Ready Search Pipelines with Spark and Milvus (Zilliz)
Spark is a widely used ETL tool for processing, indexing, and ingesting data into the serving stack for search. Milvus is a production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data, extract vector representations, and push the vectors to the Milvus vector database for search serving.
Driving Business Innovation: Latest Generative AI Advancements & Success Story (Safe Software)
Are you ready to revolutionize how you handle data? Join us for a webinar where we’ll bring you up to speed with the latest advancements in Generative AI technology and discover how leveraging FME with tools from giants like Google Gemini, Amazon, and Microsoft OpenAI can supercharge your workflow efficiency.
During the hour, we’ll take you through:
Guest Speaker Segment with Hannah Barrington: Dive into the world of dynamic real estate marketing with Hannah, the Marketing Manager at Workspace Group. Hear firsthand how their team generates engaging descriptions for thousands of office units by integrating diverse data sources—from PDF floorplans to web pages—using FME transformers, like OpenAIVisionConnector and AnthropicVisionConnector. This use case will show you how GenAI can streamline content creation for marketing across the board.
Ollama Use Case: Learn how Scenario Specialist Dmitri Bagh has utilized Ollama within FME to input data, create custom models, and enhance security protocols. This segment will include demos to illustrate the full capabilities of FME in AI-driven processes.
Custom AI Models: Discover how to leverage FME to build personalized AI models using your data. Whether it’s populating a model with local data for added security or integrating public AI tools, find out how FME facilitates a versatile and secure approach to AI.
We’ll wrap up with a live Q&A session where you can engage with our experts on your specific use cases, and learn more about optimizing your data workflows with AI.
This webinar is ideal for professionals seeking to harness the power of AI within their data management systems while ensuring high levels of customization and security. Whether you're a novice or an expert, gain actionable insights and strategies to elevate your data processes. Join us to see how FME and AI can revolutionize how you work with data!
Let's Integrate MuleSoft RPA, COMPOSER, APM with AWS IDP along with Slack (shyamraj55)
Discover the seamless integration of RPA (Robotic Process Automation), COMPOSER, and APM with AWS IDP enhanced with Slack notifications. Explore how these technologies converge to streamline workflows, optimize performance, and ensure secure access, all while leveraging the power of AWS IDP and real-time communication via Slack notifications.
Monitoring and Managing Anomaly Detection on OpenShift.pdf (Tosin Akinosho)
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
Main news related to the CCS TSI 2023 (2023/1695) (Jakub Marek)
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by the CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, held at the Clarion Hotel Olomouc from 7 to 9 November 2023 (konferenceszt.cz). It was attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
GraphRAG for Life Science to increase LLM accuracy (Tomaz Bratanic)
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Generating privacy-protected synthetic data using Secludy and Milvus (Zilliz)
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
Taking AI to the Next Level in Manufacturing.pdf (ssuserfac0301)
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
5. Ideas and approaches to help build your organization's AI strategy.
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Unlock the Future of Search with MongoDB Atlas: Vector Search Unleashed.pdf (Malak Abu Hammad)
Discover how MongoDB Atlas and vector search technology can revolutionize your application's search capabilities. This comprehensive presentation covers:
* What is Vector Search?
* Importance and benefits of vector search
* Practical use cases across various industries
* Step-by-step implementation guide
* Live demos with code snippets
* Enhancing LLM capabilities with vector search
* Best practices and optimization strategies
Perfect for developers, AI enthusiasts, and tech leaders. Learn how to leverage MongoDB Atlas to deliver highly relevant, context-aware search results, transforming your data retrieval process. Stay ahead in tech innovation and maximize the potential of your applications.
#MongoDB #VectorSearch #AI #SemanticSearch #TechInnovation #DataScience #LLM #MachineLearning #SearchTechnology
HCL Notes and Domino license cost reduction in the world of DLAU (panagenda)
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-und-domino-lizenzkostenreduzierung-in-der-welt-von-dlau/
DLAU and licensing under the CCB and CCX models have been a hot topic for many in the HCL community since last year. As a Notes or Domino customer, you may be struggling with unexpectedly high user counts and license fees. You may be wondering how this new kind of licensing works and what benefits it brings you. Above all, you surely want to stay within your budget and save costs wherever possible. We understand that, and we want to help!
We explain how to resolve common configuration problems that can cause more users to be counted than necessary, and how to identify and remove superfluous or unused accounts to save money. There are also some approaches that can lead to unnecessary expenses, for example when a person document is used instead of a mail-in for shared mailboxes. We show you such cases and their solutions. And of course we explain the new license model.
Join this webinar, in which HCL Ambassador Marc Thomas and guest speaker Franz Walder bring this new world closer to you. It gives you the tools and the know-how to keep an overview. You will be able to reduce your costs through an optimized Domino configuration and keep them low in the future as well.
These topics are covered:
- Reducing license costs by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how best to use it
- Tips for common problem areas, such as team mailboxes, functional/test users, etc.
- Real-world examples and best practices to implement immediately
TrustArc Webinar - 2024 Global Privacy Survey (TrustArc)
How does your privacy program stack up against your peers? What challenges are privacy teams tackling and prioritizing in 2024?
In the fifth annual Global Privacy Benchmarks Survey, we asked over 1,800 global privacy professionals and business executives to share their perspectives on the current state of privacy inside and outside of their organizations. This year’s report focused on emerging areas of importance for privacy and compliance professionals, including considerations and implications of Artificial Intelligence (AI) technologies, building brand trust, and different approaches for achieving higher privacy competence scores.
See how organizational priorities and strategic approaches to data security and privacy are evolving around the globe.
This webinar will review:
- The top 10 privacy insights from the fifth annual Global Privacy Benchmarks Survey
- The top challenges for privacy leaders, practitioners, and organizations in 2024
- Key themes to consider in developing and maintaining your privacy program
4. What is BlackRay?
● BlackRay is a relational, in-memory database
● SQL with JDBC and ODBC Driver support
● Fulltext (Tokenized) Search in Text fields
● Object-Oriented API Support
● Persistence via Files
● Scalable and Fault Tolerant
● Open Source, Open Community
● Available under the GPLv2
5. BlackRay History
● Concept of BlackRay was developed in 1999 for a Web Phone Directory at Deutsche Telekom
● Development of current BlackRay started in 2005, as a Data Access Library
● First Production Use at Deutsche Telekom in 2006
● Evolution to Data Engine in 2007 and 2008
● Open Source under GPLv2 since June 2009
6. Why Build Another Database?
● Rather unique set of requirements:
● Phone Directory with approx. 80 Million subscribers
● All queries in the 10-500 Millisecond range
● Approximately 2000 concurrent users
● Over 500 Queries per Second (sustained)
● Updates run once a day and may take only several minutes
● Index needs to be tokenized (SQL: CONTAINS)
● Phonetic search
● Extensive Wildcard queries (leading/midspan/trailing)
7. Decision to Implement BlackRay
● The decision was made in mid-2005
● Designed as a lightweight data access engine
● Implementation in C++ for performance and maintainability
● APIs for access across the network from different languages (Java, C++)
● Feature set was designed for our specific business case in a particular project
8. Current Release
● Current release 0.9.0, released on June 12th, 2009
● Production level quality
● All relevant index functions for large scale search applications
● APIs in C++ and Java fully functional
● SQL still covers only a small subset of the API functionality
9. Some more details
● Written in C++
● Relies heavily on Boost
● Compiles well with gcc and Sun Studio 12
● Behaves well on Linux, Solaris, OpenSolaris and MacOS
● Complete 64-bit development and platform support
● Uses CMake for multi-platform support
11. Why call it Data Engine?
● BlackRay is a hybrid between a relational database and a search engine → thus we call it a "data engine"
● Database features:
● Relational structure, with Join between tables
● Wildcards and index functions
● SQL and JDBC/ODBC
● Search Engine features:
● Fulltext retrieval (token index)
● Phonetic and similar approximation search
● Extremely low latency
12. BlackRay Architecture
[Architecture diagram: PostgreSQL-protocol SQL clients and the native C++, Java, Python, C# and PHP APIs connect to the Instance Server; a Management Server brokers the instances. Each instance holds the Data Universe with the 5-Perspective Index (L1: Dictionary, L2: Postings, L3: Row Index, L4: Multi-Tokens, L5: Multi-Values), plus a Redo Log and Snapshots.]
13. Hierarchical Model
● Each BlackRay node can hold many Instances
● One Management process per node
● Each Instance runs in a separate Process
● Instances are completely separated from each other
● Snapshots (for persistence) are taken on an Instance level
● In an Instance, Schemas and Tables can be created
● Within one Instance, Queries can span across Tables and Schemas
14. Getting Data Into BlackRay
● Once an Instance is created, data can be loaded
● Schemas, Tables, and Indexes are created using an XML description language
● Standard loader utility to load CSV data
● Bulk loading is done with logging disabled
● Indexing is done with maximum degree of parallelism depending on CPUs
● After all data is indexed, a snapshot can be taken (see the sketch below)
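The whole workflow can be pictured end to end. The following Java sketch is purely illustrative: the loader binary name, its flags, and the file names are invented, since the slides only state that an XML description language and a standard CSV loader utility exist.

```java
import java.io.IOException;

// Hypothetical illustration of the bulk-load workflow. The binary name
// "blackray-loader" and all flags are invented for this sketch; only the
// overall sequence (XML schema, CSV load, snapshot) comes from the slides.
public class BulkLoadSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Schemas, tables and indexes are described in an XML file; bulk
        // loading runs with logging disabled and indexes in parallel
        // across the available CPUs.
        Process load = new ProcessBuilder(
                "blackray-loader",           // hypothetical utility name
                "--instance", "phonebook",   // hypothetical instance name
                "--schema", "schema.xml",    // XML description language
                "--csv", "subscribers.csv")
            .inheritIO()
            .start();
        if (load.waitFor() != 0) {
            throw new IOException("bulk load failed");
        }
        // Once indexing completes, a snapshot would be taken to persist
        // the loaded data (e.g. via the management tooling).
    }
}
```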
15. Basic Load Performance Data
● German yellowpage and whitepage data
● 60 Million subscribers
● 100 Million phone numbers
● Raw data approx. 16GB
● Indexing performance:
● Total index size 11GB
● Time to index: 40 Minutes, on dual 2GHz Xeon (Linux)
● Updates: 300MB, 200K rows in approx. 5 minutes
● Time to load snapshot: 3.5 Minutes for 11GB
16. Data Universe
● BlackRay features a 5-Perspective Index:
● Layer 1: Dictionary
● Layer 2: Postings
● Layer 3: Row Index
● Layer 4: Multi-Token Layer
● Layer 5: Multi-Value Layer
● Layers 1 and 2 comprise a fully inverted Index (illustrated below)
● Statistics in this Index are used for Query Plan Building
● All data - index and raw output - are held in memory
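To make layers 1 and 2 concrete, here is a minimal, self-contained Java sketch of a fully inverted index: a dictionary of tokens, each pointing to a posting list of row IDs. This illustrates the concept only and is not BlackRay code; every name in it is invented.

```java
import java.util.*;

// Minimal illustration of a fully inverted index (dictionary + postings),
// the idea behind layers 1 and 2 of the BlackRay Data Universe.
public class InvertedIndexSketch {
    // Layer 1 (dictionary of tokens) mapping to Layer 2 (posting lists of row IDs)
    private final Map<String, List<Integer>> postings = new HashMap<>();

    // Tokenize a text field and record each token against the row ID.
    void indexRow(int rowId, String text) {
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new ArrayList<>()).add(rowId);
            }
        }
    }

    // Token (fulltext) lookup: return the posting list for a single token.
    List<Integer> lookup(String token) {
        return postings.getOrDefault(token.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.indexRow(1, "Mueller Bakery Berlin");
        idx.indexRow(2, "Berlin Taxi Service");
        System.out.println(idx.lookup("berlin")); // prints [1, 2]
    }
}
```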
17. Snapshots and Data Versioning
● Persistence is done via file based snapshots
● Snapshots consist of all schemas in one instance
● Snapshots have a version number
● To make a backup of the data, simply copy the snapshot file to a backup medium (see the sketch below)
● It is possible to load an older snapshot: Data is version controlled if older snapshots are stored
● Note: Currently Snapshots are not fully portable.
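Because a backup is just a file copy, any standard tool or a few lines of code will do. A minimal Java sketch, assuming a hypothetical snapshot path and version-numbered file name (neither is prescribed by BlackRay here):

```java
import java.io.IOException;
import java.nio.file.*;

// Back up a snapshot by copying the file to backup media. The paths and
// the version-numbered file name are assumptions for this example.
public class SnapshotBackup {
    public static void main(String[] args) throws IOException {
        Path snapshot = Paths.get("/var/blackray/instance1/snapshot-42.bin");
        Path backupDir = Paths.get("/mnt/backup/blackray");
        Files.createDirectories(backupDir);
        // Keeping the version number in the name preserves the ability
        // to load an older snapshot later (data versioning).
        Files.copy(snapshot, backupDir.resolve(snapshot.getFileName()),
                   StandardCopyOption.REPLACE_EXISTING);
    }
}
```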
18. Transactions in BlackRay
● BlackRay supports transactions via a Redo Log
● All commands that modify data are logged if requested
● In case of a crash, the latest snapshot will be loaded
● Replaying the transaction log then brings the database back to a consistent state (see the sketch below)
● The Redo Log is cleared when a snapshot is persisted
● For better performance, snapshots should be taken periodically, ideally after each bulk update
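The recovery sequence can be summarized in a few lines. The interfaces below are invented purely to illustrate the order of operations; they are not the BlackRay API.

```java
import java.util.List;

// Conceptual sketch of crash recovery: load the latest snapshot, then
// replay the redo log to reach a consistent state. All types and
// methods here are hypothetical, for illustration only.
interface SnapshotStore { Snapshot loadLatest(); }
interface RedoLog { List<Command> entriesSince(long snapshotVersion); }
interface Snapshot { long version(); void apply(Command cmd); }
interface Command {}

class RecoverySketch {
    static Snapshot recover(SnapshotStore store, RedoLog log) {
        Snapshot state = store.loadLatest();
        // Replay only the commands logged after the snapshot was
        // persisted; the log itself is cleared at each new snapshot.
        for (Command cmd : log.entriesSince(state.version())) {
            state.apply(cmd);
        }
        return state;
    }
}
```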
19. Native Query APIs
● C++, Java and Python Object Oriented APIs are available
● Built with ICE (ZeroC) as the Network and Object Brokerage Protocol
● Query Objects are constructed using an Object Builder
● Execution via the network, Results as Objects
● Load balancing and failover built into the protocol
● Native support for Ruby, PHP and C# upcoming
20. Native Query APIs
● Queries can use any combination of OR, AND
● Index functions (phonetic, synonyms, stopwords) can be stacked
● Token search (fulltext) and wildcard are also supported (see the sketch below)
● Advantages of the APIs:
● Minimized overhead
● Very low latency
● High availability
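As a rough picture of what stacked index functions and boolean combination could look like against an object-builder API, consider this Java sketch. Every type and method name is hypothetical; the real API shapes its query objects through ICE, so actual names will differ.

```java
import java.util.List;

// Hypothetical query-builder sketch: an index function (phonetic), a
// fulltext token match, and a trailing wildcard, combined with AND/OR.
// All interfaces and methods are invented for illustration.
class QueryBuilderSketch {
    static List<String> findSubscribers(QueryBuilder qb) {
        Query q = qb.field("lastname").phonetic("Maier")        // index function
                    .and(qb.field("city").token("Berlin"))      // fulltext token
                    .or(qb.field("street").wildcard("Haupt*")); // trailing wildcard
        return q.execute(); // executed over the network, results as objects
    }
}
interface QueryBuilder { Term field(String name); }
interface Term {
    Query phonetic(String value);
    Query token(String value);
    Query wildcard(String pattern);
}
interface Query {
    Query and(Query other);
    Query or(Query other);
    List<String> execute();
}
```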
21. Standard Query Interface
● BlackRay implements the PostgreSQL server socket interface
● All PostgreSQL compatible frontends can be utilized against BlackRay (see the JDBC example below)
● JDBC, ODBC, and native drivers using the socket interface are all supported
● Limitations: SQL is only a reasonable subset of SQL92, Metadata is not yet fully implemented...
● Currently not all index functions can be accessed via SQL statements
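Since BlackRay speaks the PostgreSQL wire protocol, the stock PostgreSQL JDBC driver should work unchanged. A minimal sketch; the host, port, credentials, and the table and column names are assumptions for this example, and only a subset of SQL92 is available:

```java
import java.sql.*;

// Connecting through the PostgreSQL wire protocol with the standard
// PostgreSQL JDBC driver. Host, port, instance and table names are
// assumed; a trailing wildcard (LIKE 'Mue%') is the cheapest kind.
public class JdbcExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:postgresql://localhost:5432/phonebook"; // assumed
        try (Connection con = DriverManager.getConnection(url, "user", "pass");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                 "SELECT name, number FROM subscribers WHERE name LIKE 'Mue%'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getString(2));
            }
        }
    }
}
```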
22. Management Features
● Management Server acts as central broker for all Instances
● Command line tools for administration
● SNMP management:
● Health check of Instances
● Statistics, including access counters and performance measurements
23. Clustering and Replication
● BlackRay supports a multi-node setup for fault tolerance
● Clusters have N query nodes and a single update node
● All query nodes are equal and require the full index data to be available
● Index creation is done offline on the update node
● Query nodes can reload snapshots to receive updated data, via shared storage or local copy
● Load-balancing is handled by the native APIs
24. Administration Requirements
● In-Memory Databases require very few administrative tasks
● Configuration via one file per Instance
● Disk layout etc. is of no importance
● Backups are performed by copying snapshots
● Recovery is done by restoring a snapshot
● No daily administration required
● SNMP allows remote supervision with common tools
26. BlackRay and other Projects
● BlackRay is being positioned as a Query intensive database addition
● Ideally suited where updates are done in Bulk and searches outweigh updates by many orders of magnitude
● Wildcards come with little overhead: No overhead for trailing wildcard, some overhead for leading and midspan wildcard
● Good match when index functions such as phonetic or word rotation/position search combined with relational data are required
27. FOSS Projects with similar Goals
● Relational Databases (the usual suspects):
● MySQL, MariaDB, Drizzle....
● PostgreSQL...
● Many more alternatives, including embedded etc....
● Fulltext Search:
● Sphinx
● Lucene
● In-Memory Databases:
● FastDB
● HSQLDB/H2 (Java → Garbage collection issues....)
28. Commercial Alternatives
● Commercial In-Memory Databases:
● ORACLE/TimesTen (Acquired by ORACLE in 2005)
● IBM/SolidDB (Acquired by IBM in 2007)
● VoltDB (No real data available as of yet)
● eXtremeDB (embedded use only)
● Dual-Licensed Alternatives:
● CSQL (Open Source Version is severely crippled)
● MySQL with memcached
29. Is it the right thing for me?
● BlackRay is not designed as a 100% RDBMS replacement
● Questions to ask:
● Do I need ad-hoc data updates, or are updates done in bulk?
● How important are fulltext search and extensive wildcards?
● How large is my data? Gigabytes: OK. Terabytes: Not yet
● Do I need a relational data model?
● Is SQL an important feature?
30. BlackRay will fit you well when...
… searches outweigh updates
… data is primarily updated in bulk
… fulltext search is important
… you have Gigabytes (not Terabytes) of data
… index functions (phonetic …) are required
… SQL is necessary
… a relational data model is required (JOIN)
… source code must be available
… high availability/clustering is required
32. Immediate Roadmap
● Upcoming 0.10.0 – Due in December 2009
● Complete rewrite of SQL Parser (boost::spirit2)
● PostgreSQL client compatibility (via network protocol) to allow JDBC/ODBC... via PostgreSQL driver
● Rewritten CLI tools
● Major bugfixes (potential memory leaks)
● Authentication supported for Instances
● BlackRay Admin Console (Remora) 0.10
33. Immediate Roadmap
● Planned 0.11.0 – Due in February 2010
● Support for ad-hoc data manipulation via SQL (INSERT/UPDATE/DELETE)
● Aggregate functions for SELECT
● Make all index functions available in SQL
● Support for Prepared Statements (ODBC/JDBC)
● Improved thread and memory management (Perftools?)
● BlackRay Admin Console (Remora) 0.11
● Engine Statistics via GUI
● Cluster Node management
34. Midterm Roadmap
● Scalability Features:
● Sharding & Partitioning Options
● Federated Search
● Fully portable snapshot format (across platforms)
● Query Performance Analyzer
● Improved Statistics Module with GUI
● Removal of ICE as a core component
● BlackRay as a Storage Backend for the SUN OpenDS LDAP Engine
35. Midterm Roadmap
● Security Features:
● Improved User and Access Control concepts
● SSL for all connections
● External User Store (LDAP/OpenSSO/PAM...)
● Increased Platform support:
● Windows 7 and Windows Server platforms
● Embedded platforms
● Other, random features by popular request.
36. Longterm Roadmap
● Integration with other DBMSs
● Storage Engine for other DBMSs (MariaDB/MySQL) → depends on potential modification needs of the storage engine interfaces
● Trigger-based updates to support BlackRay as a Query cache instance over a regular DBMS
● Update: The Storage Engine Interface issue was discussed at the OpenSQL Camp 2009 in Portland; Tokutek suggested a portable and more efficient Universal Storage Engine Interface
37. Longterm Roadmap
● Standalone Engine Improvements:
● SQL92 compliance: SUBSELECT/UNION support
● Triggers in BlackRay
● Full ACID Transaction support
39. SoftMethod GmbH
● SoftMethod GmbH initiated the project in 2005
● The company was founded in 2004 and currently has 10 employees
● Focus of SoftMethod is high performance software engineering
● Product portfolio includes directory assistance and LDAP enabled applications
● SoftMethod also offers load testing and technical software quality assurance support.
40. Development Team
● SoftMethod Core Team:
● Thomas Wunschel – Lead Developer
● Felix Schupp – Project Sponsor
● Andreas Meyer – Documentation, Porting
● Frank Fiedler – PhD Student in Database Systems
● Key Contributors:
● Mike Alexeev – Senior Developer
● Souvik Roy – Intern at SoftMethod GmbH
● Andreas Strafner – Developer, Porting to AIX
41. Thomas Wunschel
● Director of Development, SoftMethod GmbH
● Almost 10 years of development experience
● Involved with BlackRay and its applications since 2005
● Currently involved in the Network Protocol Stack
● Lead Designer and Decision Lead for new Features
42. Felix Schupp
● Managing Director, SoftMethod GmbH
● Over 10 years of commercial software development
● Designer of the first BlackRay predecessor in 1999
● Project sponsor and spokesperson
● Responsible for funding and applications
● Guide and coach in the development process
43. Mike Alexeev
● Senior Software Developer
● Over 10 years of C++ experience
● First outside committer to BlackRay
● Currently involved in rewriting the SQL grammar to support all index features available via the APIs
44. Souvik Roy
● Computer Science Student and Software Developer
● Several years of C++ and Java experience
● Current Intern at SoftMethod GmbH
● Google Summer of Code participant
● Working on reliable performance comparison with other Engines
● Provides additional sample applications
46. What to do next
● Get BlackRay:
● Register yourself on http://forge.softmethod.de
● SVN checkout available at http://svn.softmethod.de/opensource/blackray/trunk
● Get Involved:
● Anyone can register and create tickets, news etc.
● We have an active mailing list for discussion as well
● Contribute:
● A signed Contributor Agreement is required before commit access to the repository is granted