Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

•

9 gostaram•3,756 visualizações

This document discusses speeding up OLAP cube building in Apache Kylin using Spark. Cubing with MapReduce can be slow due to serialization overhead and repeated job submissions. Spark allows caching data in memory across cuboid layers in one job, significantly reducing build times compared to MapReduce as shown in a benchmark on a 160 million row dataset. Spark simplifies Kylin development and brings capabilities for real-time OLAP and cloud integration.

Dados e análise

Luke Han, luke@kyligence.io
Shaofeng Shi, shaofeng@kyligence.io
Apache Kylin:
Speed up Cubing with Spark

About Us
Shaofeng Shi (史少锋)
• Apache Kylin PMC
• Sr. Architect,
Kyligence Inc.
Luke Han (韩卿)
• VP of Apache Kylin
• Co-founder & CEO,
Kyligence Inc.
formed by the team who created Apache Kylin
http://kyligence.io

Agenda
• What is Apache Kylin
• Challenges with MapReduce
• Speed up Cubing with Spark
• Q&A

About Apache Kylin
ü Leading open source OLAP on Hadoop
ü Fast growing open source community
ü Adopted by 200+ global organizations
ü First born in China Apache Top Level Project
ü InfoWorld BossieAward:
– Best Open Source Big Data Tool (2015)
– Best Open Source Big Data Tool (2016)
Kylin / ˈkiːˈlɪn / 麒麟
-- n. (in Chinese art) a mythical animal of composite form
Apache Kylin
Extreme OLAP Engine for Big Data
http://kylin.apache.org

Global Users 200+ use cases in production

About Apache Kylin
• Extreme OLAP Engine on Hadoop
ü High Performance
ü High Concurrency
ü ANSI SQL
ü Native on Hadoop
ü Cloud Ready

The Basic: OLAP Cube
• Pre-calculate metrics by dimensions

The Magic: Pre-Calculation
Sort
Aggr
Filter
Tables
Join
Sort
Filter
Online:Query
from Cube
Post
Aggr.
SELECT
returned,
status,
sum(quantity),
sum(price)
FROM
lineitem INNER JOIN orders
on l_orderkey =o_orderkey
WHERE
shipdate =’2016-09-16'
GROUP BY
returned, status
ORDER BY
returned, status;
Offline:Cubing
Cube

Apache Kylin: O(1)
O(N)
O(1)
Apache Kylin
Data Size
Response
Time
Other Engines

Star-Schema Benchmark
Run SSB at 10, 20 and 40 million-row scales,
Kylin’s response time keep stable
Read more at: https://github.com/kyligence/ssb-kylin
Kylin vs Hive, O(1) : O(N)

Cubing Process
Cubing: Map -> Shuffle -> Reduce

Build Cube with MapReduce
• Calculate Cuboids by layer :N
dim (Base cuboid), N-1 dim,
N-2…, 1, 0
• Reuse previous layer’s result
• HDFS used for data sharing
• Totally need N round MR;

Challenges with MapReduce
• Slow data sharing
ü Serialization, Replication, Deserialization…
• Repeated job submission
ü Submit dependent jar/files repeatedly
ü Re-queue when cluster is busy
• Short of streaming support
ü Couldn’t support low latency data processing
• Limited storage support
ü Couldn’t ingest data from non-HDFS

Speed up Cubing with Spark
• Abstract each layer cuboids as
a RDD
• Cache parent RDD to generate
Child RDD
• Export RDD when Child be
generated
• As-is measure aggregators can
be reused with Spark Java API
RDD-1
RDD-2
RDD-3
RDD-4
RDD-5

Speed up Cubing with Spark
Reuse data in memory

Speed up Cubing with Spark
One job finishes all layers aggregation!

Performance Benchmark
• Environment
ü 4 nodes Hadoop cluster; each has 28 GB RAM and 12 cores
ü CDH 5.8, Apache Kylin 2.0
• Spark
ü Spark 1.6.3 on YARN
ü 6 executors, each has 4 cores, 5GB memory
• Test Data
ü Airline data of US DoT, totally 160 million rows
ü Cube:10 dimensions, 5 measures (SUM)
• Test Scenarios
ü Build the cube at different scale: 3 million, 50 million and 160 million rows

Spark Cubing vs. MR Cubing
Build Cube Time Comparison
BuildTime(minute)
0
8
15
23
30
Source data rows
3-million 50-million 160-million
MapReduce Spark MapReduce Spark MapReduce Spark
17
8
3
29
19
6

Spark Cubing vs. MR In-Mem Cubing
MR In-mem is slow when dataset is big
and random distributed
When data is in sharding,
MR In-mem can be fast
Spark is fast &
stable regardless
of data distribution

Benefits with Spark
• Spark speeds up Cubing at 1x
ü Half time be saved
• Spark simplifies Kylin’s development
ü More functions, less codes
• Spark brings Kylin to a new era
ü Real-time OLAP, Ad hoc query, Cloud integration

Thank You.
Join Apache Kylin community
dev@kylin.apache.org
Visit Kylin & Kyligence home
http://kylin.apache.org
http://kyligence.io

Mais conteúdo relacionado

Mais procurados

Lakehouses are quickly growing in popularity as a new approach to Data Platform Architecture bringing some of the long-established benefits from OLTP world to OLAP, including transactions, record-level updates/deletes, and changes streaming. In this talk, we will discuss Apache Hudi and how it unlocks possibilities of building your own fully open-source Lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will present: - What Lakehouses are, and why they are needed. - What Apache Hudi is and how it works. - Provide a use-case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar.

Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...

StreamNative

Memory management is at the heart of any data-intensive system. Spark, in particular, must arbitrate memory allocation between two main use cases: buffering intermediate data for processing (execution) and caching user data (storage). This talk will take a deep dive through the memory management designs adopted in Spark since its inception and discuss their performance and usability implications for the end user.

Deep Dive: Memory Management in Apache Spark

Databricks

Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet’s features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he’s learned, creating the missing guide you need. Topics include: * The tools and techniques Netflix uses to analyze Parquet tables * How to spot common problems * Recommendations for Parquet configuration settings to get the best performance out of your processing platform * The impact of this work in speeding up applications like Netflix’s telemetry service and A/B testing platform

Parquet performance tuning: the missing guide

Ryan Blue

Developing Real-Time Data Pipelines with Apache Kafka

Joe Stein

The Evolution of Apache Kylin

DataWorks Summit/Hadoop Summit

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Databricks

Running Apache Spark on Kubernetes: Best Practices and Pitfalls

Databricks

Running Apache Kafka in production is only the first step in the Kafka operations journey. Professional Kafka users are ready to handle all possible disasters - because for most businesses having a disaster recovery plan is not optional. In this session, we’ll discuss disaster scenarios that can take down entire Kafka clusters and share advice on how to plan, prepare and handle these events. This is a technical session full of best practices - we want to make sure you are ready to handle the worst mayhem that nature and auditors can cause. Visit www.confluent.io for more information.

Disaster Recovery Plans for Apache Kafka

confluent

PySpark Best Practices

Cloudera, Inc.

Optimizing spark jobs through a true understanding of spark core. Learn: What is a partition? What is the difference between read/shuffle/write partitions? How to increase parallelism and decrease output files? Where does shuffle data go between stages? What is the "right" size for your spark partitions and files? Why does a job slow down with only a few tasks left and never finish? Why doesn't adding nodes decrease my compute time?

Apache Spark Core—Deep Dive—Proper Optimization

Databricks

Apache Spark in Depth: Core Concepts, Architecture & Internals

Anton Kirillov

Not all workloads allow cloud computing. Low latency, cybersecurity, and cost-efficiency require a suitable combination of edge computing and cloud integration. This session explores architectures and design patterns for software and hardware considerations to deploy hybrid data streaming with Apache Kafka anywhere. A live demo shows data synchronization from the edge to the public cloud across continents with Kafka on Hivecell and Confluent Cloud.

Kafka for Real-Time Replication between Edge and Hybrid Cloud

Kai Wähner

Enabling Vectorized Engine in Apache Spark

Kazuaki Ishizaki

From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning

confluent

Accelerating Big Data Analytics with Apache Kylin

Tyler Wishnoff

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Databricks

What if Looker saw the queries you just executed and could predict your next query? Could it make those queries faster, by smarter caching, or aggregate navigation? Could it read your past SQL queries and help you write your LookML model? Those are some of the reasons to add relational algebra into Looker’s query engine, and why Looker hired Julian Hyde, author of Apache Calcite, to lead the effort. In this talk about the internals of Looker’s query engine, Julian Hyde will describe how the engine works, how Looker queries are described in Calcite’s relational algebra, and some features that it makes possible. A talk by Julian Hyde at JOIN 2019 in San Francisco.

Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...

Julian Hyde

Hudi architecture, fundamentals and capabilities

Nishith Agarwal

Introduction SQL Analytics on Lakehouse Architecture

Databricks

This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.

Introduction to Apache Flink - Fast and reliable big data processing

Till Rohrmann

Mais procurados (20)

Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...

Deep Dive: Memory Management in Apache Spark

Parquet performance tuning: the missing guide

Developing Real-Time Data Pipelines with Apache Kafka

The Evolution of Apache Kylin

Fine Tuning and Enhancing Performance of Apache Spark Jobs

Running Apache Spark on Kubernetes: Best Practices and Pitfalls

Disaster Recovery Plans for Apache Kafka

PySpark Best Practices

Apache Spark Core—Deep Dive—Proper Optimization

Apache Spark in Depth: Core Concepts, Architecture & Internals

Kafka for Real-Time Replication between Edge and Hybrid Cloud

Enabling Vectorized Engine in Apache Spark

From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning

Accelerating Big Data Analytics with Apache Kylin

A Thorough Comparison of Delta Lake, Iceberg and Hudi

Smarter Together - Bringing Relational Algebra, Powered by Apache Calcite, in...

Hudi architecture, fundamentals and capabilities

Introduction SQL Analytics on Lakehouse Architecture

Introduction to Apache Flink - Fast and reliable big data processing

Semelhante a Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

Given at Data Day Seattle 2015. Bitly generates over 9 billion clicks on shortened links a month, as well as over 100 million unique link shortens. Analyzing data of this scale is not without its challenges. At Bitly, we have started adopting Apache Spark as a way to process our data. In this talk, I’ll elaborate on how I use Spark as part of my data science workflow. I’ll cover how Spark fits into our existing architecture, the kind of problems I’m solving with Spark, and the benefits and challenges of using Spark for large-scale data science.

Data Science at Scale: Using Apache Spark for Data Science at Bitly

Sarah Guido

Big Data Processing with Apache Spark 2014

mahchiev

Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Evan Chan

Apache Spark in Industry

Dorian Beganovic

http://bit.ly/1BTaXZP – As organizations look for even faster ways to derive value from big data, they are turning to Apache Spark is an in-memory processing framework that offers lightning-fast big data analytics, providing speed, developer productivity, and real-time processing advantages. The Spark software stack includes a core data-processing engine, an interface for interactive querying, Spark Streaming for streaming data analysis, and growing libraries for machine-learning and graph analysis. Spark is quickly establishing itself as a leading environment for doing fast, iterative in-memory and streaming analysis. This talk will give an introduction the Spark stack, explain how Spark has lighting fast results, and how it complements Apache Hadoop. By the end of the session, you’ll come away with a deeper understanding of how you can unlock deeper insights from your data, faster, with Spark.

Intro to Apache Spark by CTO of Twingo

MapR Technologies

This session covers our experience with using the Spark and Shark frameworks for running real-time queries on top of Cassandra data.We will start by surveying the current Cassandra analytics landscape, including Hadoop and HIVE, and touch on the use of custom input formats to extract data from Cassandra. We will then dive into Spark and Shark, two memory-based cluster computing frameworks, and how they enable often dramatic improvements in query speed and productivity, over the standard solutions today.

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra

DataStax Academy

The combination of Deep Learning with Apache Spark has the potential for tremendous impact in many sectors of the industry. This webinar, based on the experience gained in assisting customers with the Databricks Virtual Analytics Platform, will present some best practices for building deep learning pipelines with Spark. Rather than comparing deep learning systems or specific optimizations, this webinar will focus on issues that are common to deep learning frameworks when running on a Spark cluster, including: * optimizing cluster setup; * configuring the cluster; * ingesting data; and * monitoring long-running jobs. We will demonstrate the techniques we cover using Google’s popular TensorFlow library. More specifically, we will cover typical issues users encounter when integrating deep learning libraries with Spark clusters. Clusters can be configured to avoid task conflicts on GPUs and to allow using multiple GPUs per worker. Setting up pipelines for efficient data ingest improves job throughput, and monitoring facilitates both the work of configuration and the stability of deep learning jobs.

Deep Learning on Apache® Spark™ : Workflows and Best Practices

Jen Aman

Deep Learning on Apache® Spark™: Workflows and Best Practices

Databricks

Deep Learning on Apache® Spark™: Workflows and Best Practices

Jen Aman

Webinar - DreamObjects/Ceph Case Study

Ceph Community

Nous vivons une époque où tout est connecté, de nos ampoules à notre éditeur de texte, les objets et services qui nous entourent devienne de plus en plus intelligents. Pour ce faire ils génèrent des données, elles sont nécessaires au bon fonctionnement du service ou de l'objet, mais elles sont également utiles pour faire évoluer les produits. Ces données peuvent rapidement représenter de gros volumes, plusieurs dizaines voir centaines de gigaoctets. Une question se pose alors, comment traiter de tels volumes ? Comment en tirer du sens et de la valeur à cette échelle ? Avec OVHcloud Data Processing, une solution basée sur le framework Apache Spark, nous répondons à ce besoin. Venez découvrir comment vous aussi, en quelques cliques, pouvez exécuter votre code sur une infrastructure taillée pour vos besoins. Au travers de différents exemples, comme une analyse du traffic des taxis New-Yorkais, nous verrons comment Data Processing a été pensé, comment il fonctionne et comment il peut être utilisé pour valoriser vos données.

OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...

OVHcloud

The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.

New Analytics Toolbox DevNexus 2015

Robbie Strickland

Apache Spark Fundamentals

Zahra Eskandari

Apache Spark for Everyone - Women Who Code Workshop

Amanda Casari

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

Mac Moore

H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...

Lucidworks

Spark Uber Development Kit

DataWorks Summit/Hadoop Summit

Apache kylin - Big Data Technology Conference 2014 Beijing

Luke Han

Big Data training

vishal192091

Trend Micro Big Data Platform and Apache Bigtop

Evans Ye

Semelhante a Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi (20)

Data Science at Scale: Using Apache Spark for Data Science at Bitly

Big Data Processing with Apache Spark 2014

Cassandra Day 2014: Interactive Analytics with Cassandra and Spark

Apache Spark in Industry

Intro to Apache Spark by CTO of Twingo

Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra

Deep Learning on Apache® Spark™ : Workflows and Best Practices

Deep Learning on Apache® Spark™: Workflows and Best Practices

Webinar - DreamObjects/Ceph Case Study

OVHcloud Tech Talks S01E09 - OVHcloud Data Processing : Le nouveau service po...

New Analytics Toolbox DevNexus 2015

Apache Spark Fundamentals

Apache Spark for Everyone - Women Who Code Workshop

Scaling Spark Workloads on YARN - Boulder/Denver July 2015

H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smile...

Spark Uber Development Kit

Apache kylin - Big Data Technology Conference 2014 Beijing

Big Data training

Trend Micro Big Data Platform and Apache Bigtop

Mais de Databricks

DW Migration Webinar-March 2022.pptx

Databricks

The world of data architecture began with applications. Next came data warehouses. Then text was organized into a data warehouse. Then one day the world discovered a whole new kind of data that was being generated by organizations. The world found that machines generated data that could be transformed into valuable insights. This was the origin of what is today called the data lakehouse. The evolution of data architecture continues today. Come listen to industry experts describe this transformation of ordinary data into a data architecture that is invaluable to business. Simply put, organizations that take data architecture seriously are going to be at the forefront of business tomorrow. This is an educational event. Several of the authors of the book Building the Data Lakehouse will be presenting at this symposium.

Data Lakehouse Symposium | Day 1 | Part 1

Databricks

Data Lakehouse Symposium | Day 1 | Part 2

Databricks

Data Lakehouse Symposium | Day 2

Databricks

Data Lakehouse Symposium | Day 4

Databricks

In this session, learn how to quickly supplement your on-premises Hadoop environment with a simple, open, and collaborative cloud architecture that enables you to generate greater value with scaled application of analytics and AI on all your data. You will also learn five critical steps for a successful migration to the Databricks Lakehouse Platform along with the resources available to help you begin to re-skill your data teams.

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Databricks

Bad data leads to bad decisions and broken customer experiences. Organizations depend on complete and accurate data to power their business, maintain efficiency, and uphold customer trust. With thousands of datasets and pipelines running, how do we ensure that all data meets quality standards, and that expectations are clear between producers and consumers? Investing in shared, flexible components and practices for monitoring data health is crucial for a complex data organization to rapidly and effectively scale. At Zillow, we built a centralized platform to meet our data quality needs across stakeholders. The platform is accessible to engineers, scientists, and analysts, and seamlessly integrates with existing data pipelines and data discovery tools. In this presentation, we will provide an overview of our platform’s capabilities, including: Giving producers and consumers the ability to define and view data quality expectations using a self-service onboarding portal Performing data quality validations using libraries built to work with spark Dynamically generating pipelines that can be abstracted away from users Flagging data that doesn’t meet quality standards at the earliest stage and giving producers the opportunity to resolve issues before use by downstream consumers Exposing data quality metrics alongside each dataset to provide producers and consumers with a comprehensive picture of health over time

Democratizing Data Quality Through a Centralized Platform

Databricks

Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.

Learn to Use Databricks for Data Science

Databricks

Application performance monitoring (APM) has become the cornerstone of software engineering allowing engineering teams to quickly identify and remedy production issues. However, as the world moves to intelligent software applications that are built using machine learning, traditional APM quickly becomes insufficient to identify and remedy production issues encountered in these modern software applications. As a lead software engineer at NewRelic, my team built high-performance monitoring systems including Insights, Mobile, and SixthSense. As I transitioned to building ML Monitoring software, I found the architectural principles and design choices underlying APM to not be a good fit for this brand new world. In fact, blindly following APM designs led us down paths that would have been better left unexplored. In this talk, I draw upon my (and my team’s) experience building an ML Monitoring system from the ground up and deploying it on customer workloads running large-scale ML training with Spark as well as real-time inference systems. I will highlight how the key principles and architectural choices of APM don’t apply to ML monitoring. You’ll learn why, understand what ML Monitoring can successfully borrow from APM, and hear what is required to build a scalable, robust ML Monitoring architecture.

Why APM Is Not the Same As ML Monitoring

Databricks

Autonomy and ownership are core to working at Stitch Fix, particularly on the Algorithms team. We enable data scientists to deploy and operate their models independently, with minimal need for handoffs or gatekeeping. By writing a simple function and calling out to an intuitive API, data scientists can harness a suite of platform-provided tooling meant to make ML operations easy. In this talk, we will dive into the abstractions the Data Platform team has built to enable this. We will go over the interface data scientists use to specify a model and what that hooks into, including online deployment, batch execution on Spark, and metrics tracking and visualization.

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Databricks

In this talk, I will dive into the stage level scheduling feature added to Apache Spark 3.1. Stage level scheduling extends upon Project Hydrogen by improving big data ETL and AI integration and also enables multiple other use cases. It is beneficial any time the user wants to change container resources between stages in a single Apache Spark application, whether those resources are CPU, Memory or GPUs. One of the most popular use cases is enabling end-to-end scalable Deep Learning and AI to efficiently use GPU resources. In this type of use case, users read from a distributed file system, do data manipulation and filtering to get the data into a format that the Deep Learning algorithm needs for training or inference and then sends the data into a Deep Learning algorithm. Using stage level scheduling combined with accelerator aware scheduling enables users to seamlessly go from ETL to Deep Learning running on the GPU by adjusting the container requirements for different stages in Spark within the same application. This makes writing these applications easier and can help with hardware utilization and costs. There are other ETL use cases where users want to change CPU and memory resources between stages, for instance there is data skew or perhaps the data size is much larger in certain stages of the application. In this talk, I will go over the feature details, cluster requirements, the API and use cases. I will demo how the stage level scheduling API can be used by Horovod to seamlessly go from data preparation to training using the Tensorflow Keras API using GPUs. The talk will also touch on other new Apache Spark 3.1 functionality, such as pluggable caching, which can be used to enable faster dataframe access when operating from GPUs.

Stage Level Scheduling Improving Big Data and AI Integration

Databricks

In this talk, I would like to introduce an open-source tool built by our team that simplifies the data conversion from Apache Spark to deep learning frameworks. Imagine you have a large dataset, say 20 GBs, and you want to use it to train a TensorFlow model. Before feeding the data to the model, you need to clean and preprocess your data using Spark. Now you have your dataset in a Spark DataFrame. When it comes to the training part, you may have the problem: How can I convert my Spark DataFrame to some format recognized by my TensorFlow model? The existing data conversion process can be tedious. For example, to convert an Apache Spark DataFrame to a TensorFlow Dataset file format, you need to either save the Apache Spark DataFrame on a distributed filesystem in parquet format and load the converted data with third-party tools such as Petastorm, or save it directly in TFRecord files with spark-tensorflow-connector and load it back using TFRecordDataset. Both approaches take more than 20 lines of code to manage the intermediate data files, rely on different parsing syntax, and require extra attention for handling vector columns in the Spark DataFrames. In short, all these engineering frictions greatly reduced the data scientists’ productivity. The Databricks Machine Learning team contributed a new Spark Dataset Converter API to Petastorm to simplify these tedious data conversion process steps. With the new API, it takes a few lines of code to convert a Spark DataFrame to a TensorFlow Dataset or a PyTorch DataLoader with default parameters. In the talk, I will use an example to show how to use the Spark Dataset Converter to train a Tensorflow model and how simple it is to go from single-node training to distributed training on Databricks.

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Databricks

Pipelines have become ubiquitous, as the need for stringing multiple functions to compose applications has gained adoption and popularity. Common pipeline abstractions such as “fit” and “transform” are even shared across divergent platforms such as Python Scikit-Learn and Apache Spark. Scaling pipelines at the level of simple functions is desirable for many AI applications, however is not directly supported by Ray’s parallelism primitives. In this talk, Raghu will describe a pipeline abstraction that takes advantage of Ray’s compute model to efficiently scale arbitrarily complex pipeline workflows. He will demonstrate how this abstraction cleanly unifies pipeline workflows across multiple platforms such as Scikit-Learn and Spark, and achieves nearly optimal scale-out parallelism on pipelined computations. Attendees will learn how pipelined workflows can be mapped to Ray’s compute model and how they can both unify and accelerate their pipelines with Ray.

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Databricks

In this talk about zipline, we will introduce a new type of windowing construct called a sawtooth window. We will describe various properties about sawtooth windows that we utilize to achieve online-offline consistency, while still maintaining high-throughput, low-read latency and tunable write latency for serving machine learning features.We will also talk about a simple deployment strategy for correcting feature drift – due operations that are not “abelian groups”, that operate over change data.

Sawtooth Windows for Feature Aggregations

Databricks

We want to present multiple anti patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark.All examples presented are tried and tested in production at Scale at Adobe. The most common integration is spark-redis which interfaces with Redis as a Dataframe backing Store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high throughput applications in Spark. Niche 1 : Long Running Spark Batch Job – Dispatch New Jobs by polling a Redis Queue · Why? o Custom queries on top a table; We load the data once and query N times · Why not Structured Streaming · Working Solution using Redis Niche 2 : Distributed Counters · Problems with Spark Accumulators · Utilize Redis Hashes as distributed counters · Precautions for retries and speculative execution · Pipelining to improve performance

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Databricks

In the era of microservices, decentralized ML architectures and complex data pipelines, data quality has become a bigger challenge than ever. When data is involved in complex business processes and decisions, bad data can, and will, affect the bottom line. As a result, ensuring data quality across the entire ML pipeline is both costly, and cumbersome while data monitoring is often fragmented and performed ad hoc. To address these challenges, we built whylogs, an open source standard for data logging. It is a lightweight data profiling library that enables end-to-end data profiling across the entire software stack. The library implements a language and platform agnostic approach to data quality and data monitoring. It can work with different modes of data operations, including streaming, batch and IoT data. In this talk, we will provide an overview of the whylogs architecture, including its lightweight statistical data collection approach and various integrations. We will demonstrate how the whylogs integration with Apache Spark achieves large scale data profiling, and we will show how users can apply this integration into existing data and ML pipelines.

Re-imagine Data Monitoring with whylogs and Spark

Databricks

Machine learning (ML) models are typically part of prediction queries that consist of a data processing part (e.g., for joining, filtering, cleaning, featurization) and an ML part invoking one or more trained models. In this presentation, we identify significant and unexplored opportunities for optimization. To the best of our knowledge, this is the first effort to look at prediction queries holistically, optimizing across both the ML and SQL components. We will present Raven, an end-to-end optimizer for prediction queries. Raven relies on a unified intermediate representation that captures both data processing and ML operators in a single graph structure. This allows us to introduce optimization rules that (i) reduce unnecessary computations by passing information between the data processing and ML operators (ii) leverage operator transformations (e.g., turning a decision tree to a SQL expression or an equivalent neural network) to map operators to the right execution engine, and (iii) integrate compiler techniques to take advantage of the most efficient hardware backend (e.g., CPU, GPU) for each operator. We have implemented Raven as an extension to Spark’s Catalyst optimizer to enable the optimization of SparkSQL prediction queries. Our implementation also allows the optimization of prediction queries in SQL Server. As we will show, Raven is capable of improving prediction query performance on Apache Spark and SQL Server by up to 13.1x and 330x, respectively. For complex models, where GPU acceleration is beneficial, Raven provides up to 8x speedup compared to state-of-the-art systems. As part of the presentation, we will also give a demo showcasing Raven in action.

Raven: End-to-end Optimization of ML Prediction Queries

Databricks

Semantic segmentation is the classification of every pixel in an image/video. The segmentation partitions a digital image into multiple objects to simplify/change the representation of the image into something that is more meaningful and easier to analyze [1][2]. The technique has a wide variety of applications ranging from perception in autonomous driving scenarios to cancer cell segmentation for medical diagnosis. Exponential growth in the datasets that require such segmentation is driven by improvements in the accuracy and quality of the sensors generating the data extending to 3D point cloud data. This growth is further compounded by exponential advances in cloud technologies enabling the storage and compute available for such applications. The need for semantically segmented datasets is a key requirement to improve the accuracy of inference engines that are built upon them. Streamlining the accuracy and efficiency of these systems directly affects the value of the business outcome for organizations that are developing such functionalities as a part of their AI strategy. This presentation details workflows for labeling, preprocessing, modeling, and evaluating performance/accuracy. Scientists and engineers leverage domain-specific features/tools that support the entire workflow from labeling the ground truth, handling data from a wide variety of sources/formats, developing models and finally deploying these models. Users can scale their deployments optimally on GPU-based cloud infrastructure to build accelerated training and inference pipelines while working with big datasets. These environments are optimized for engineers to develop such functionality with ease and then scale against large datasets with Spark-based clusters on the cloud.

Processing Large Datasets for ADAS Applications using Apache Spark

Databricks

At Adobe Experience Platform, we ingest TBs of data every day and manage PBs of data for our customers as part of the Unified Profile Offering. At the heart of this is a bunch of complex ingestion of a mix of normalized and denormalized data with various linkage scenarios power by a central Identity Linking Graph. This helps power various marketing scenarios that are activated in multiple platforms and channels like email, advertisements etc. We will go over how we built a cost effective and scalable data pipeline using Apache Spark and Delta Lake and share our experiences. What are we storing? Multi Source – Multi Channel Problem Data Representation and Nested Schema Evolution Performance Trade Offs with Various formats Go over anti-patterns used (String FTW) Data Manipulation using UDFs Writer Worries and How to Wipe them Away Staging Tables FTW Datalake Replication Lag Tracking Performance Time!

Massive Data Processing in Adobe Using Delta Lake

Databricks

Detecting advanced email attacks at scale is a challenging ML problem, particularly due to the rarity of attacks, adversarial nature of the problem, and scale of data. In order to move quickly and adapt to the newest threat we needed to build a Continuous Integration / Continuous Delivery pipeline for the entire ML detection stack. Our goal is to enable detection engineers and data scientists to make changes to any part of the stack including joined datasets for hydration, feature extraction code, detection logic, and develop/train ML models. In this talk, we discuss why we decided to build this pipeline, how it is used to accelerate development and ensure quality, and dive into the nitty-gritty details of building such a system on top of an Apache Spark + Databricks stack.

Machine Learning CI/CD for Email Attack Detection

Databricks

Mais de Databricks (20)

DW Migration Webinar-March 2022.pptx

Data Lakehouse Symposium | Day 1 | Part 1

Data Lakehouse Symposium | Day 1 | Part 2

Data Lakehouse Symposium | Day 2

Data Lakehouse Symposium | Day 4

5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop

Democratizing Data Quality Through a Centralized Platform

Learn to Use Databricks for Data Science

Why APM Is Not the Same As ML Monitoring

The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix

Stage Level Scheduling Improving Big Data and AI Integration

Simplify Data Conversion from Spark to TensorFlow and PyTorch

Scaling and Unifying SciKit Learn and Apache Spark Pipelines

Sawtooth Windows for Feature Aggregations

Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink

Re-imagine Data Monitoring with whylogs and Spark

Raven: End-to-end Optimization of ML Prediction Queries

Processing Large Datasets for ADAS Applications using Apache Spark

Massive Data Processing in Adobe Using Delta Lake

Machine Learning CI/CD for Email Attack Detection

Último

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We are available 24*7 Booking Contact Details :- WhatsApp Chat :- +91-7014168258 If you're looking for India Call girls you've come to the right place. You'll find some of the most beautiful call girls in our location with. These ladies have pleasing personalities, hot figures, and a passion for physical pleasure. Call girls in India Lucknow Many men have booked them for their erotic and soul-mixing performances, which are sure to leave you with unforgettable memories. #K09 Escort Service India is available in the city for men and women of all ages. They can satisfy your sexual needs and will make your experience even more enjoyable and memorable. Whether you're looking for a blow-job, stripping, lovemaking, or other dirty acts, you'll be able to find a match for your tastes and budget. These highly trained professionals will help you have an unforgettable night. One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7014168258 We are available 24*7 all days of the year. Call us — 7014168258 Thank you for Visiting.

Top profile Call Girls In Hapur [ 7014168258 ] Call Me For Genuine Models We ...

nirzagarg

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models We are available 24*7 Booking Contact Details :- WhatsApp Chat :- +91-7014168258 If you're looking for India Call girls you've come to the right place. You'll find some of the most beautiful call girls in our location with. These ladies have pleasing personalities, hot figures, and a passion for physical pleasure. Call girls in India Lucknow Many men have booked them for their erotic and soul-mixing performances, which are sure to leave you with unforgettable memories. #K09 Escort Service India is available in the city for men and women of all ages. They can satisfy your sexual needs and will make your experience even more enjoyable and memorable. Whether you're looking for a blow-job, stripping, lovemaking, or other dirty acts, you'll be able to find a match for your tastes and budget. These highly trained professionals will help you have an unforgettable night. One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7014168258 We are available 24*7 all days of the year. Call us — 7014168258 Thank you for Visiting.

Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...

gajnagarg

Switzerland Constitution 2002.pdf.........

EfruzAsilolu

Context 1. Housing Agent collected resale prices on HDB apartments in Singapore. Objective 2. To predict resale prices in to advise his potential clients. Strategies 3. Explore & Clean data for analysis. 4. Perform K-Means Clustering, in Orange, to find possible segments in the customer data. 5. Tune the model to improve its performance. 6. Visualise the findings, share conclusions, and give insight-driven recommendations. Author: Anthony mok Date: 18 Nov 2023 Email: xxiaohao@yahoo.com

Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange

ThinkInnovation

原版定制【微信:176555708】【伦敦大学毕业证（UoL毕业证书）】【微信:176555708】（留信学历认证永久存档查询）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！

怎样办理伦敦大学毕业证（UoL毕业证书）成绩单学校原版复制

vexqp

Aspirational Block Program Block Syaldey District - Almora

GovindSinghDasila

原版定制【微信:176555708】【圣路易斯大学毕业证（SLU毕业证书）】【微信:176555708】（留信学历认证永久存档查询）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！

怎样办理圣路易斯大学毕业证（SLU毕业证书）成绩单学校原版复制

vexqp

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models We are available 24*7 Booking Contact Details :- WhatsApp Chat :- +91-7014168258 If you're looking for India Call girls you've come to the right place. You'll find some of the most beautiful call girls in our location with. These ladies have pleasing personalities, hot figures, and a passion for physical pleasure. Call girls in India Lucknow Many men have booked them for their erotic and soul-mixing performances, which are sure to leave you with unforgettable memories. #K09 Escort Service India is available in the city for men and women of all ages. They can satisfy your sexual needs and will make your experience even more enjoyable and memorable. Whether you're looking for a blow-job, stripping, lovemaking, or other dirty acts, you'll be able to find a match for your tastes and budget. These highly trained professionals will help you have an unforgettable night. One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7014168258 We are available 24*7 all days of the year. Call us — 7014168258 Thank you for Visiting.

Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...

gajnagarg

#Dubai Call Girls Agency +971525547819 #Indian And Pakistani Call Girls Dubai #Dubai Indian Call Girls Agency Class Call Girls In Dubai #First Class Call Girls In Dubai #Full Massage Services Call Girls In Dubai #Al Jaddaf,Al Jaffiliya,Business Bay,Al Karama,Bur Dubai,Deira,Dubai,Palm Jumeirah,Al Wasl,Trade Centre,Dubai Mall,JBR,JVC,JLT,Discovery Garden #Dubai Call Girls Services Provide In Ajman_Dubai_RAK_UMQ_Fujairah_Abu_Dhabi#Indian #Tamil #Kerala #Russian #Philippine #Morocco #Thailand #English Models In Dubai #If You Want Serv#Dubai Pakistani Call Girls Agency #Beautiful Call Girls in Dubai #High ices Just Send Me Text On Whatsapp +971525547819 #Website Link http://Dubaicallgirls.pro https://chatwith.io/s/65d1df48b2992

Dubai Call Girls Peeing O525547819 Call Girls Dubai

kojalkojal131

Digital advertising, or paid media, encompasses the strategic deployment of online advertisements to reach target audiences efficiently and effectively. This includes any digital platform that supports advertising to deliver unique messages for any objective. Understanding the mechanics of digital advertising platforms, along with insights into audience behaviors and preferences, allows marketers to optimize their ad spend and achieve significant engagement and conversion rates. This lecture is for Advanced Digital & Social Media Strategy (MGMTX 466.05) at UCLA Extension.

Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...

Valters Lauzums

In my capstone project, I investigated the impact of COVID-19 on education. Using data analysis and statistical methods, I explored various aspects such as enrollment trends, access to resources, and socioeconomic disparities. I found a significant association between children missing classes and a lack of internet connection at home, as well as between household financial situations and children's enrollment in school. These findings highlight the importance of addressing disparities in internet access, household finances, and geographical location to ensure equal educational opportunities for all students during and beyond the pandemic.

Capstone in Interprofessional Informatic // IMPACT OF COVID 19 ON EDUCATION

LakpaYanziSherpa

Gartner's Data Analytics Maturity Model.pptx

chadhar227

Abortion pills in Kuwait city !🌆+918761049707^) Where to get cytotec pills in salmiyah, WhatsApp '''(+918761049707)''' Explore Discreet and Safe Abortion Pill Options in Dubai, Empowering Women With Confidential and Medically Supervised Choices For Reproductive Health" Abortion Pills In Dubai, Abu Dhabi/Alain/Sharjah/RAK City Satwa-Mifepristone and Misoprostol Available in Dubai/Abu Dhabi - i Pills in Dubai and Cost Of Cytotec in WhatsApp (+918761049707) Abortion Pills in Dubai/Abu Dubai/ Abu Dhabi. Are you stranded with unwanted pregnancy in Dubai, Abu Dhabi , the United Arab Emirates (U.A.E), Qatar, Oman, Saudi Arabia or Kuwait? you can now contact us now on Whatsapp Dr AJ:+918761049707to buy safe abortion pills In Dubai, Abu Dhabi, Sharjah, Al Ain, Ajman, RAK City, Ras Al Khaimah and Fujairah to terminate an unwanted pregnancy in Dubai and the United Arab Emirates. Get your Discreet 100% Safe (+918761049707 )*Effective Abortion Pills For Sale in Dubai, Abu Dhabi, KUWAIT, QATAR, BAHRAIN, DOHA, SALMIYA, Sharjah, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE. BUY Mifepristone and Misoprostol (Cytotec), Mtp Kit In UAE. Abortion pills available in UAE (United Arab Emirates), Saudi Arabia, Kuwait, Oman, Bahrain and Qatar. Contact us today. +918761049707 -The UAE’s leading abortion care service in Dubai. Abortion Treatment. Medical Abortion. Surgical Abortion. Find A Clinic like Dr Maria Abortion clinic in Dubai We have Abortion Pills / Cytotec Tablets Available in Dubai, Abu Dhabi, KUWAIT, QATAR, BAHRAIN, DOHA, SALMIYA, Sharjah, Ajman, Alain, Fujairah, Ras Al Khaimah, Umm Al Quwain, UAE., buy cytotec in Dubai abortion Pills Cytotec also available Oman Qatar Doha Saudi Arabia Bahrain We sell original abortion medicine which includes: Cytotec 200mcg (Misoprostol), Mifepristone, Mifegest-kit, Misoclear, Emergency contraceptive pills, Morning after sex pills, ipills, pills to prevent pregnancy 72 hours after sex. All our pills are manufactured by reputable medical manufacturing companies like PFIZER. Medical abortion is easy and effective for everyone to perform in their own privacy. There are very few complications that may arise from medical abortion if one follows the right guidelines as instructed by the obstetrician. Abortion Pills in Dubai Can Now Be Offered at Dr Maria Abortion clinic in Dubai, F.D.A. Mifepristone and Misoprostol, the first of two drugs in medication abortions, previously had to be dispensed only by clinics, doctors or a few mail-order pharmacies like Dr Maria Abortion clinic in Dubai . Now, We can provide it. For the first time, retail pharmacies, like Dr Maria Abortion clinic in Dubai, will be Able to offer abortion pills in Dubai under a regulatory change made Tuesday by the Food and Drug Administration. The action could significantly expand access to abortion through medication. Cytotec Abortion Pills are Available In Dubai / UAE, you will be very happy to do abortion in Dubai we are providing cytotec 200mg

In Riyadh ((+919101817206)) Cytotec kit @ Abortion Pills Saudi Arabia

ahmedjiabur940

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models We are available 24*7 Booking Contact Details :- WhatsApp Chat :- +91-7014168258 If you're looking for India Call girls you've come to the right place. You'll find some of the most beautiful call girls in our location with. These ladies have pleasing personalities, hot figures, and a passion for physical pleasure. Call girls in India Lucknow Many men have booked them for their erotic and soul-mixing performances, which are sure to leave you with unforgettable memories. #K09 Escort Service India is available in the city for men and women of all ages. They can satisfy your sexual needs and will make your experience even more enjoyable and memorable. Whether you're looking for a blow-job, stripping, lovemaking, or other dirty acts, you'll be able to find a match for your tastes and budget. These highly trained professionals will help you have an unforgettable night. One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7014168258 We are available 24*7 all days of the year. Call us — 7014168258 Thank you for Visiting.

Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...

gajnagarg

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We are available 24*7 Booking Contact Details :- WhatsApp Chat :- +91-7014168258 If you're looking for India Call girls you've come to the right place. You'll find some of the most beautiful call girls in our location with. These ladies have pleasing personalities, hot figures, and a passion for physical pleasure. Call girls in India Lucknow Many men have booked them for their erotic and soul-mixing performances, which are sure to leave you with unforgettable memories. #K09 Escort Service India is available in the city for men and women of all ages. They can satisfy your sexual needs and will make your experience even more enjoyable and memorable. Whether you're looking for a blow-job, stripping, lovemaking, or other dirty acts, you'll be able to find a match for your tastes and budget. These highly trained professionals will help you have an unforgettable night. One Shot — 5000/in call (time 1 hour), 6000/out call Two shot with one girl — 8000/in call (time 2 hour), 10000/out call Body to body massage with sex- 8000/in call (time 1 hour) Full night Service for one person– 12000/in call, 13000/out call (shot limit 3-4 shots) Full night Service for more than 1 person — please contact Us —7014168258 We are available 24*7 all days of the year. Call us — 7014168258 Thank you for Visiting.

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...

nirzagarg

Digital Transformation Playbook by Graham Ware

Graham Ware

Saudi Arabia [ Abortion pills) Jeddah/riaydh/dammam/+966572737505☎️] cytotec tablets uses abortion pills 💊💊 How effective is the abortion pill? 💊💊 +966572737505) "Abortion pills in Jeddah" how to get cytotec tablets in Riyadh " Abortion pills in dammam*💊💊 The abortion pill is very effective. If you’re taking mifepristone and misoprostol, it depends on how far along the pregnancy is, and how many doses of medicine you take:💊💊 +966572737505) how to buy cytotec pills At 8 weeks pregnant or less, it works about 94-98% of the time. +966572737505[ 💊💊💊 At 8-9 weeks pregnant, it works about 94-96% of the time. +966572737505) At 9-10 weeks pregnant, it works about 91-93% of the time. +966572737505)💊💊 If you take an extra dose of misoprostol, it works about 99% of the time. At 10-11 weeks pregnant, it works about 87% of the time. +966572737505) If you take an extra dose of misoprostol, it works about 98% of the time. In general, taking both mifepristone and+966572737505 misoprostol works a bit better than taking misoprostol only. +966572737505 Taking misoprostol alone works to end the+966572737505 pregnancy about 85-95% of the time — depending on how far along the+966572737505 pregnancy is and how you take the medicine. +966572737505 The abortion pill usually works, but if it doesn’t, you can take more medicine or have an in-clinic abortion. +966572737505 When can I take the abortion pill?+966572737505 In general, you can have a medication abortion up to 77 days (11 weeks)+966572737505 after the first day of your last period. If it’s been 78 days or more since the first day of your last+966572737505 period, you can have an in-clinic abortion to end your pregnancy.+966572737505 Why do people choose the abortion pill? Which kind of abortion you choose all depends on your personal+966572737505 preference and situation. With+966572737505 medication+966572737505 abortion, some people like that you don’t need to have a procedure in a doctor’s office. You can have your medication abortion on your own+966572737505 schedule, at home or in another comfortable place that you choose.+966572737505 You get to decide who you want to be with during your abortion, or you can go it alone. Because+966572737505 medication abortion is similar to a miscarriage, many people feel like it’s more “natural” and less invasive. And some+966572737505 people may not have an in-clinic abortion provider close by, so abortion pills are more available to+966572737505 them. +966572737505 Your doctor, nurse, or health center staff can help you decide which kind of abortion is best for you. +966572737505 More questions from patients: Saudi Arabia+966572737505 CYTOTEC Misoprostol Tablets. Misoprostol is a medication that can prevent stomach ulcers if you also take NSAID medications. It reduces the amount of acid in your stomach, which protects your stomach lining. The brand name of this medication is Cytotec®.+966573737505) Unwanted Kit is a combination of two medicin

Abortion pills in Jeddah | +966572737505 | Get Cytotec

Abortion pills in Riyadh +966572737505 get cytotec

PLE-statistics document for primary schs

cnajjemba

Discover Why Less is More in B2B Research

michael115558

原版定制【微信:176555708】【圣地亚哥州立大学毕业证（SDSU毕业证书）】【微信:176555708】（留信学历认证永久存档查询）采用学校原版纸张、特殊工艺完全按照原版一比一制作（包括：隐形水印，阴影底纹，钢印LOGO烫金烫银，LOGO烫金烫银复合重叠，文字图案浮雕，激光镭射，紫外荧光，温感，复印防伪）行业标杆！精益求精，诚心合作，真诚制作！多年品质 ,按需精细制作，24小时接单,全套进口原装设备，十五年致力于帮助留学生解决难题，业务范围有加拿大、英国、澳洲、韩国、美国、新加坡，新西兰等学历材料，包您满意。【业务选择办理准则】一、工作未确定，回国需先给父母、亲戚朋友看下文凭的情况，办理一份就读学校的毕业证【微信176555708】文凭即可二、回国进私企、外企、自己做生意的情况，这些单位是不查询毕业证真伪的，而且国内没有渠道去查询国外文凭的真假，也不需要提供真实教育部认证。鉴于此，办理一份毕业证【微信176555708】即可三、进国企，银行，事业单位，考公务员等等，这些单位是必需要提供真实教育部认证的，办理教育部认证所需资料众多且烦琐，所有材料您都必须提供原件，我们凭借丰富的经验，快捷的绿色通道帮您快速整合材料，让您少走弯路。留信网认证的作用: 1:该专业认证可证明留学生真实身份 2:同时对留学生所学专业登记给予评定 3:国家专业人才认证中心颁发入库证书 4:这个认证书并且可以归档倒地方 5:凡事获得留信网入网的信息将会逐步更新到个人身份内，将在公安局网内查询个人身份证信息后，同步读取人才网入库信息 6:个人职称评审加20分 7:个人信誉贷款加10分 8:在国家人才网主办的国家网络招聘大会中纳入资料，供国家高端企业选择人才留信网服务项目： 1、留学生专业人才库服务（留信分析） 2、国（境）学习人员提供就业推荐信服务 3、留学人员区块链存储服务 → 【关于价格问题（保证一手价格）】我们所定的价格是非常合理的，而且我们现在做得单子大多数都是代理和回头客户介绍的所以一般现在有新的单子我给客户的都是第一手的代理价格，因为我想坦诚对待大家不想跟大家在价格方面浪费时间对于老客户或者被老客户介绍过来的朋友，我们都会适当给一些优惠。选择实体注册公司办理，更放心，更安全！我们的承诺：可来公司面谈，可签订合同，会陪同客户一起到教育部认证窗口递交认证材料，客户在教育部官方认证查询网站查询到认证通过结果后付款，不成功不收费！

怎样办理圣地亚哥州立大学毕业证（SDSU毕业证书）成绩单学校原版复制

vexqp

Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

1. Luke Han, luke@kyligence.io Shaofeng Shi, shaofeng@kyligence.io Apache Kylin: Speed up Cubing with Spark

2. About Us Shaofeng Shi (史少锋) • Apache Kylin PMC • Sr. Architect, Kyligence Inc. Luke Han (韩卿) • VP of Apache Kylin • Co-founder & CEO, Kyligence Inc. formed by the team who created Apache Kylin http://kyligence.io

3. Agenda • What is Apache Kylin • Challenges with MapReduce • Speed up Cubing with Spark • Q&A

4. About Apache Kylin ü Leading open source OLAP on Hadoop ü Fast growing open source community ü Adopted by 200+ global organizations ü First born in China Apache Top Level Project ü InfoWorld BossieAward: – Best Open Source Big Data Tool (2015) – Best Open Source Big Data Tool (2016) Kylin / ˈkiːˈlɪn / 麒麟 -- n. (in Chinese art) a mythical animal of composite form Apache Kylin Extreme OLAP Engine for Big Data http://kylin.apache.org

5. Global Users 200+ use cases in production

6. Use Case

7. About Apache Kylin • Extreme OLAP Engine on Hadoop ü High Performance ü High Concurrency ü ANSI SQL ü Native on Hadoop ü Cloud Ready

8. The Basic: OLAP Cube • Pre-calculate metrics by dimensions

9. The Magic: Pre-Calculation Sort Aggr Filter Tables Join Sort Filter Online:Query from Cube Post Aggr. SELECT returned, status, sum(quantity), sum(price) FROM lineitem INNER JOIN orders on l_orderkey =o_orderkey WHERE shipdate =’2016-09-16' GROUP BY returned, status ORDER BY returned, status; Offline:Cubing Cube

10. Apache Kylin: O(1) O(N) O(1) Apache Kylin Data Size Response Time Other Engines

11. Star-Schema Benchmark Run SSB at 10, 20 and 40 million-row scales, Kylin’s response time keep stable Read more at: https://github.com/kyligence/ssb-kylin Kylin vs Hive, O(1) : O(N)

12. Architecture

13. Cubing Process Cubing: Map -> Shuffle -> Reduce

14. Build Cube with MapReduce • Calculate Cuboids by layer :N dim (Base cuboid), N-1 dim, N-2…, 1, 0 • Reuse previous layer’s result • HDFS used for data sharing • Totally need N round MR;

15. Challenges with MapReduce • Slow data sharing ü Serialization, Replication, Deserialization… • Repeated job submission ü Submit dependent jar/files repeatedly ü Re-queue when cluster is busy • Short of streaming support ü Couldn’t support low latency data processing • Limited storage support ü Couldn’t ingest data from non-HDFS

16.

17. Speed up Cubing with Spark • Abstract each layer cuboids as a RDD • Cache parent RDD to generate Child RDD • Export RDD when Child be generated • As-is measure aggregators can be reused with Spark Java API RDD-1 RDD-2 RDD-3 RDD-4 RDD-5

18. Speed up Cubing with Spark Reuse data in memory

19. Speed up Cubing with Spark One job finishes all layers aggregation!

20. Performance Benchmark • Environment ü 4 nodes Hadoop cluster; each has 28 GB RAM and 12 cores ü CDH 5.8, Apache Kylin 2.0 • Spark ü Spark 1.6.3 on YARN ü 6 executors, each has 4 cores, 5GB memory • Test Data ü Airline data of US DoT, totally 160 million rows ü Cube:10 dimensions, 5 measures (SUM) • Test Scenarios ü Build the cube at different scale: 3 million, 50 million and 160 million rows

21. Spark Cubing vs. MR Cubing Build Cube Time Comparison BuildTime(minute) 0 8 15 23 30 Source data rows 3-million 50-million 160-million MapReduce Spark MapReduce Spark MapReduce Spark 17 8 3 29 19 6

22. Spark Cubing vs. MR In-Mem Cubing MR In-mem is slow when dataset is big and random distributed When data is in sharding, MR In-mem can be fast Spark is fast & stable regardless of data distribution

23. Benefits with Spark • Spark speeds up Cubing at 1x ü Half time be saved • Spark simplifies Kylin’s development ü More functions, less codes • Spark brings Kylin to a new era ü Real-time OLAP, Ad hoc query, Cloud integration

24. Thank You. Join Apache Kylin community dev@kylin.apache.org Visit Kylin & Kyligence home http://kylin.apache.org http://kyligence.io

Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi

Semelhante a Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi (20)

Mais de Databricks

Mais de Databricks (20)

Último

Último (20)

Apache Kylin: Speed Up Cubing with Apache Spark with Luke Han and Shaofeng Shi