Hopsworks – Self-Service Spark/Flink/Kafka/Hadoop
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
Apache BigData Europe, 2016
2016-11-14 ApacheCon Europe BigData, Hopsworks, J Dowling, Nov 2016 2/58
Hadoop is not a cool kid anymore!
Where did it go wrong for Hadoop?
•Data Engineers/Scientists
- Where is the User-Friendly tooling and Self-Service?
- How do I install/operate anything other than a sandbox VM?
•Operations Folks
- Security model has become incomprehensible (Sentry/Ranger)
- Major distributions not open enough for patching
- Sensitive data still requires its own cluster
- Why not just use AWS EMR/GCE/Databricks/etc ?!?
Is this Hadoop?
Resource Manager: Mesos, Kubernetes, YARN
Storage: HDFS, GCS, S3, WFS
Platform: On-Premise, AWS, GCE, Azure
Processing: MR, Tensorflow, Spark, Flink, Hive, HBase, Presto, Kafka
How about this?
Resource Manager: Mesos, Kubernetes, YARN
Storage: HDFS, GCS, S3, WFS
Platform: On-Premise, AWS, GCE, Azure
Processing: MR, Tensorflow, Spark, Flink, Hive, HBase, Presto, Kafka
Here’s Hops Hadoop Distribution
Resource Manager: YARN
Storage: HDFS
Platform: On-Premise, GCE, AWS
Processing: Logstash, Tensorflow, Spark, Flink, Kafka, Hopsworks, Elasticsearch, Kibana, Grafana
Hadoop Distributions Simplify Things
Install / Upgrade: Cloudera Mgr, Karamel/Chef, Ambari
YARN
HDFS
On-Premise
MR, Tensorflow, Spark, Flink, Kafka
Cloud-Native means Object Stores, not HDFS
Object Stores and False Equivalence*
•Object Stores are inferior to hierarchical filesystems
- False equivalence in the tech media
•Object stores can scale better, but at what cost?
- Read-your-writes for existing objects
- Write, then list
- Hierarchical namespace properties
• Quotas, permissions
- Other eventual consistency problems
False Equivalence*: “unfairly equating a minor failing of Hillary Clinton’s to a
major failing of Donald Trump’s.”
Object Store
Hierarchical Filesystem
Eventual consistency in Object Stores
Implement your own eventually consistent metadata store to mirror the (unobservable) metadata
•Netflix for AWS*
- s3mper
•Spotify for GCS
- Tried porting s3mper, now own solution.
http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html
Can we open up the ObjectStore’s metadata?
Object Store metadata is a Pandora’s Box. Best keep it closed.
Eventual Consistency is Hard
r1 = read();
if (r1 == null) {
  // Loop until read completes
} else if (r1 == W1) {
  // do Y
} else if (r1 == W2) {
  // do Z
}
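The three-way branch on `r1` can be made concrete with a small retry loop. This is a minimal sketch, assuming a hypothetical `read` supplier that returns null while the object is not yet visible and otherwise returns either the value `W1` or the value `W2`; the class, enum, and method names are illustrative and are not part of S3 or Hops.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

public class EventualRead {
    /** What the caller learned after polling an eventually consistent store. */
    enum Outcome { NOT_VISIBLE, SAW_W1, SAW_W2 }

    /** Polls the (hypothetical) read until a value is visible or attempts run out. */
    static Outcome readWithRetry(Supplier<String> read, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String r1 = read.get();
            if (r1 == null) {
                continue;                  // not visible yet: loop and read again
            } else if (r1.equals("W1")) {
                return Outcome.SAW_W1;     // the "do Y" branch
            } else if (r1.equals("W2")) {
                return Outcome.SAW_W2;     // the "do Z" branch
            }
            throw new IllegalStateException("unexpected value: " + r1);
        }
        return Outcome.NOT_VISIBLE;        // caller must decide what to do next
    }

    public static void main(String[] args) {
        // Scripted replies stand in for a store that becomes consistent on the third read.
        Iterator<String> replies = Arrays.<String>asList(null, null, "W1").iterator();
        System.out.println(readWithRetry(replies::next, 5)); // prints SAW_W1
    }
}
```

Even in this toy form the caller has to handle three outcomes plus a timeout for a single read, which is the burden the slide points at.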
S3 guarantees
Prediction
NoSQL systems grew from the need to scale out relational databases.
NewSQL systems bring both scalability and strong consistency,
and companies are moving back to strong consistency.*
*http://www.cubrid.org/blog/dev-platform/spanner-globally-distributed-database-by-google/
Object store systems grew from the need to scale out filesystems.
A new generation of hierarchical filesystems will appear that brings both
scalability and hierarchy, and companies will eventually move back to
scalable hierarchical filesystems.
HopsFS and Hops YARN
16x
Performance
Externalizing the NameNode/YARN State
•Problem: Metadata is preventing Hadoop from
becoming the great platform we know she can be!
•Solution: Move the metadata off the JVM Heap
•Where?
To an in-memory database that can be
transactionally and efficiently queried and managed.
The database should be Open-Source. We chose
NDB – aka MySQL Cluster.
HopsFS – 1.2 million ops/sec (16X HDFS)
NDB Setup: Nodes using Xeon E5-2620 2.40GHz processors and 10GbE.
NameNodes: Machines using Xeon E5-2620 2.40GHz processors and 10GbE.
HopsFS Metadata Scaleout
Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
Hops YARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Heartbeats
(70-95%)
AM Reqs
(5-30%)
Extending Metadata in Hops
•JSON API (with and without schema)
public void attachMetadata(Json obj, String pathToFileorDir)
• Row added with jsonObj & foreign key to the inode (on delete cascade)
• Use to annotate files/directories for search in Elasticsearch
•Add Tables in the database
• Foreign keys to inode/applications ensure metadata integrity
• Transactions involving both the inode(s) and metadata ensure metadata consistency
• Enables modular extensions to Namenodes and ResourceManagers
id | json_obj | inode (fk)
1 | { trump: non} | 10101

inode_id | attrs
10101 | /tmp…
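The foreign-key behaviour described for extended metadata can be illustrated with an in-memory stand-in. This is a minimal sketch, not the Hops implementation: the class `ExtensibleMetadata`, its maps, and its method signatures are hypothetical, it keys metadata by inode id rather than by path, and it does not model NDB transactions. It only shows the two rules the slide names: a metadata row must reference an existing inode, and deleting the inode cascades to its metadata rows.

```java
import java.util.HashMap;
import java.util.Map;

public class ExtensibleMetadata {
    /** One row of the (hypothetical) metadata table: a JSON string plus a foreign key. */
    static final class MetaRow {
        final String json;
        final long inodeId;
        MetaRow(String json, long inodeId) { this.json = json; this.inodeId = inodeId; }
    }

    private final Map<Long, String> inodes = new HashMap<>();        // inode_id -> attrs
    private final Map<Integer, MetaRow> metadata = new HashMap<>();  // id -> row
    private int nextId = 1;

    void createInode(long inodeId, String attrs) { inodes.put(inodeId, attrs); }

    /** Foreign-key check: metadata may only reference an inode that exists. */
    int attachMetadata(String json, long inodeId) {
        if (!inodes.containsKey(inodeId)) {
            throw new IllegalArgumentException("no such inode: " + inodeId);
        }
        int id = nextId++;
        metadata.put(id, new MetaRow(json, inodeId));
        return id;
    }

    /** On delete cascade: removing the inode removes every row that references it. */
    void deleteInode(long inodeId) {
        inodes.remove(inodeId);
        metadata.values().removeIf(row -> row.inodeId == inodeId);
    }

    int metadataCount() { return metadata.size(); }

    public static void main(String[] args) {
        ExtensibleMetadata store = new ExtensibleMetadata();
        store.createInode(10101L, "/tmp/example");
        store.attachMetadata("{\"owner\": \"alice\"}", 10101L);
        System.out.println(store.metadataCount()); // prints 1
        store.deleteInode(10101L);
        System.out.println(store.metadataCount()); // prints 0
    }
}
```

In a real database both rules are enforced by the schema itself, so no application code can leave an annotation pointing at a deleted file.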
Strongly Consistent Metadata
Files
Directories
Containers
Provenance
Security
Quotas
Projects
Datasets
NDB
2-phase commit
Metadata Integrity maintained using
2PC and Foreign Keys.
Eventually Consistent Metadata with Epipe
Topics
Projects
Directories
Files
NDB
Kafka async. replication
Metadata Integrity maintained by
one-way Asynchronous Replication.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
Epipe
Elasticsearch
Maintaining Eventually Consistent Metadata
Topics
Partitions
ACLs
Zookeeper/Kafka
Database
Metadata integrity maintained by
custom recovery logic and polling.
Hopsworks
polling
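One way to picture the "custom recovery logic and polling" is a reconciliation pass that compares the topic names recorded in the database with those that exist in Kafka. This is a sketch under stated assumptions: the class `TopicReconciler` and both methods are hypothetical, the database is treated as the source of truth, and the two sets would in practice come from a database query and a Kafka/Zookeeper listing that are not shown here.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

public class TopicReconciler {
    /** Topics recorded in the database but absent from Kafka: candidates to (re)create. */
    static Set<String> missingInKafka(Set<String> dbTopics, Set<String> kafkaTopics) {
        Set<String> missing = new TreeSet<>(dbTopics);
        missing.removeAll(kafkaTopics);
        return missing;
    }

    /** Topics present in Kafka with no database row: candidates to delete. */
    static Set<String> orphanedInKafka(Set<String> dbTopics, Set<String> kafkaTopics) {
        Set<String> orphaned = new TreeSet<>(kafkaTopics);
        orphaned.removeAll(dbTopics);
        return orphaned;
    }

    public static void main(String[] args) {
        // Placeholder topic names; a polling loop would refresh both sets periodically.
        Set<String> db = new TreeSet<>(Arrays.asList("clicks", "orders"));
        Set<String> kafka = new TreeSet<>(Arrays.asList("orders", "stale"));
        System.out.println("create: " + missingInKafka(db, kafka));  // prints create: [clicks]
        System.out.println("delete: " + orphanedInKafka(db, kafka)); // prints delete: [stale]
    }
}
```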
Tinker-Friendly Metadata
•Erasure Coding in
HopsFS
•Free-Text search of
HopsFS namespace
- Export changelog for
NDB to Elasticsearch
•Dynamic User Roles in
HopsFS
•New abstractions:
Projects and
Datasets
Extensible, tinker-friendly metadata
is like 3-d printing for Hadoop
Leveraging Extensible Metadata in Hops
New Concepts in Hops Hadoop
Hops
•Projects
- Datasets
- Topics
- Users
- Jobs
- Notebooks
Hadoop
•Users
•Applications
•Jobs
•Files
•Clusters
•ACLs
•Sys Admins
•Kerberos
Enabled by Extensible Metadata
Notebooks
Simplifying Hadoop with Projects
App Logs Datasets
Kafka Topics
HDFS files/dir
Project
Users Access Control
Metadata in Hops
Metadata not in Hops
Simplifying Hadoop with Datasets
•Datasets are a directory
subtree in HopsFS
- Can be shared securely
between projects
- Indexed by Elasticsearch
•Datasets can be made
public and downloaded
from any Hopsworks
cluster anywhere in the
world
Hops Metadata Summarized
Elasticsearch
Kafka
Projects
Datasets
------------------
Files
Directories
Containers
Provenance
Security
Quotas
Hopsworks
REST API
Epipe
(DB Changelog)
The Distributed Database is the Single Source-of-Truth for Metadata
Eventual Consistency
mutation
2-Phase Commit
in NDB
Hopsworks
Project-Based Multi-Tenancy
•A project is a collection of
- Users with Roles
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•Per-Project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style Pricing
•Sharing across Projects
- Datasets/Topics
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS
Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users
Hopsworks – Dynamic Roles
Alice@gmail.com
NSA__Alice
Authenticate
Users__Alice
Hopsworks
HopsFS
HopsYARN
Projects
Secure
Impersonation
Kafka
X.509
Certificates
X.509 Certificates Everywhere
•User and service certs share the same self-signed root CA
•Project-Specific User Certificates
- Every user in every project is issued with an X.509 certificate,
containing the project-specific userID.
• Scales using intermediate CAs at each Hopsworks instance.
• Inspired by Netflix's BLESS system.
•Service Certificates
- Services use a host-specific certificate that is signed by the
root CA. Process managed by an agent program (kagent).
- Services identify SSL clients by extracting the CommonName
from client certificate in RPC calls. Kerberos keytab is gone.
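Extracting the CommonName from a certificate subject can be sketched with the JDK's `javax.naming.ldap.LdapName` parser. This is an illustration under stated assumptions, not the Hops code: the class `CertIdentity` and both methods are hypothetical, the subject string is a hand-written example, and the `project__user` split is inferred from the `NSA__Alice` naming shown on the Dynamic Roles slide. With a real `X509Certificate`, the subject string would come from `getSubjectX500Principal().getName()`.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public class CertIdentity {
    /** Returns the CN component of an RFC 2253 style subject DN. */
    static String commonName(String subjectDn) {
        try {
            for (Rdn rdn : new LdapName(subjectDn).getRdns()) {
                if ("CN".equalsIgnoreCase(rdn.getType())) {
                    return rdn.getValue().toString();
                }
            }
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("malformed DN: " + subjectDn, e);
        }
        throw new IllegalArgumentException("no CN in: " + subjectDn);
    }

    /** Splits an assumed "project__user" CommonName into { project, user }. */
    static String[] splitProjectUser(String cn) {
        int sep = cn.indexOf("__");
        if (sep <= 0 || sep + 2 >= cn.length()) {
            throw new IllegalArgumentException("not a project-specific user: " + cn);
        }
        return new String[] { cn.substring(0, sep), cn.substring(sep + 2) };
    }

    public static void main(String[] args) {
        // Example subject; the OU and O values are placeholders.
        String cn = commonName("CN=NSA__Alice,OU=Hopsworks,O=Example");
        String[] parts = splitProjectUser(cn);
        System.out.println(parts[0] + " / " + parts[1]); // prints NSA / Alice
    }
}
```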
X.509 Certificate Generation
Alice@gmail.com
Users don’t see the certificates.
Users authenticate using:
• LDAP,
• password,
• 2-Factor Authentication
Add/Del
Users
Distributed
Database
Insert/Remove Certs
Project Mgr
Root
CA
Kagent
HDFS
Spark
Kafka
YARN
Cert Signing
Requests
Intermediate
Certificate
Authority
Hopsworks
Simplifying Flink/Spark Streaming
Alice@gmail.com
1. Launch Spark Job
Distributed
Database
2. Get certs,
service endpoints
YARN Private
LocalResources
Spark/Flink Streaming App
4. Materialize certs
3. YARN Job, config
6. Get Schema
7. Consume
Produce
5. Read Certs
Hopsworks
KafkaUtil
8. Authenticate
Secure Kafka Producer Application
Developer
Operations
1. Discover: Schema Registry and Kafka Broker Endpoints
2. Create: Kafka Properties file with certs and broker details
3. Create: producer using Kafka Properties
4. Download: the Schema for the Topic from the Schema Registry
5. Distribute: X.509 certs to all hosts on the cluster
6. Cleanup securely
All of these steps are now done automatically by Hopsworks’ KafkaUtil library
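To make step 2 concrete, the properties a secure Kafka client needs can be assembled with nothing but `java.util.Properties`. This is a sketch of what a developer would otherwise write by hand, not of how KafkaUtil works internally: the class `SecureKafkaProps`, its `build` method, and every endpoint, path, and password below are hypothetical placeholders. The property keys themselves are standard Kafka client and Confluent serializer settings.

```java
import java.util.Properties;

public class SecureKafkaProps {
    /** Collects broker, schema registry, and SSL settings into one Properties object. */
    static Properties build(String brokers, String schemaRegistryUrl,
                            String keystorePath, String truststorePath, String password) {
        Properties p = new Properties();
        p.setProperty("bootstrap.servers", brokers);
        p.setProperty("schema.registry.url", schemaRegistryUrl);
        p.setProperty("security.protocol", "SSL");           // enables the ssl.* settings
        p.setProperty("ssl.keystore.location", keystorePath);
        p.setProperty("ssl.keystore.password", password);
        p.setProperty("ssl.key.password", password);
        p.setProperty("ssl.truststore.location", truststorePath);
        p.setProperty("ssl.truststore.password", password);
        return p;
    }

    public static void main(String[] args) {
        // All values are placeholders for illustration only.
        Properties p = build("broker.example.com:9091", "https://registry.example.com:8080",
                "/tmp/alice.keystore.jks", "/tmp/alice.truststore.jks", "changeit");
        System.out.println(p.getProperty("security.protocol")); // prints SSL
    }
}
```

Each of these eight values is something the developer would have to discover, distribute, and later clean up on every host, which is why the list above treats them as operations work.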
Hops simplifies Secure Flink/Kafka Producer
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Broker and schema registry endpoints must be known in advance
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
props.put("schema.registry.url", restApp.restConnect);
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
    org.apache.kafka.common.serialization.StringSerializer.class);
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
    io.confluent.kafka.serializers.KafkaAvroSerializer.class);
props.put(ProducerConfig.ACKS_CONFIG, "1");
// Hard-coded keystore path and passwords; SSL must be selected for them to apply
props.put("security.protocol", "SSL");
props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
props.put("ssl.keystore.password", "test1234");
props.put("ssl.key.password", "test1234");
// Avro schema for the record value, also hard-coded
String userSchema = "{\"namespace\": \"example.avro\", " +
    "\"type\": \"record\", \"name\": \"User\", " +
    "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
Schema.Parser parser = new Schema.Parser();
Schema schema = parser.parse(userSchema);
GenericRecord avroRecord = new GenericData.Record(schema);
avroRecord.put("name", "testUser");
// Send one Avro record to the topic
Producer<String, Object> producer = new KafkaProducer<>(props);
ProducerRecord<String, Object> message =
    new ProducerRecord<>("topicName", avroRecord);
producer.send(message);
producer.close();
Lots of Hard-Coded Endpoints Here!
StreamExecutionEnvironment env = …
FlinkProducer prod = KafkaUtil.getFlinkProducer(topicName);
DataStream<…> ms = env.addSource(…);
ms.addSink(prod);
env.execute("Producing");
Massively Simplified Code for Hops/Flink/Kafka
Zeppelin Support for Spark/Livy
Spark Jobs in YARN with Livy
[Image from: http://gethue.com]
Debugging Spark Jobs with Dr Elephant
•Project-specific view of performance and correctness
issues for completed Spark Jobs
•Customizable heuristics
•Doesn’t show killed jobs
Hops in Production at www.hops.site
SICS ICE: A datacenter research and test environment
Purpose: Increase knowledge, strengthen universities, companies and researchers
Karamel/Chef for Automated Installation
Google Compute Engine BareMetal EGI/Shibboleth
Demo
Summary
•Hops is the only European distribution of Hadoop
- More scalable, tinker-friendly, and open-source.
•Hopsworks provides first-class support for
Flink-/Spark-Kafka-as-a-Service
- Streaming or Batch Jobs
- Zeppelin Notebooks
•Hopsworks provides best-in-class support for secure
streaming applications with Kafka
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou,
Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis,
Antonios Kouzoupis, Ermias Gebremeskel.
Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn,
K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré,
Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro,
Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi,
Gayana Chandrasekara, Nikolaos Stanogias, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik,
Lalith Suresh, Mariano Valles, Ying Lieu.
Hops
[Hadoop For Humans]
http://github.com/hopshadoop
http://www.hops.io

Hopsworks at Google AI Huddle, SunnyvaleJim Dowling
 
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019
HopsML Meetup talk on Hopsworks + ROCm/AMD June 2019Jim Dowling
 

Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop

  • 1. Hopsworks – Self-Service Spark/Flink/Kafka/Hadoop Jim Dowling Associate Prof @ KTH Senior Researcher @ SICS Apache BigData Europe, 2016
  • 2. 2016-11-14 ApacheCon Europe BigData, Hopsworks, J Dowling, Nov 2016 2/58 Hadoop is not a cool kid anymore!
  • 3. Where did it go wrong for Hadoop?
      • Data Engineers/Scientists
          - Where is the User-Friendly tooling and Self-Service?
          - How do I install/operate anything other than a sandbox VM?
      • Operations Folks
          - Security model has become incomprehensible (Sentry/Ranger)
          - Major distributions not open enough for patching
          - Sensitive data still requires its own cluster
          - Why not just use AWS EMR/GCE/Databricks/etc.?
  • 5. Is this Hadoop? [Stack diagram] Platform: On-Premise, AWS, GCE, Azure. Storage: HDFS, GCS, S3, WFS. Resource Manager: Mesos, Kubernetes, YARN. Processing: MR, Tensorflow, Spark, Flink, Hive, HBase, Presto, Kafka.
  • 6. How about this? [Stack diagram] Platform: On-Premise, AWS, GCE, Azure. Storage: HDFS, GCS, S3, WFS. Resource Manager: Mesos, Kubernetes, YARN. Processing: MR, Tensorflow, Spark, Flink, Hive, HBase, Presto, Kafka.
  • 7. Here’s Hops Hadoop Distribution. [Stack diagram] Platform: On-Premise, GCE, AWS. Storage: HDFS. Resource Manager: YARN. Processing: Logstash, Tensorflow, Spark, Flink, Kafka, Hopsworks, Elasticsearch, Kibana, Grafana.
  • 8. Hadoop Distributions Simplify Things. [Stack diagram] Install/Upgrade: Cloudera Mgr, Karamel/Chef, Ambari. Stack: On-Premise; HDFS; YARN; MR, Tensorflow, Spark, Flink, Kafka.
  • 9. Cloud-Native means Object Stores, not HDFS
  • 10. Object Stores and False Equivalence*
      • Object stores are inferior to hierarchical filesystems
          - False equivalence in the tech media
      • Object stores can scale better, but at what cost?
          - Read-your-writes for existing objects
          - Write, then list
          - Hierarchical namespace properties
              • Quotas, permissions
          - Other eventual consistency problems
      *False equivalence: “unfairly equating a minor failing of Hillary Clinton’s to a major failing of Donald Trump’s.”
      [Figure labels: Object Store, Hierarchical Filesystem]
  • 11. Eventual consistency in Object Stores
      Implement your own eventually consistent metadata store to mirror the (unobservable) metadata
      • Netflix for AWS*
          - s3mper
      • Spotify for GCS
          - Tried porting s3mper, now uses its own solution.
      *http://techblog.netflix.com/2014/01/s3mper-consistency-in-cloud.html
  • 12. Can we open up the Object Store’s metadata? Object Store metadata is a Pandora’s Box. Best keep it closed.
  • 13. Eventual Consistency is Hard
          r1 = read();
          if (r1 == null) {
              // Loop until read completes
          } else if (r1 == W1) {
              // do Y
          } else if (r1 == W2) {
              // do Z
          }
      [Diagram label: S3 guarantees]
      [Slide footer: Ericsson Machine Learning Event]
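The branching on slide 13 can be sketched in Java. This is a minimal illustration, not code from the talk: the class `EventualRead`, the `Outcome` enum, and the bounded retry count are assumptions introduced here, and the store is modelled as a plain `Supplier<String>` whose result may be `null` (write not yet visible), `"W1"`, or `"W2"`.

```java
import java.util.function.Supplier;

public class EventualRead {
    /** The three cases a client must handle after a single read (hypothetical names). */
    public enum Outcome { NOT_VISIBLE, SAW_W1, SAW_W2 }

    /** Map a raw read result onto the cases from the slide. */
    public static Outcome classify(String r1) {
        if (r1 == null) {
            return Outcome.NOT_VISIBLE;   // the write has not become visible yet
        } else if (r1.equals("W1")) {
            return Outcome.SAW_W1;        // "do Y" on the slide
        } else if (r1.equals("W2")) {
            return Outcome.SAW_W2;        // "do Z" on the slide
        }
        throw new IllegalStateException("unexpected value: " + r1);
    }

    /** Re-read until some write is visible, giving up after maxAttempts reads. */
    public static Outcome readWithRetry(Supplier<String> read, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            Outcome outcome = classify(read.get());
            if (outcome != Outcome.NOT_VISIBLE) {
                return outcome;
            }
        }
        return Outcome.NOT_VISIBLE;       // caller decides what to do after giving up
    }
}
```

Even in this toy form, every caller has to carry all three branches plus a give-up path, which is the burden the slide points at.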
  • 14. Prediction
      NoSQL systems grew from the need to scale out relational databases. NewSQL systems bring both scalability and strong consistency, and companies are moving back to strong consistency.*
      Object store systems grew from the need to scale out filesystems. A new generation of hierarchical filesystems will appear that brings both scalability and hierarchy, and companies will eventually move back to scalable hierarchical filesystems.
      *http://www.cubrid.org/blog/dev-platform/spanner-globally-distributed-database-by-google/
  • 15. HopsFS and Hops YARN: 16x Performance
  • 16. Externalizing the NameNode/YARN State
      • Problem: Metadata is preventing Hadoop from becoming the great platform we know she can be!
      • Solution: Move the metadata off the JVM Heap
      • Where? To an in-memory database that can be transactionally and efficiently queried and managed. The database should be Open-Source. We chose NDB, aka MySQL Cluster.
  • 17. HopsFS – 1.2 million ops/sec (16X HDFS). NDB setup: nodes using Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: Xeon E5-2620 2.40GHz processor machines and 10GbE.
  • 18. HopsFS Metadata Scaleout. Assuming 256 MB block size and 100 GB JVM heap for Apache Hadoop.
  • 19. HopsFS Architecture. [Diagram labels: NameNodes, NDB, Leader, HDFS Client, DataNodes]
  • 20. Hops YARN Architecture. [Diagram labels: ResourceMgrs, NDB, Scheduler, YARN Client, NodeManagers, Resource Trackers, Heartbeats (70-95%), AM Reqs (5-30%)]
  • 21. Extending Metadata in Hops
      • JSON API (with and without schema): public void attachMetadata(Json obj, String pathToFileorDir)
          • Row added with jsonObj and a foreign key to the inode (on delete cascade)
          • Use to annotate files/directories for search in Elasticsearch
      • Add tables in the database
          • Foreign keys to inode/applications ensure metadata integrity
          • Transactions involving both the inode(s) and metadata ensure metadata consistency
          • Enables modular extensions to NameNodes and ResourceManagers
      Example rows:
          metadata table: id = 1, json_obj = { trump: non}, inode (fk) = 10101
          inode table: inode_id = 10101, attrs = /tmp…
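The foreign-key and cascade behaviour described on slide 21 can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not the Hops implementation: `MetadataStore`, `createInode`, `deleteInode`, `metadataFor`, and `metadataRowCount` are hypothetical names, the JSON object is passed as a plain `String`, two maps stand in for the inode and metadata tables, and `synchronized` methods stand in for database transactions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MetadataStore {
    // Stand-in for the inode table: path -> inode id.
    private final Map<String, Long> inodeIdByPath = new HashMap<>();
    // Stand-in for the metadata table: inode id (foreign key) -> JSON rows.
    private final Map<Long, List<String>> jsonByInode = new HashMap<>();
    private long nextInodeId = 10101L;

    public synchronized long createInode(String path) {
        if (inodeIdByPath.containsKey(path)) {
            throw new IllegalArgumentException("inode already exists: " + path);
        }
        long id = nextInodeId++;
        inodeIdByPath.put(path, id);
        return id;
    }

    /** Attach a JSON row; rejected if the path has no inode (foreign-key check). */
    public synchronized void attachMetadata(String jsonObj, String pathToFileOrDir) {
        Long inodeId = inodeIdByPath.get(pathToFileOrDir);
        if (inodeId == null) {
            throw new IllegalArgumentException("no inode for " + pathToFileOrDir);
        }
        jsonByInode.computeIfAbsent(inodeId, k -> new ArrayList<>()).add(jsonObj);
    }

    public synchronized List<String> metadataFor(String path) {
        Long inodeId = inodeIdByPath.get(path);
        if (inodeId == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(jsonByInode.getOrDefault(inodeId, Collections.emptyList()));
    }

    /** Deleting the inode also removes its metadata rows (on delete cascade). */
    public synchronized void deleteInode(String path) {
        Long inodeId = inodeIdByPath.remove(path);
        if (inodeId != null) {
            jsonByInode.remove(inodeId);
        }
    }

    public synchronized int metadataRowCount() {
        int rows = 0;
        for (List<String> list : jsonByInode.values()) {
            rows += list.size();
        }
        return rows;
    }
}
```

Because the inode and its metadata rows are changed inside the same critical section, no reader of this model can observe metadata that points at a deleted inode, which is the integrity property the slide attributes to foreign keys and transactions.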
  • 23. Eventually Consistent Metadata with Epipe. Metadata integrity maintained by one-way asynchronous replication. [ePipe Tutorial, BOSS Workshop, VLDB 2016] [Diagram labels: Topics, Projects, Directories, Files, NDB, Kafka, async. replication, Epipe, Elasticsearch]
  • 24. Maintaining Eventually Consistent Metadata. Metadata integrity maintained by custom recovery logic and polling. [Diagram labels: Topics, Partitions, ACLs, Zookeeper/Kafka, Database, Hopsworks, polling]
  • 25. Tinker-Friendly Metadata
      • Erasure Coding in HopsFS
      • Free-text search of the HopsFS namespace
          - Export changelog for NDB to Elasticsearch
      • Dynamic User Roles in HopsFS
      • New abstractions: Projects and Datasets
      Extensible, tinker-friendly metadata is like 3D printing for Hadoop
  • 26. Leveraging Extensible Metadata in Hops
  • 27. New Concepts in Hops Hadoop (enabled by extensible metadata)
      • Hops: Projects (Datasets, Topics, Users, Jobs, Notebooks)
      • Hadoop: Users, Applications, Jobs, Files, Clusters, ACLs, Sys Admins, Kerberos
  • 28. Simplifying Hadoop with Projects. [Diagram labels: Project, Users, Access Control, Notebooks, App Logs, Datasets, Kafka Topics, HDFS files/dir; legend: Metadata in Hops, Metadata not in Hops]
  • 29. Simplifying Hadoop with Datasets
      • Datasets are a directory subtree in HopsFS
          - Can be shared securely between projects
          - Indexed by Elasticsearch
      • Datasets can be made public and downloaded from any Hopsworks cluster anywhere in the world
  • 30. Hops Metadata Summarized. The distributed database is the single source of truth for metadata. [Diagram labels: Hopsworks REST API; mutation; 2-Phase Commit in NDB; Projects, Datasets, Files, Directories, Containers, Provenance, Security, Quotas; Epipe (DB Changelog); Eventual Consistency; Elasticsearch; Kafka]
  • 31. Hopsworks
  • 32. Project-Based Multi-Tenancy
      • A project is a collection of
          - Users with Roles
          - HDFS DataSets
          - Kafka Topics
          - Notebooks, Jobs
      • Per-Project quotas
          - Storage in HDFS
          - CPU in YARN
              • Uber-style Pricing
      • Sharing across Projects
          - Datasets/Topics
      [Diagram labels: project, dataset 1, dataset N, Topic 1, Topic N, Kafka, HDFS]
  • 33. Project Roles
      • Data Owner privileges
          - Import/Export data
          - Manage Membership
          - Share DataSets, Topics
      • Data Scientist privileges
          - Write and Run code
      We delegate administration of privileges to users
  • 34. Hopsworks – Dynamic Roles. [Diagram labels: Alice@gmail.com, Authenticate, Hopsworks, Projects, NSA__Alice, Users__Alice, Secure Impersonation, HopsFS, HopsYARN, Kafka, X.509 Certificates]
  • 35. X.509 Certificates Everywhere
      • User and service certs share the same self-signed root CA
      • Project-Specific User Certificates
          - Every user in every project is issued with an X.509 certificate, containing the project-specific userID.
              • Scales using intermediate CAs at each Hopsworks instance.
              • Inspired by Netflix’s BLESS system.
      • Service Certificates
          - Services use a host-specific certificate that is signed by the root CA. Process managed by an agent program (kagent).
          - Services identify SSL clients by extracting the CommonName from the client certificate in RPC calls. Kerberos keytab is gone.
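Extracting the CommonName, as described on slide 35, can be sketched with the JDK's `javax.naming.ldap` classes. This is an illustration only, not Hops code: `CertIdentity` and its methods are hypothetical, the example DN is made up, and treating the double underscore seen in `NSA__Alice` on slide 34 as a project/user separator is an assumption. In a real service the DN string would typically come from `X509Certificate.getSubjectX500Principal().getName()`.

```java
import javax.naming.InvalidNameException;
import javax.naming.ldap.LdapName;
import javax.naming.ldap.Rdn;

public class CertIdentity {
    /** Return the CN attribute of a certificate subject DN such as "CN=NSA__Alice,O=Example". */
    public static String commonName(String subjectDn) {
        try {
            for (Rdn rdn : new LdapName(subjectDn).getRdns()) {
                if ("CN".equalsIgnoreCase(rdn.getType())) {
                    return rdn.getValue().toString();
                }
            }
        } catch (InvalidNameException e) {
            throw new IllegalArgumentException("malformed DN: " + subjectDn, e);
        }
        throw new IllegalArgumentException("no CN in DN: " + subjectDn);
    }

    /** Split an assumed "project__user" CommonName into {project, user}. */
    public static String[] splitProjectUser(String cn) {
        int sep = cn.indexOf("__");
        if (sep <= 0 || sep + 2 >= cn.length()) {
            throw new IllegalArgumentException("not a project__user name: " + cn);
        }
        return new String[] { cn.substring(0, sep), cn.substring(sep + 2) };
    }
}
```

A service could then authorize the RPC against the project and user recovered from the certificate, with no separate Kerberos principal involved.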
  • 36. X.509 Certificate Generation. Users don’t see the certificates. Users authenticate using: LDAP, password, 2-Factor Authentication. [Diagram labels: Alice@gmail.com, Add/Del Users, Project Mgr, Insert/Remove Certs, Distributed Database, Root CA, Intermediate Certificate Authority, Cert Signing Requests, Hopsworks, Kagent, HDFS, Spark, Kafka, YARN]
  • 37. Simplifying Flink/Spark Streaming
      1. Launch Spark Job
      2. Get certs, service endpoints
      3. YARN Job, config
      4. Materialize certs
      5. Read Certs
      6. Get Schema
      7. Consume/Produce
      8. Authenticate
      [Diagram labels: Alice@gmail.com, Hopsworks, Distributed Database, YARN Private LocalResources, Spark/Flink Streaming App, KafkaUtil]
  • 38. Secure Kafka Producer Application (Developer, Operations)
      1. Discover: Schema Registry and Kafka Broker Endpoints
      2. Create: Kafka Properties file with certs and broker details
      3. Create: producer using Kafka Properties
      4. Download: the Schema for the Topic from the Schema Registry
      5. Distribute: X.509 certs to all hosts on the cluster
      6. Cleanup securely
      All of these steps are now done automatically by Hopsworks’ KafkaUtil library
  • 39. Hops simplifies Secure Flink/Kafka Producer
      Lots of Hard-Coded Endpoints Here!
          Properties props = new Properties();
          props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, brokerList);
          props.put(SCHEMA_REGISTRY_URL, restApp.restConnect);
          props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, org.apache.kafka.common.serialization.StringSerializer.class);
          props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, io.confluent.kafka.serializers.KafkaAvroSerializer.class);
          props.put("producer.type", "sync");
          props.put("serializer.class", "kafka.serializer.StringEncoder");
          props.put("request.required.acks", "1");
          props.put("ssl.keystore.location", "/var/ssl/kafka.client.keystore.jks");
          props.put("ssl.keystore.password", "test1234");
          props.put("ssl.key.password", "test1234");
          String userSchema = "{\"namespace\": \"example.avro\", \"type\": \"record\", \"name\": \"User\","
              + "\"fields\": [{\"name\": \"name\", \"type\": \"string\"}]}";
          Schema.Parser parser = new Schema.Parser();
          Schema schema = parser.parse(userSchema);
          GenericRecord avroRecord = new GenericData.Record(schema);
          avroRecord.put("name", "testUser");
          // Producer key/value types must match the ProducerRecord sent below.
          KafkaProducer<String, Object> producer = new KafkaProducer<>(props);
          ProducerRecord<String, Object> message = new ProducerRecord<>("topicName", avroRecord);
          producer.send(message);
      Massively Simplified Code for Hops/Flink/Kafka
          StreamExecutionEnvironment env = …
          FlinkProducer prod = KafkaUtil.getFlinkProducer(topicName);
          DataStream<…> ms = env.addSource(…);
          ms.addSink(prod);
          env.execute("Producing");
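One way to avoid the hard-coded endpoints and passwords on slide 39 is to collect them in a single helper that receives them as parameters. The sketch below is an assumption-laden illustration, not the KafkaUtil implementation: `SecureKafkaProps` and `build` are hypothetical names, and a single shared password is used for brevity. The property keys are standard Kafka client and Confluent serializer configuration names, given as plain strings so the sketch compiles without Kafka on the classpath.

```java
import java.util.Properties;

public class SecureKafkaProps {
    /** Assemble SSL client properties from discovered endpoints and materialized certs. */
    public static Properties build(String brokerList, String schemaRegistryUrl,
                                   String keystorePath, String truststorePath,
                                   String password) {
        Properties props = new Properties();
        // Endpoints are passed in (for example, discovered at job launch) instead of hard-coded.
        props.setProperty("bootstrap.servers", brokerList);
        props.setProperty("schema.registry.url", schemaRegistryUrl);
        // Serializer class names given as strings.
        props.setProperty("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.setProperty("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // TLS settings pointing at the certificates materialized for this job.
        props.setProperty("security.protocol", "SSL");
        props.setProperty("ssl.keystore.location", keystorePath);
        props.setProperty("ssl.keystore.password", password);
        props.setProperty("ssl.key.password", password);
        props.setProperty("ssl.truststore.location", truststorePath);
        props.setProperty("ssl.truststore.password", password);
        return props;
    }
}
```

The returned `Properties` object can be handed to a Kafka producer or consumer constructor, which keeps application code free of cluster-specific values.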
  • 40. Zeppelin Support for Spark/Livy
  • 41. Spark Jobs in YARN with Livy [Image from: http://gethue.com]
  • 42. Debugging Spark Jobs with Dr Elephant
      • Project-specific view of performance and correctness issues for completed Spark Jobs
      • Customizable heuristics
      • Doesn’t show killed jobs
  • 43. Hops in Production at www.hops.site. SICS ICE: A datacenter research and test environment. Purpose: Increase knowledge, strengthen universities, companies and researchers
  • 44. Karamel/Chef for Automated Installation. [Diagram labels: Google Compute Engine, BareMetal, EGI/Shibboleth]
  • 46. Summary
      • Hops is the only European distribution of Hadoop
          - More scalable, tinker-friendly, and open-source.
      • Hopsworks provides first-class support for Flink-/Spark-Kafka-as-a-Service
          - Streaming or Batch Jobs
          - Zeppelin Notebooks
      • Hopsworks provides best-in-class support for secure streaming applications with Kafka
  • 47. Hops Team Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Antonios Kouzoupis, Ermias Gebremeskel. Alumni: Vasileios Giannokostas, Johan Svedlund Nordström, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, Misganu Dessalegn, K “Sri” Srijeyanthan, Jude D’Souza, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Steffen Grohsschmiedt, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.