Introduction to
NetGuardians’
Big Data
Software Stack
Jerome Kehrli, Head of R&D
Geneva, September 2017
Agenda
• Introducing NetGuardians
• Software Stack
• Typical Architecture
• NetGuardians’ Use Cases
• ElasticSearch / Spark / Mesos
Constraints and Behaviour
About NetGuardians
• Top Fintech Europe Company
• Behavioural analysis based on risk
models combining human actions
relative to channels, technical
layers and transactions.
• Stay on top of new regulatory
needs and anti-fraud patterns
using profiling and analytics
• Our intelligence updates
automatically deliver new controls
E-BANKING
IT layers
Transactions
Channels
The Problem
70% is internal
Fraud costs the world
$3 trillion per year
Certified Fraud Examiners,
Report to the Nations, 2014
$6 trillion
Projected cyber crime
cost by 2021
Cyber Security Ventures, 2016
It takes 18 months on average
to detect fraud.
Most remains undetected.
Certified Fraud Examiners, Report to the
Nations, 2014
$2.5 billion
The fine one single bank was slapped with due to inadequate
internal controls and a slow documentation process
Bloomberg, April 2015
All the caps you need
One single platform
Unique solution made for banks
References
Retail banking
Private banking
Scalable Big Data Technology
Analytics Platform
Software Stack
Mesos is a distributed systems kernel.
Runs on every machine and provides applications (…) with
API’s for resource management and scheduling across
entire datacenter and cloud environments.
Apache Spark is a fast and general engine for large-scale
data processing.
Provides programmers with an API functioning as a working
set for distributed programs that offers a versatile form of
distributed shared memory.
ElasticSearch is a distributed, real-time, RESTful search and
analytics document-oriented storage engine.
Lets one perform and combine many types of searches -
structured, unstructured, geo, metric - in real time.
Apache Mesos (V1.3 = July 2017, V1.0 = July 2016)
Apache Spark (V2.2 = July 2017, V1.0 = May 2014)
ElasticSearch (V6.0b = July 2017, V1.0 = February 2014)
ES-Hadoop : connect the massive data storage and deep
processing power of Hadoop with the real-time search and
analytics of Elasticsearch.
Interestingly, Spark can perfectly well use ES-Hadoop to load data from
or store data to ElasticSearch outside of a Hadoop stack.
The Spark connector from the ES-Hadoop library has no
dependency on a Hadoop stack whatsoever.
ES-Hadoop
ES
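As an illustration of the point above, here is a minimal sketch of reading and writing an ES index from pyspark through the connector alone. It assumes the elasticsearch-spark connector jar is on the Spark classpath; the helper names, host and index are hypothetical, while `es.nodes`, `es.port` and the `org.elasticsearch.spark.sql` format name are the connector's documented settings.

```python
# Sketch only: `spark` is an existing SparkSession created elsewhere;
# no HDFS or YARN component is involved anywhere in this code.

def es_options(es_host, es_port=9200):
    """Build the connector options pointing at an ES node (hypothetical host)."""
    return {
        "es.nodes": es_host,      # ES node(s) used for cluster discovery
        "es.port": str(es_port),  # ES REST port
    }


def load_index(spark, index, es_host, es_port=9200):
    """Return a DataFrame backed by the given ES index."""
    reader = spark.read.format("org.elasticsearch.spark.sql")
    return reader.options(**es_options(es_host, es_port)).load(index)


def save_index(df, index, es_host, es_port=9200):
    """Append the rows of a DataFrame to the given ES index."""
    writer = df.write.format("org.elasticsearch.spark.sql")
    writer.options(**es_options(es_host, es_port)).mode("append").save(index)
```

A call such as `load_index(spark, "transactions", "es-node-1")` would then return a DataFrame whose partitions map to the shards of the index.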
ELK-MS
Architecture
ELK-MS - Technical Architecture
ELK-MS - System Architecture
ELK-MS - Typical Application Architecture
NetGuardians
Use Cases
Analytics approach
Pattern Based Intelligence
• Fundamentally rule based
• Implemented as pyspark scripts
• Custom approach (no framework)
Profiling
• Statistical Model
• Natively implemented using both
ES and spark statistics functions
• Custom approach (no framework)
Machine Learning
• Advanced algorithms
• Prototyped using Python scikit-learn
• Industrialized using Spark MLlib
Typical Data Flow
Data-locality optimization is not optional for us!
ES / Spark / Mesos
Constraints and
behaviour
ES-Hadoop and Data Locality
Data-locality enforcement works well.
• ES-Hadoop makes Spark understand the
topology of the shards on ES
• Mesos / Spark respects locality requirements,
creates as many partitions as shards.
It works only under nominal conditions.
Several factors compromise data-locality:
→ Spark waits only for spark.locality.wait=10s when trying to get the
processing executed on the Spark node co-located with an ES shard
← If ES on the co-located node is busy, ES can decide
to answer from another node
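The first factor can be pictured with a toy model of the locality wait. This is a deliberately simplified sketch, not Spark's actual scheduler: the function name, the node names and the single-task view are assumptions made for illustration.

```python
def choose_node(preferred, free_at, locality_wait=10.0):
    """Pick the node that runs one task under a simplified locality-wait model.

    preferred     -- node co-located with the ES shard holding the data
    free_at       -- dict: node name -> seconds until that node has a free slot
    locality_wait -- how long the scheduler holds the task for the preferred node
    Returns (node, start_delay_seconds, is_local).
    """
    others = {n: t for n, t in free_at.items() if n != preferred}
    # The preferred node frees up in time (or there is no alternative): stay local.
    if free_at[preferred] <= locality_wait or not others:
        return preferred, free_at[preferred], True
    # Otherwise the task is given to the earliest other node once the wait expires.
    node = min(others, key=others.get)
    return node, max(locality_wait, others[node]), False
```

With a preferred node busy for 30 seconds and an idle neighbour, the task leaves the co-located node after the 10 second wait and data locality is lost for that partition.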
Mesos / Spark Scheduling Mode
In Coarse Grained scheduling mode, Mesos only
knows about Spark executor processes.
• Mesos books as many cluster resources as
possible to allocate Spark executors for a job.
Historically, Spark on Mesos could use Fine
Grained scheduling mode, where Mesos
schedules each and every individual Spark task.
• Kills performance!
• Deprecated:
https://issues.apache.org/jira/browse/SPARK-11857
Spark Static Resource Allocation vs. Dynamic Allocation (1/2)
Static Resource Allocation
• Mesos / Spark decides allocated resources at
job init time
• Allocated resources are kept until the job
completes
• 2 noteworthy consequences :
1. By default, every single job running
alone gets the whole cluster.
A following job would need to wait.
2. Several jobs arriving together would get
the cluster fairly shared.
If only one job is long-lived, that job
would still need to complete its
execution on its small portion.
Spark Static Resource Allocation vs. Dynamic Allocation (2/2)
Dynamic Allocation
• Designed as a solution to the previous problems
• But … Spark's Dynamic Allocation messes up
data locality optimization completely.
• ES-Hadoop makes Spark request as many
executors as shards and indicates
the nodes owning the ES shards as preferred
locations.
• Dynamic allocation bypasses this
completely and screws data-locality
optimization
Dynamic Allocation
• Designed as a solution to the previous problems
• Works out of the box
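The two allocation modes discussed above differ only by a few Spark properties. The sketch below lists them as plain dictionaries and renders them as `spark-submit` arguments; the property names are standard Spark settings, while the cap of 8 cores and the helper name are assumptions for illustration.

```python
# Static allocation: resources are fixed at job start and capped explicitly.
STATIC_ALLOCATION = {
    "spark.mesos.coarse": "true",
    "spark.dynamicAllocation.enabled": "false",
    "spark.cores.max": "8",  # hypothetical cap, avoids grabbing the whole cluster
}

# Dynamic allocation: executors are requested and released while the job runs.
# On Mesos this needs coarse-grained mode and the external shuffle service.
DYNAMIC_ALLOCATION = {
    "spark.mesos.coarse": "true",
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
}


def to_submit_args(conf):
    """Render a configuration dict as a list of `--conf key=value` arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", "{}={}".format(key, conf[key])]
    return args
```

For example, `to_submit_args(STATIC_ALLOCATION)` yields the argument list to append to a `spark-submit` command line.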
Other concerns
• Python latency
• Java and Scala jobs run natively in the Spark JVM.
• Pyspark launches “some tasks” in a process separate from the Spark JVM.
• DataFrame or RDD methods exposed to Python scripts are actually implemented in native
Scala underneath.
• One notable exception: UDFs (User Defined Functions) implemented in Python!
• One can very well still use pyspark but write UDFs in Scala.
• Repartitioning
• Redistributing a dataset across the cluster is hard to achieve … and not necessarily
desirable.
• Advanced ES queries
• The ES-Hadoop connector can only submit “simple” requests to ES, now including filtering
• Advanced features such as aggregation queries cannot be used
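To make the "simple requests with filtering" case concrete, here is a sketch of a filter passed to ES through the connector's `es.query` setting. The field name and date bounds are hypothetical; the query body is an ordinary ES range query.

```python
import json


def range_filter_query(field, gte, lt):
    """Return an ES query DSL string selecting documents with gte <= field < lt."""
    query = {"query": {"range": {field: {"gte": gte, "lt": lt}}}}
    return json.dumps(query, sort_keys=True)


def filtered_read_options(field, gte, lt):
    """Connector options that restrict what ES sends back to Spark."""
    return {"es.query": range_filter_query(field, gte, lt)}
```

These options can be passed to the DataFrame reader so that only matching documents leave ES. An aggregation, by contrast, would have to be computed on the Spark side after loading.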
ES / Spark / Mesos
Why is it cool ?
Why cool ? (1/5)
Spark’s API is brilliant for our use cases (NetGuardians)
Pattern Based Intelligence
• Implementing our rules in pyspark
is straightforward
• We are now considering DRESS on
spark streaming
Profiling
• Out of the box with Spark’s
statistics functions
• Here as well we consider spark
streaming for event scoring
Machine Learning
• We prototype with Python scikit-learn
• Implementation on spark is easy
with Spark MLlib
Why cool ? (2/5)
What do we want ?
Initial situation
Working with a small
subset of the data
Working with a full
month of data
Working with the
whole dataset
Why cool ? (3/5)
Processing Distribution scaling linearly with Data Distribution
Works Out of the box with
• Dynamic Allocation in Spark + Mesos
• ES-Hadoop / ES-Spark connector data locality optimization
Why cool ? (4/5)
Processing Distribution scaling linearly with Data Distribution
ES / Spark / Mesos provide the basic building blocks to distribute
and scale the processing exactly how we want
• ES-Hadoop : Data locality optimization
• Mesos / Spark : spark.cores.max=X configuration
• ElasticSearch : search_shards API
Golden Rule : use spark.cores.max = Nbr Shards
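The golden rule can be sketched as follows: ask ES for the shards of the index through the `_search_shards` API, count them, and use that count as `spark.cores.max`. The function names and the ES URL are assumptions; the sketch relies on the API returning a `shards` list with one entry per shard, each entry listing the copies of that shard.

```python
import json
from urllib.request import urlopen


def fetch_search_shards(es_url, index):
    """Call GET <es_url>/<index>/_search_shards and return the parsed JSON."""
    url = "{}/{}/_search_shards".format(es_url.rstrip("/"), index)
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


def count_shards(search_shards_response):
    """Count shards: one group per shard, whatever its number of replicas."""
    return len(search_shards_response["shards"])


def cores_max_conf(search_shards_response):
    """Apply the rule spark.cores.max = number of shards."""
    return {"spark.cores.max": str(count_shards(search_shards_response))}
```

With an index split into three shards, `cores_max_conf(fetch_search_shards("http://es-node-1:9200", "transactions"))` would give `{"spark.cores.max": "3"}`, so that each shard gets one core.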
Why cool ? (5/5)
“One ring to rule them all ...”
• ES, Spark and Mesos are
designed to run on large clusters
• But they also work very well
on one single fat machine with
tons of CPUs and RAM
• We deploy the same platform in
tier 1 banks and small banks.
THANK YOU!
NetGuardians SA Headquarters
Rue Galilée 6
1400 Yverdon-les-Bains
Switzerland
Tel: +41 24 425 97 60
Email: info@netguardians.ch
www.netguardians.ch
Linkedin.com/company/netguardians
Facebook.com/NetGuardians
@netguardians

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 

Último (20)

Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Introduction to NetGuardians' Big Data Software Stack

  • 1. Introduction to NetGuardians’ Big Data Software Stack Jerome Kehrli, Head of R&D Geneva, September 2017
  • 2. Agenda • Introducing NetGuardians • Software Stack • Typical Architecture • NetGuardians’ Use Cases • ElasticSearch / Spark / Mesos Constraints and Behaviour
  • 3. About NetGuardians • Top Fintech Europe Company • Behavioural analysis based on risk models combining human actions relative to channels, technical layers and transactions. • Stay on top of new regulatory needs and anti-fraud patterns using profiling and analytics • Our intelligence updates automatically deliver new controls [Diagram: Channels (e-banking), IT layers, Transactions]
  • 4. The Problem • 70% is internal • Fraud costs the world $3 trillion per year (Certified Fraud Examiners, Report to the Nations, 2014) • $6 trillion: projected cyber crime cost by 2021 (Cyber Security Ventures, 2016) • It takes 18 months on average to detect fraud. Most remains undetected. (Certified Fraud Examiners, Report to the Nations, 2014) • $2.5 billion: the fine one single bank was slapped with due to inadequate internal controls and slow documentation process (Bloomberg, April 2015)
  • 5. All the caps you need One single platform Unique solution made for banks
  • 6. All the caps you need One single platform Unique solution made for banks
  • 8. Scalable Big Data Technology
  • 10. Mesos is a distributed systems kernel. It runs on every machine and provides applications (…) with APIs for resource management and scheduling across entire datacenter and cloud environments. Apache Spark is a fast and general engine for large-scale data processing. It provides programmers with an API functioning as a working set for distributed programs that offers a versatile form of distributed shared memory. ElasticSearch is a distributed, real-time, RESTful search and analytics document-oriented storage engine. It lets one perform and combine many types of searches (structured, unstructured, geo, metric) in real time. Versions: Apache Mesos (V1.3 = July 2017, V1.0 = July 2016) • Apache Spark (V2.2 = July 2017, V1.0 = May 2014) • ElasticSearch (V6.0b = July 2017, V1.0 = February 2014)
  • 11. ES-Hadoop connects the massive data storage and deep processing power of Hadoop with the real-time search and analytics of Elasticsearch. Interestingly, Spark can use ES-Hadoop to load data from or store data to ElasticSearch outside of a Hadoop stack: the Spark connector from the ES-Hadoop library has no dependency on a Hadoop stack whatsoever.
  • 13. ELK-MS - Technical Architecture
  • 14. ELK-MS - System Architecture
  • 15. ELK-MS - Typical Application Architecture
  • 17. Analytics approach Pattern Based Intelligence • Fundamentally rule based • Implemented as PySpark scripts • Custom approach (no framework) Profiling • Statistical model • Natively implemented using both ES and Spark statistics functions • Custom approach (no framework) Machine Learning • Advanced algorithms • Prototyped using Python scikit-learn • Industrialized using Spark MLlib
  • 18. Typical Data Flow Data-locality optimization is not optional for us !
  • 19. ES / Spark / Mesos Constraints and behaviour
  • 20. ES-Hadoop and Data Locality Data-locality enforcement works well. • ES-Hadoop makes Spark understand the topology of the shards on ES • Mesos / Spark respects locality requirements and creates as many partitions as shards. However, it works only under nominal conditions. Several factors compromise data locality: • Spark waits only for spark.locality.wait=10s while trying to get the processing executed on the Spark node co-located with an ES shard • If ES on the co-located node is busy, ES can decide to answer from another node
  • 21. Mesos / Spark Scheduling Mode In Coarse Grained scheduling mode, Mesos only knows Spark executor processes. • Mesos books as many cluster resources as possible to allocate Spark executors for a job. Historically, Spark on Mesos could also use Fine Grained scheduling mode, where Mesos schedules each and every individual Spark task. • Kills performance! • Deprecated: https://issues.apache.org/jira/browse/SPARK-11857
  • 22. Spark Static Resource Allocation vs. Dynamic Allocation (1/2) Static Resource Allocation • Mesos / Spark decides allocated resources at job init time • Allocated resources are kept until the job completes • 2 noteworthy consequences: 1. By default, every single job running alone gets the whole cluster. A following job would need to wait. 2. Several jobs arriving together would get the cluster fairly shared. If only one job is long-lived, that job would still need to complete its execution on its small portion.
  • 23. Spark Static Resource Allocation vs. Dynamic Allocation (2/2) Dynamic Allocation • Designed as a solution to the previous problems • Works out of the box • But … Spark's Dynamic Allocation messes up data-locality optimization completely. • ES-Hadoop makes Spark request as many executors as shards and indicates the nodes owning the ES shards as preferred locations. • Dynamic allocation bypasses this completely and defeats data-locality optimization
  • 24. Other concerns • Python latency • Java and Scala jobs run natively in the Spark JVM. • PySpark launches “some tasks” in a process separate from the Spark JVM. • DataFrame or RDD methods exposed to Python scripts are actually implemented in native Scala underneath. • One noticeable exception: UDFs (User Defined Functions) implemented in Python! • One can very well still use PySpark but write UDFs in Scala. • Repartitioning • Redistributing a dataset across the cluster is hard to achieve … and not necessarily desirable. • Advanced ES queries • The ES-Hadoop connector can only submit “simple” requests to ES, with filtering (now) • Advanced features such as aggregation queries cannot be used
  • 25. ES / Spark / Mesos Why is it cool ?
  • 26. Why cool ? (1/5) Spark’s API is brilliant for our use cases (NetGuardians) Pattern Based Intelligence • Implementing our rules in PySpark is straightforward • We are now considering DRESS on Spark Streaming Profiling • Out of the box with Spark’s statistics functions • Here as well we consider Spark Streaming for event scoring Machine Learning • We prototype with Python scikit-learn • Implementation on Spark is easy with Spark MLlib
  • 27. Why cool ? (2/5) What do we want ? Initial situation
  • 28. Why cool ? (2/5) What do we want ? Working with a small subset of the data
  • 29. Why cool ? (2/5) What do we want ? Working with a full month of data
  • 30. Why cool ? (2/5) What do we want ? Working with the whole dataset
  • 31. Why cool ? (3/5) Processing Distribution scaling linearly with Data Distribution Works Out of the box with • Dynamic Allocation in Spark + Mesos • ES-Hadoop / ES-Spark connector data locality optimization
  • 32. Why cool ? (4/5) Processing Distribution scaling linearly with Data Distribution ES / Spark / Mesos provide the basic building blocks to distribute and scale the processing exactly how we want • ES-Hadoop : Data locality optimization • Mesos / Spark : spark.cores.max=X configuration • ElasticSearch : search_shards API Golden Rule : use spark.cores.max = number of shards
  • 33. Why cool ? (5/5) “One ring to rule them all ...” • ES, Spark and Mesos are designed to run on large clusters • But they also work very well on one single fat machine with tons of CPUs and RAM • We deploy the same platform in tier 1 banks and small banks.
  • 34. THANK YOU! NetGuardians SA Headquarters Rue Galilée 6 1400 Yverdon-les-Bains Switzerland Tel: +41 24 425 97 60 Email: info@netguardians.ch www.netguardians.ch Linkedin.com/company/netguardians Facebook.com/NetGuardians @netguardians
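
The golden rule from slide 32 (spark.cores.max = number of shards) can be illustrated with a short Python sketch. This is not NetGuardians' implementation: the helper names `count_shards` and `spark_conf_for_index` and the sample data are hypothetical. It assumes the response of ElasticSearch's `_search_shards` API has already been fetched and parsed into a dictionary whose "shards" entry holds one list of copies (primary and replicas) per shard, and that each Spark task uses one core. The 10s locality wait is simply the value quoted on slide 20.

```python
from typing import Dict


def count_shards(search_shards_response: dict) -> int:
    """Count distinct shards: each entry of "shards" lists the copies of one shard."""
    return len(search_shards_response.get("shards", []))


def spark_conf_for_index(search_shards_response: dict,
                         locality_wait: str = "10s") -> Dict[str, str]:
    """Build Spark settings following the rule spark.cores.max = number of shards.

    Assumption: one core per task, so one core per ES shard / Spark partition.
    """
    shards = count_shards(search_shards_response)
    if shards == 0:
        raise ValueError("no shards found in the _search_shards response")
    return {
        "spark.cores.max": str(shards),
        "spark.locality.wait": locality_wait,
    }


# Hypothetical, trimmed-down _search_shards response: 3 shards, each with
# a primary and a replica copy hosted on different nodes.
SAMPLE_RESPONSE = {
    "shards": [
        [{"index": "events", "shard": 0, "primary": True, "node": "node-a"},
         {"index": "events", "shard": 0, "primary": False, "node": "node-b"}],
        [{"index": "events", "shard": 1, "primary": True, "node": "node-b"},
         {"index": "events", "shard": 1, "primary": False, "node": "node-c"}],
        [{"index": "events", "shard": 2, "primary": True, "node": "node-c"},
         {"index": "events", "shard": 2, "primary": False, "node": "node-a"}],
    ]
}

if __name__ == "__main__":
    print(spark_conf_for_index(SAMPLE_RESPONSE))
```

In a PySpark job, the resulting dictionary could be applied with something like `SparkConf().setAll(conf.items())` before the session is created. Whether to combine it with static or dynamic allocation depends on the trade-offs discussed on slides 22 and 23.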

Editor's Notes

  1. Disclaimer: I will not give an introduction to Big Data; I assume the audience is familiar with it (moving processing to the data nodes, distribution in the form of partitioning and replication, etc.). I will focus on the specifics of the technology stack at NetGuardians.
  2. Our intelligent software platform will give you a greater capability to detect the emerging insider fraud & risk threats, delivering ROI straight away.
  3. OK
  4. !! Prepare the explanation !!!
  5. !! Prepare the explanation !!!
  6. !! Prepare the explanation !!!
  7. !! Prepare the explanation !!!
  8. 27 -> 20 !!!! NetGuardians presentation “a bit” shorter. For more detail, go see our website !! Vocabulary: Big Data = distribution = partitioning (sharding) and replication ! The 3 very text-heavy slides !!! “You will get the slides ….” Just prepare a summary …..