This document provides an overview of modern software architecture models and concepts. It begins with an introduction to software architecture and its definitions, then presents the Kruchten 4+1 view model for describing an architecture through multiple views. Additional topics include the OCTO matrix approach, example architecture diagrams for a sample application called RIA Organizer, and modern architectures such as big data, microservices and serverless computing.
Agenda
Part I – Software Architecture Models
1.1 Introduction to Software Architecture
1.2 Our illustration example
1.3 The Kruchten 4+1 View Model
1.4 The OCTO Matrix Approach
Part II - Modern Architectures
2.1 Big Data
2.2 The Death of Moore's Law
2.3 The CAP Theorem
2.4 NoSQL / NewSQL
2.5 Hadoop
2.6 Data Lake
2.7 Streaming Architecture
2.8 Lambda Architecture
2.9 Big Data 2.0 & Kubernetes
2.10 Microservices Architecture
Part III - Takeaways
Definitions 1/3
A software system's architecture is the set of principal design decisions about the system.
Software architecture is the blueprint for a system's construction and evolution.
Design decisions encompass the following aspects of the system under development:
- Structure
- Behaviour
- Interactions
- Non-functional properties
Taylor 2010
"Principal" implies a degree of importance that grants a design decision an "architectural status".
This implies that not all design decisions are architectural; those that are not do not necessarily impact the system's architecture.
How one defines "principal" depends on what the stakeholders define as the system goals.
Definitions 2/3
An architecture is:
- the set of significant decisions about the organization of a software system,
- the selection of the structural elements and their interfaces by which the system is composed,
- together with their behavior as specified in the collaborations among those elements,
- the composition of these structural and behavioral elements into progressively larger subsystems,
- and the architectural style that guides this organization: these elements and their interfaces, their collaborations, and their composition.
RUP – Rational Unified Process
Definitions 3/3
In most successful software projects, the expert developers working on that project have a shared understanding of the system design. This shared understanding is called 'architecture'. This understanding includes how the system is divided into components and how the components interact through interfaces.
Ralph Johnson
Architecture is about stuff that's hard to change later.
Neal Ford
Architecture is about the important stuff.
Martin Fowler
Sidenotes
Any organization that designs a system (defined broadly) will produce a design whose structure
is a copy of the organization's communication structure.
Melvin E. Conway (Conway's law)
... all models are approximations. Essentially, all models are wrong, but some are useful.
However, the approximate nature of the model must always be borne in mind...
George Box
Software Architecture is
A Process: designing a high-level solution
A Product: schemas, models, documentation, prototypes
Means: frameworks, libraries, middleware, etc. that ease the implementation of large systems
A Reality: the working software or Information System
My View
Different Kinds of Architecture
Enterprise Architecture
Enterprise Architecture defines the way the enterprise uses its many applications.
Metaphor: City Planning / City Map
Focus: Strategy / Business
Some Key Concerns:
- Uncover operational gaps
- Understand data dependencies across the IT landscape
- Understand interactions between solutions / applications
- Streamline the application landscape for optimal performance
- Decommission legacy solutions
- Eliminate redundancies
- Identify and avoid tech risks
Solution / Application Architecture
Application architecture defines the various pieces that compose an application.
Metaphor: Building / House Architecture
Focus: Technology / Functional
Some Key Concerns:
- Define a best-fit solution for identified problems
- Ensure the solution meets functional and non-functional requirements
- Understand how the application supports business capabilities
- Understand functional fit, technical fit and risks
- Implement technical processes for application development
Architecture or Design
Architecture, Design, Implementation: a spectrum running from abstraction down to fine granularity / reality.
Architecture
- Process of creating the high-level structures of a software system
- Converts the software characteristics into a high-level structure
- Microservices, serverless, streaming, lambda are some software architecture patterns
- Helps define the high-level structure of the software system
Design
- Process of creating a form of specification of a software artifact that helps implement the software
- Describes all units of a software system to support coding
- Creational, structural and behavioural are some types of software design patterns
- Helps implement the software
The 4+1 Kruchten View Model
Philippe Kruchten defined the 4+1 View Model in 1995, when he was working for Rational Software Corp., to capture the description of a software architecture in multiple complementary views.
The 4+1 View Model is an information organization framework; it consists of logical, process, development, and physical knowledge of an application, plus end-user perspective information.
A view is an aspect (subpart) of the information.
A notation is a way of representing information.
Philippe Kruchten, Architectural Blueprints — The "4+1" View Model of Software Architecture:
"The '4+1' view model is rather 'generic': other notations and tools can be used, other design methods can be used, especially for the logical and process decompositions, but we have indicated the ones we have used with success."
The views span two axes: from Conceptual / Logical to Physical / Operational, and from Functional to Non-functional.
Logical / Structural View
The logical view is concerned with the functionality that the system provides to end-users. UML diagrams used to represent the logical view include the Class diagram, Communication diagram and Sequence diagram.
Implementation / Development View
The development view illustrates a system from a programmer's perspective and is concerned with software management. This view is also known as the implementation view. It uses the UML Component diagram to describe system components. UML diagrams used to represent the development view also include the Package diagram.
Process / Behaviour View
The process view deals with the dynamic aspects of the system, explains the system processes and how they communicate, and focuses on the runtime behavior of the system. The process view addresses concurrency, distribution, integration, performance, scalability, etc. UML diagrams used to represent the process view include the Activity diagram.
Deployment / Physical View
The physical view depicts the system from a system engineer's point of view. It is concerned with the topology of software components on the physical layer, as well as communication between these components. This view is also known as the deployment view. UML diagrams used to represent the physical view include the Deployment diagram.
Use Case / Scenario View
The description of an architecture is illustrated using a small set of use cases, or scenarios, which become a fifth view. The scenarios describe sequences of interactions between objects and/or processes. They are used to identify architectural elements and to illustrate and validate the architecture design. They also serve as a starting point for tests of an architecture prototype. UML diagrams used to represent the scenario view include the Use Case diagram.
For each view: perspective, stage, focus, concerns and typical artifacts.
Logical / Structural View
Perspective: End Users, Business Analysts
Stage: Requirements Analysis
Focus: Components / Objects / Services model - decomposition
Concerns: Functionality
Artifacts:
- Functions schema
- Class / Object diagrams
- (Composite) Structure diagrams
- State Machine diagrams
Process / Behaviour View
Perspective: System Integrators
Stage: Design
Focus: Process decomposition
Concerns: Performance, Scalability, Throughput, Synchronization, Concurrency
Artifacts:
- Sequence diagrams / Activity diagrams
- Communication / Interaction diagrams
- State Machine diagrams
- Timing diagrams
Implementation / Development View
Perspective: Developers, Designers
Stage: Design
Focus: Subsystem decomposition
Concerns: Software / Configuration Management
Artifacts:
- Component diagrams
- Package diagrams
Deployment / Physical View
Perspective: System Engineers
Stage: Design
Focus: Software mapping to hardware (deployment)
Concerns: System topology, Delivery, Installation, Communication
Artifacts:
- Deployment diagrams
- Network / Cluster topology (not UML)
Use Case / Scenario View
Perspective: End User
Stage: Putting it all together
Focus: Understandability, usability
Concerns: Feature decomposition
Artifacts:
- Use Case diagrams
- User Stories (not UML)
- Story Maps (not UML)
The OCTO Architecture Matrix
In 2010, OCTO Technology designed a matrix that presents a 360° overview of most, if not all, of the questions, concerns and aspects that need to be answered and addressed when defining a software architecture.
The questions and concerns relate to different levels of architecture:
- Functional
- Application
- Technical
- System
They regroup different perspectives:
- Security
- Usage
- Services
- Data
- Exchanges
RIAO Functional Architecture
[Diagram: the functional architecture decomposes the RIA Organizer into a Global App (Login, User Management, Search) and three applications: Email, Calendar and Contact. Each application provides management, search and display/edition functions (e.g. Email Management, Folder Management, Email IO, Attachment Management, Appointment Management, Calendar Management, Contact Management), with Text/HTML/RTF composition and display components. Functions are layered into Business / Entry Points, User Interactions, and Services & Functions.]
RIAO Application Architecture
[Diagram: the application architecture is layered into Presentation (Login Page, Profile Edition, Main Page, Search Page; Folder, Email, Calendar, Appointment and Contact View/Edit and Composition views with Text/HTML/RTF composition and display; Email, Calendar, Contact, Search and User controllers; local storage), APIs and Process Orchestration (REST API), Business (Email, Calendar, Contact, Search and User services; Folder, Email, Calendar, Appointment, Contact, Attachment, Search and User management; Email Synchronization, Email IO, Appointment Mapping), Data / Exchanges (User, Folder, Email, Calendar, Appointment, Contact and Attachment models; email, appointment and contact documents plus attachment files stored in MongoDB / GridFS), and Integration (SMTP Server, POP3 Store; CRUD, Fetch and Send operations).]
RIAO Technical Architecture
[Diagram: the technical architecture comprises a User tier (web browser running the RIAO UI: views, forms, models and UI controllers built with JQuery, CKEditor and Bootstrap, plus local storage and session cookies, talking JAX-RS over HTTP/HTTPS through an Apache proxy with an SSL certificate), a Processing tier (RIAO Backend on a Java VM with Spring Boot / Tomcat 8 runtime, Spring Framework, Spring Security and Apache Commons; business services, business managers, DAOs and JSON / object mapping), and an Integration tier (MongoDB client, SMTP and POP3 clients talking to a Courier server on Debian).]
RIAO System Architecture
[Diagram: the user's computer runs a web browser (RIAO UI, local storage) and talks HTTPS over the Internet to an Apache proxy on the RIA Server (Debian Linux, FirewallD, SystemD). The proxy forwards HTTP to the RIAO Backend running on Tomcat (Spring Boot) over OpenJDK 11 / JVM. On the internal network, the backend reaches a Courier mail server (POP3 / SMTP, Courier on Debian) and a MongoDB replica set of three nodes deployed as Docker containers on a Kubernetes cluster, addressed through a K8s service locator. Tiers: Presentation, Processing / Business, Integration.]
Data deluge
5 exabytes of data (5 billion gigabytes) were generated between the first measurements and 2003.
In 2011, this quantity was generated in 2 days.
In 2018, this quantity was generated in 2 minutes.
Source: https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
Our architectures are 30 years old!
[Diagram: the classic enterprise landscape splits into an Operational Information System and an Analytical Information System / Business Intelligence. On the operational side, online and batch business applications plus monitoring / operation applications work on corporate operational data (operational/live, audit/logs, archived data), exposed to an internal GUI space (desktop and web apps) and to an external GUI space (web and mobile apps) through a DMZ. On the analytical side, ETL jobs load operational and external data into a staging database, then into a data warehouse (cleaning / cleansing, enrichment, remapping, historization), feeding data marts for reporting, analytics and querying.]
Moore's law
"The number of transistors and resistors on a chip doubles every 24 months"
- Gordon Moore, 1965
Technical capacity evolution
For the past 40 years, the capabilities of IT components have grown exponentially: Moore's law!
Source: http://radar.oreilly.com/2011/08/building-data-startups.html
Storage cost evolution
While the unit cost keeps decreasing…
[Chart: cost per gigabyte for hard drives and RAM, log scale from $0.01 to $10,000,000, over 1975-2015; annotated points: 1982 at $5M/GB, 2012 at $5/GB.]
Source: http://www.mkomo.com/cost-per-gigabyte
Disk throughput evolution
Issue: throughput always evolves more slowly than capacity.
How to read/write more and more data through a relatively ever-narrower pipe?
[Chart annotations: overall gain x100,000; capacity gain x10,000 in 15 years; throughput gain x50 in 15 years.]
New architectures and paradigms
Key Idea #1: Since the data is too big to fit on one computer, distribute it among many computers (partitioning / sharding)!
Key Idea #2: Run transactions and computations in parallel on multiple (many!) nodes, and scale the grid of CPU, RAM and HDD at the multi-datacenter level.
Key Idea #3: Move the code to the data node, not the data to the computing node (data tier revolution).
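Key Idea #1 can be sketched in a few lines. This is an illustrative sketch only, not any product's implementation; the node names and the modulo-hash placement rule are assumptions:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster nodes

def shard_for(key: str, nodes=NODES) -> str:
    """Assign a record to a node by hashing its key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Each record lands deterministically on one node; the dataset as a whole
# is spread across the cluster instead of living on a single server.
placement = {key: shard_for(key) for key in ["user:1", "user:2", "email:42"]}
```

Real systems use more elaborate schemes (consistent hashing, range partitioning) so that adding a node does not reshuffle every key.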
A bit of history…
The early days of digital data…
Before the 1960s, the data within a computer information system was mostly stored in rather flat files (sometimes indexed), manipulated by top-level software systems.
Directly using flat files was cumbersome and painful…
Various needs emerged at the time:
- Data isolation
- Access efficiency
- Data integrity
- Reducing the time required to develop brand new applications
Something else was required…
Enter the Relational Model…
The relational model has ruled for 40 years!
1969 / Edgar F. Codd - RDBMS
Entities as tables and associations.
The relational model reduces redundancy to optimize disk space usage. At the time of its creation:
- Disk storage was very expensive and limited
- The volume of data in information systems was rather small
Redundancy is avoided to optimize disk space usage, thanks to guarantees of:
- Structure: using normal design forms and modeling techniques
- Coherence: using transaction principles and mechanisms
E.g. an exam grade management app: to display the subjects of a student on their profile screen, one needs to
1. Extract the personal data from the "student" table
2. Fetch the subject ids from the relation table
3. Read the subject titles from the "subject" table
Why, oh why, separate these two kinds of information, since in 95% of the use cases around these data, both will always be used together?!
The origins of NoSQL
The mid and late 2000s were times of major change in the IT landscape:
- Hardware capabilities significantly increased
- eCommerce and internet trade in general exploded
- Some internet companies, the so-called "Web giants" (Yahoo!, Facebook, Google, Amazon, eBay, Twitter, …), pushed traditional databases to their limits. Those databases are by design hard to scale
- With relational DBMSes, the only way to improve performance is scaling up, i.e. getting bigger servers (more CPU, more RAM, more disk, …). One eventually hits a hard limit imposed by the current technology
[Diagram, scaling up: a single database server; investments buy "faster, more storage, more reliable", but from a certain point investments yield little improvement, up to a hard limit.]
The origins (cont'd)
By rethinking the architecture of databases, those companies were able to make them scale at will, by adding more servers to clusters instead of upgrading the servers.
The servers are not made of expensive, high-end hardware; they are qualified as commodity servers (or commodity hardware).
[Diagram, scaling out: a database cluster; power ("faster, more storage, more reliable") grows linearly with the number of servers (linear scalability).]
Data distribution
This is the essence of Big Data!
With most NoSQL databases, the data is not stored in one place (i.e. on one server). It is distributed among the nodes of the cluster. When created, an object A is assigned to a node in the cluster. This is called sharding; the amount of data assigned to a node is called a shard (also called a partition).
Having more cluster nodes implies a higher risk of having some nodes crash, or of a network outage splitting the cluster in two. For this reason, and to avoid data loss, objects are also replicated across the cluster.
The number of copies, called replicas, can be tuned. 3 replicas is a common figure.
[Diagram: four objects A, B, C and D, each stored as three replicas spread over the cluster nodes.]
The objects may move, as nodes crash or new nodes join the cluster, ready to take charge of some of the objects. Such events are usually handled automatically by the cluster; the operation of shuffling objects around to keep a fair repartition of data is called rebalancing.
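Shard-plus-replica placement can be sketched as follows (illustrative only: node names are made up, and the "next nodes around the ring" rule is a simplified stand-in for what real stores do with consistent hashing):

```python
import hashlib

NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical cluster
REPLICAS = 3                            # the common figure cited above

def replica_nodes(key: str, nodes=NODES, replicas=REPLICAS):
    """Pick the primary node by hashing the key, then take the next
    nodes around the ring as replica holders."""
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Object "A" lives on 3 distinct nodes: losing any single node loses no data.
holders = replica_nodes("A")
```

Rebalancing then amounts to recomputing these placements when the node list changes and copying the affected shards.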
The CAP Theorem
Consistency
All clients see exactly the same data at the same time, even in the presence of an update (ACID properties). Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried.
[Diagram: a client sends an update; one replica still holds the old version while the other replicas already hold the new version.]
Availability
The system continues to operate, and all clients can see "a version" of a replica, even in the presence of node failure. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account. That doesn't mean that the request is processed immediately; it may be put on hold. An available system will at a minimum acknowledge it.
[Diagram: a client sends a request and waits for an acknowledgement.]
Partition tolerance
The system continues to operate even when the system is partitioned (some nodes are unavailable). Partition tolerance is verified if a cluster can stand a partition, i.e. if it continues to operate when one or several nodes disappear (node crash, network equipment down, etc.). Partition tolerance is related to availability and consistency, but it is still different: it states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client.
[Diagram: the CAP triangle; a system can sit on the AC, CP or AP edge, but not at all three corners at once.]
The CAP theorem
The previous 3 properties, Consistency, Availability and Partition tolerance, are not independent. The CAP theorem, or Brewer's theorem, states that a distributed system cannot guarantee all 3 properties at the same time.
This is a theorem. That means it is formally true, but in practice it is less severe than it seems:
- The system or a client can often choose CA, AP or CP according to the context, and "walk" along the chosen edge by appropriate tuning
- Partition splits happen, but they are (hopefully) rare events
Rule of thumb:
- Traditional relational DBMSes are CA or CP: consistency is a must; in case of a problem, either bring the cluster down or split it and expect heavy synchronization later
- Many NoSQL DBMSes are AP: availability is a must, and with big clusters failures happen, so better live with it. Consistency is only eventual
[Diagram: the CAP triangle again, with the AC, CP and AP edges; all three corners at once is not possible.]
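The "walk along the chosen edge by appropriate tuning" is often exposed as quorum settings. A minimal sketch of the usual quorum rule, under the assumption of Dynamo-style stores where a write waits for W replicas and a read asks R replicas out of N (the parameter names are generic, not any specific product's):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """With n replicas, a write acknowledged by w nodes and a read asking
    r nodes are guaranteed to overlap on at least one up-to-date replica
    (hence see the latest value) iff r + w > n."""
    return r + w > n

# N=3: writing to 2 and reading from 2 always overlap -> CP-leaning tuning.
cp_leaning = is_strongly_consistent(3, 2, 2)
# Writing to 1 and reading from 1 may miss each other -> AP-leaning tuning.
ap_leaning = is_strongly_consistent(3, 1, 1)
```

The same cluster can thus be tuned request by request toward consistency or toward availability.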
Eventual Consistency
This is essential!
Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried.
With many NoSQL databases, the preferred working mode is AP, and all-the-time consistency is sacrificed.
Favoring performance, updates take a little time to propagate across the cluster. When an entity's value has just been created or modified, there is a short span during which the entity is not consistent.
However, the cluster guarantees that it eventually will be, once replication has occurred. This is called eventual consistency.
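The propagation window described above can be simulated with a toy model (illustrative only: real replication is asynchronous and network-driven, not a function call):

```python
class Replica:
    def __init__(self):
        self.value = None

replicas = [Replica() for _ in range(3)]

def write(value):
    """The coordinator applies the write locally and acknowledges at once;
    the other replicas are still stale: this is the inconsistent window."""
    replicas[0].value = value
    return "ack"

def propagate():
    """Background replication: eventually every replica converges."""
    for r in replicas[1:]:
        r.value = replicas[0].value

write("v2")
inconsistent = len({r.value for r in replicas}) > 1   # stale reads possible here
propagate()
consistent = len({r.value for r in replicas}) == 1    # all replicas converged
```

A client reading a stale replica during the window sees the old value; after propagation, every node answers identically.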
NoSQL / NewSQL
A NoSQL database (originally referring to "non-SQL" or "non-relational") provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century, triggered by the needs of Web 2.0 companies.
NoSQL databases are increasingly used in Big Data and real-time web applications.
NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
Wikipedia - https://en.wikipedia.org/wiki/NoSQL
The fundamental idea behind NoSQL is as follows:
- Because of the need to distribute data (Big Data), the Web giants have abandoned the whole idea of ACID transactions (only eventual consistency is possible)
- So if we drop ACID transactions, which we always deemed to be so fundamental, why wouldn't we challenge all the rest: the relational model and table structure?
The relational model is not always well suited
For data fundamentally structured as tabular data and of a manageable size, the relational model fits. For instance:
- Accounting data
- Customer information
But some other data is modeled in a much more complex way:
- Geospatial data
- Molecular models
Some underlying notions there are fundamentally not relational:
- Hierarchical data
- Several levels of interconnections
In addition, some data models have a high volatility and require flexibility over time:
- The information available at the time of the creation of the model is sometimes incomplete
- Or their inherent structure changes over time
The relational model is not well suited for data experiencing constant structural changes.
NoSQL Database Types
Key/Value pairs (e.g. Redis)
- One key has one (and only one) value
- The value type is not specified (object value); values may have different types
- Issue: it is difficult to fit a model into this modeling pattern
Column-family aka BigTable (e.g. Cassandra)
- A row is a set of columns; sorted vertical storage
- Each row in a table is uniquely identified by a key
- For a given row, the contents of a column can be seen as a hash table with arbitrary (key, value) pairs
- Operations: query by key or set of keys, queries on secondary indexes, selection of the resulting columns
- The column-family model looks a bit like the relational model
Document-oriented (e.g. MongoDB, ES)
- Documents are structured data in the form of hierarchical trees (sub-documents)
- Data can be of various types: strings, numbers, arrays
- Documents are self-supporting: they contain metadata about the structure and the corresponding values
- Several storage formats for documents: XML, JSON, BSON
- In this model, objects are documents, i.e. trees of values; each document has a root and attributes, and attribute values are scalars (integers, strings), lists or other objects
- Each object has a unique ID, a conventional property whose value serves as a key
- Objects are organized into collections; objects in the same collection don't need to have the same schema: there is no mandatory structure
Graph (e.g. Neo4J)
- Based on the interconnection of data (contrary to the other NoSQL solutions, which do not support relations)
- Data is attached not only to nodes but also to edges (property graph)
What is NewSQL?
NewSQL refers to relational databases that have adopted some of the NoSQL genes, thus exposing a relational data model and SQL interfaces on top of distributed, high-volume databases.
NewSQL, contrary to NoSQL, enables an application to keep:
- The relational view on the data
- The SQL query language
- Response times suited to transactional processing
Some were built from scratch (e.g. VoltDB); others are built on top of a NoSQL data store (e.g. SQLFire, backed by GemFire, a key/value store).
The current trend is for some proven NoSQL databases, like Cassandra, to offer a thin SQL interface, achieving the same purpose.
Generally speaking, the frontier between NoSQL and NewSQL is a bit blurry… SQL compliance is often sought, as the key to integrating legacy SQL software (ETL, reporting) with modern No/NewSQL databases.
Hadoop?
Hadoop is an open source platform providing:
- A distributed, scalable and fault-tolerant storage system organized as a grid
- Initially, a single parallelism paradigm: MapReduce, reusing the storage nodes as processing nodes
- Since Hadoop v2 and YARN, other parallelization paradigms can be implemented on Hadoop
- Schemaless storage, optimized for sequential write-once / read-many-times access
- Querying and processing DSLs (Hive, Pig)
Hadoop comes in different distributions:
- Apache Foundation
- Cloudera
- HortonWorks
- MapR
- IBM
- …
Hadoop's origins:
- Initiated by Doug Cutting, leader of Lucene
- Based on Google's publications about their indexing system (GFS / MapReduce / BigTable)
- Official Apache project since 2009
Hadoop was primarily intended for Big Data analytics. Nowadays Hadoop can be an infrastructure for much more:
- Microservices architectures (Hadoop v3)
- Real-time architectures
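The MapReduce paradigm mentioned above can be sketched in pure Python. This is a single-process illustration of the map / shuffle / reduce phases, not Hadoop's actual API; in Hadoop, the shuffle is done by the framework and the map and reduce phases run in parallel on the storage nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word of every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (runs per key, hence parallelizable)."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

This is exactly Key Idea #3 in action: the map function ships to the nodes holding the data, and only the small intermediate pairs travel over the network.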
Hadoop overview
[Diagram: a Hadoop distribution stacks distributed storage and the MapReduce processing engine / parallel computing framework (the core), with surrounding layers for querying, orchestration, machine learning / processing, reporting, IS integration, and supervision and management.]
Data lake
A data lake is a system or repository of data stored in its natural/raw format.
It is usually a single store of data, including raw copies of source system data, sensor data, social data, etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
It can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Wikipedia - https://en.wikipedia.org/wiki/Data_lake
Vision of a data lake
With the continued growth in scope and scale of analytics applications using Hadoop and other data sources, the vision of an enterprise data lake can become a reality.
In a practical sense, a data lake is characterized by three key attributes:
- Collect everything. A data lake contains all data, in big volumes: raw sources kept over extended periods of time, as well as any processed data.
- Dive in anywhere. A data lake enables users across multiple business units to refine, explore and enrich data on their own terms; the analytical structures are not known a priori.
- Flexible access. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost.
Data Lake Application Architecture
[Diagram: sources (databases, raw files, application logs, external data / open APIs, events / messages) feed an INGESTION layer into the DATA LAKE, which combines structured data storage (e.g. relational), semi-structured data storage (NoSQL) and unstructured data storage, with engines for interactive querying, analytics / processing, flow processing and machine learning. A PUBLICATION layer exposes results to the enterprise DWH, operational systems, query / reporting tools, APIs / services, and events / messages.]
Streaming Architectures
Definition: a real-time system is an event-driven system that is available, scalable and stable, able to take decisions (actions) with a latency defined as … below the frequency of events.
In a streaming architecture:
- Historical data is regularly and consistently updated with live data
- Live data is available to the end user
- The two types of data (historical and live) are not necessarily presented consistently to the end user
- Each set of data can have its own screens or even its own application
- A consistent view over both sets of data would be provided by the Lambda Architecture (next topic in this presentation)
Streaming Architecture
[Diagram: structured and unstructured events are captured and fed to a Complex Event Processing engine holding in-memory states and calculations (time windows, operators, rules; Event/Condition/Action; stream-based querying; multi-dimensional analysis), backed by a (distributed) cache with a latency around 100 ms and a rules edition GUI. Decisions / actions are pushed to transactional applications via BPM / ESB. The engine queries reference data, the DWH and services; events are also written to an event history store feeding a historical data GUI, while a real-time data GUI shows live data.]
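The time-window operators at the heart of such an engine can be sketched as a toy in-memory sliding window driving an Event/Condition/Action rule (illustrative only; the window size, the threshold and the event shapes are assumptions):

```python
from collections import deque

WINDOW_SECONDS = 60
THRESHOLD = 3  # hypothetical rule: 3 errors within the window trigger an action

window = deque()  # (timestamp, event) pairs inside the current time window
actions = []

def on_event(timestamp, event):
    """Event: slide the window, then evaluate the Condition, fire the Action."""
    window.append((timestamp, event))
    while window and window[0][0] <= timestamp - WINDOW_SECONDS:
        window.popleft()  # expire events older than the time window
    errors = sum(1 for _, e in window if e == "error")
    if errors >= THRESHOLD:
        actions.append(("alert", timestamp))  # decision / action

for t, e in [(0, "ok"), (10, "error"), (20, "error"), (30, "error"), (100, "ok")]:
    on_event(t, e)
```

A production engine adds exactly the stakes listed on the next slide to this core loop: bounded memory, replication of the in-memory state, and recovery of lost events.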
Streaming Architecture (cont'd)
[Same diagram as the previous slide, annotated with the stakes of each component:]
Stakes for the CEP engine:
- Latency management (< 100 ms)
- Throughput (10,000 msg/sec)
- Memory consumption
- Balancing and replication
- Fault tolerance
- State coherence
- What about lost events?
- Initialization from historical data
Stakes for the GUIs:
- Dynamic GUIs
- Data exploration along axes and criteria
- Real-time GUI: event-driven, "web-push" style
Stakes for the cache:
- High read performance with respect to latency
- Good cache management
Stakes for the event history store:
- High capacity
- High write performance
- High historical-data querying performance
- Flexible design abilities
Stakes for the rules edition GUI:
- "WYSIWYG" editor, usable by business users
- "Hot" updates of rules
- Backtesting
Stakes for the capture layer:
- Throughput (10,000 msg/sec)
- Fault tolerance: message retries?
Real-Time Analytics
What if I want real-time analytics?
- Most data analytics software are batch processing solutions!
- So what happens with updates occurring while a batch is running?
- What happens between two of its executions?
Objectives:
- Take all the data into account
- Be able to answer any kind of request
- Fault tolerance
- Robustness to evolutions and errors
- Scalability!
- Low latency for writing AND reading
[Diagram: a timeline contrasting the processed data (more or less a few minutes to a few hours of data) with the data that arrived after the start of the current batch (a few minutes to a few hours of data).]
λ (Lambda) Architecture
Towards real-time analytics with near-real-time background statistics and models.
[Diagram: the DATA STREAM feeds two layers. The BATCH LAYER runs consistent batch analytics on the comprehensive data and stores pre-computed results / views of the data. The SPEED LAYER runs real-time / streaming analytics on incremental data and stores incremental results / views of the data. A SERVING LAYER aggregates, merges and consolidates both outputs for the querying and reporting tool, with a final latency under 1 second.]
- The batch layer is responsible for consistency and for long-term data storage
- The speed layer only analyzes the required time window: the gap between the last batch execution and the latest real-time data, i.e. only the most recent data
- Both layers produce the same output (unlike usual streaming architectures)
- The serving layer provides a consolidated view over both results
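The serving layer's consolidation step can be sketched as follows (illustrative: the batch and speed views are plain dictionaries of per-key counts, and the key names are made up; the speed view covers only data newer than the last batch run):

```python
def serving_query(key, batch_view, speed_view):
    """Consolidate the pre-computed batch result with the incremental result."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch view: computed over the comprehensive data up to the last batch run.
batch_view = {"page:/home": 10_000, "page:/about": 1_200}
# Speed view: incremental counts for events that arrived since that run.
speed_view = {"page:/home": 42}

total = serving_query("page:/home", batch_view, speed_view)
```

Because both layers produce the same kind of output, the merge is a simple aggregation; when the next batch run completes, the speed view for the covered period is discarded and rebuilt from the new gap.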
λ (Lambda) Architecture
Many solutions exist for each component.
[Diagram: the same lambda layout (data stream, batch layer with pre-computed results / views, speed layer with incremental results / views, serving layer, querying and reporting tool), annotated with example technologies: the serving layer can be reached through Storm DRPC, a Java API or Flink; querying and reporting can be done with D3.js, HighCharts or Tableau.]
76. 76
κ (Kappa) Architecture
Recent stream-processing technologies make the batch layer less necessary
[Diagram: a DATA STREAM feeds a UNIFIED STREAMING LAYER (real-time/streaming analytics on incremental data, with storage of incremental results/views of the data and reload of previous results/views); a SERVING LAYER aggregates, merges and consolidates for a querying and reporting tool; final latency < 1 second]
Kappa architecture is a streaming-first architecture deployment pattern
With the most recent stream-processing technologies (Kafka Streams, Flink, etc.), the interest and relevance of the batch layer tend to diminish. The streaming layer matches the computation abilities of the batch layer (ML, statistics, etc.) and stores the data as it processes it.
A batch layer would only be needed to kick-start the system on historical data (Flink can do that)
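The reprocessing idea behind Kappa can be sketched as follows (names invented; a real deployment would replay a Kafka topic through a new job version rather than a Python list):

```python
# Sketch of the Kappa reprocessing idea (names invented): there is no
# batch layer; to change a computation, replay the immutable event log
# through a new version of the streaming job and swap the output view.

log = [("click", 1), ("view", 1), ("click", 1)]  # durable, replayable stream

def stream_job(events, weight):
    """One 'version' of the streaming computation."""
    view = {}
    for name, n in events:
        view[name] = view.get(name, 0) + n * weight
    return view

v1 = stream_job(log, weight=1)  # view produced by the current job version
v2 = stream_job(log, weight=2)  # "reprocessing": replay the log with new logic
print(v1["click"], v2["click"])
```

The durable log plays the role the batch master dataset plays in Lambda: correctness after a logic change comes from replaying it, not from a separate batch pipeline.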
78. 78
Big Data 2.0
Nowadays, in 2021:
With Hadoop 3, these 3 technologies tend to converge toward the same possibilities. Hadoop 3 supports deploying jobs as Docker containers, just as Mesos and K8s do
Mesos and Kubernetes can use alternatives to HDFS such as Ceph, GlusterFS, MinIO (and of course Amazon, Azure, …), etc.
However, Kubernetes (and/or technologies based on Kubernetes) emerges as a market standard for the Operational IS, just as Hadoop remains a market standard for the Analytical IS
79. 79
Kubernetes is an Open Source platform providing
Automated deployment, scaling, failover and management of software applications across a cluster of nodes
Management of application runtime components as Docker containers and of application units as Pods
Multiple common services required for service location, distributed volume management, etc. (pretty much everything one requires to deploy applications on a Big Data cluster)
Kubernetes
Kubernetes is emerging as a standard
Cloud Operating System
Many distributions
PKS (Pivotal Container Service)
Red Hat OpenShift
Canonical Kubernetes
Google / AWS / Azure …
…
Kubernetes origins
Based on Borg, (one of) Google's initial cluster management system(s)
Released as an open-source project by Google in 2014
First official release in 2015
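As a concrete illustration of the declarative model, a minimal Deployment manifest might look like the sketch below (the application name and image are hypothetical); the control plane then keeps the requested number of replicas running and replaces them on failure:

```yaml
# Minimal sketch of a Kubernetes Deployment (illustrative names):
# asks the cluster to keep 3 replicas of a containerized application
# running, with automated failover and rolling updates.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: demo/app:1.0   # hypothetical image
        ports:
        - containerPort: 8080
```

Applied with `kubectl apply -f`, this is the declarative counterpart of the "automated deployment, scaling, failover" bullet above: the operator states the desired state, and Kubernetes reconciles toward it.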
80. 80
Kubernetes Architecture
[Diagram: client applications reach the cluster through a load-balancing controller / port forwarding. A master node (with an optional secondary master node for HA) hosts the control plane: the API Server, the etcd key-value store and the Controller Manager, driven by kubectl. Each worker node runs a kubelet, a kube-proxy, cAdvisor and Docker, hosting application containers grouped into Pods, a KubeMQ message queue, and volumes replicated across nodes by distributed storage such as Ceph or GlusterFS]
82. 82
Microservice architecture – a variant of the Service-Oriented Architecture (SOA) structural style – arranges an application
as a collection of loosely coupled services. In a microservices architecture, services are fine-grained and the protocols are
lightweight. Its characteristics are as follows:
Services in a microservices architecture (MSA) are small in size, messaging-enabled, bounded by contexts,
autonomously developed, independently deployable, decentralized, and built and released with automated
processes.
Services are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such
as HTTP.
Services are organized around business capabilities.
Services can be implemented using different programming languages, databases, and hardware and software environments,
depending on what fits best.
Microservices Architecture
Origins of microservices:
As early as 2005, Peter Rodgers introduced the
term "Micro-Web-Services" during a presentation
at the Web Services Edge conference.
The name of the architectural style was really
adopted in 2012
Kubernetes democratized the architectural
approach
The two big players in this field are Spring
Cloud and Kubernetes
A microservices-based architecture has the following properties:
Lends itself to a continuous-delivery software development
process. A change to a small part of the application requires
rebuilding and redeploying only one or a small
number of services.
Adheres to principles such as fine-grained interfaces (to
independently deployable services) and business-driven
development (e.g. domain-driven design).
Wikipedia - https://en.wikipedia.org/wiki/Microservices
Martin Fowler
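As a toy illustration of such a service (names and data are hypothetical, and a real deployment would more likely use a framework such as Spring Cloud), a single business capability exposed over a technology-agnostic HTTP/JSON protocol might look like:

```python
# Toy sketch of a single-capability microservice over HTTP/JSON
# (names and data hypothetical; real services would use a framework).
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ContactService(BaseHTTPRequestHandler):
    CONTACTS = {"1": {"name": "Alice"}}  # service-private data store

    def do_GET(self):
        key = self.path.rsplit("/", 1)[-1]
        body = json.dumps(self.CONTACTS.get(key, {})).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ContactService)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/contacts/1"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
server.shutdown()
print(data)  # {'name': 'Alice'}
```

The point is the shape, not the technology: the service owns its data, exposes one business capability, and any client speaking HTTP can consume it regardless of implementation language.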
83. 83
Microservices Architecture
[Diagram: client applications reach the platform through an API Gateway on a master node, which also hosts the service catalog / discovery, management / orchestration, and static content. Each node runs node management, an execution middleware, a service proxy and an MQ, hosting the microservices (Service A … Service E) along with distributed storage replicas (R1–R3)]
84. 84
Ask yourself: do you need microservices?
Microservices are NOT Big Data! [co-local processing]
You don't need microservices or Kubernetes to benefit from Docker
You're not scaling anything with synchronous calls
Don't do microservices unless:
You need independent service-level scalability (vs. storage / processing scalability – Big Data)
You need a strong SOA – Service-Oriented Architecture
You need independent service lifecycle management
Challenges
Distributed caching vs. reloading the world all over again
Not all applications are fit for asynchronous communications (WYCIWYG)
Identifying the proper granularity for services
The enterprise architecture view is too big
The application architecture view is too fine
RIA Organizer: good candidates would be EmailService, CalendarService, ContactService, SearchService
Data consistency without distributed transactions: applications need to be designed with this in mind
Weighing the overall memory and performance waste
A Spring Boot stack + JVM + Linux Docker base image for every single service?
HTTP calls between layers?
Microservices discussion
86. 86
The strong frontier between the Operational IS and the Analytical IS vanishes
NoSQL, streaming, Lambda and Kappa architectures increasingly spill over into the
Operational IS and as such provide a common ground for operational processes and
analytical processes.
Historically strong on the BI side, Hadoop (v3) nowadays fits the needs of the
Operational IS well, while Kubernetes can be useful on the Analytical IS
Kubernetes (also Mesos, etc.) is a cloud Operating System, but not only that (distribution,
scaling, run your cloud locally)
Don't do microservices unless you need microservices … otherwise just do services :-)
Final notes …
Motivation:
On one side: the operational IS with its 3-tier model, and the analytical IS with its D-1 push model
On the other: microservices used left and right
Take a step back and understand what the technology enables and brings
First, briefly walk through architecture description models and introduce a tool that has accompanied me for many years in my work as an architect
Agenda
Typically, the architectural design decisions relate to key aspects:
Structural: typically, "The architectural elements should be organized like this …"
Behavioural: for instance, "Data processing, storage and visualization will be performed in strict sequence."
Interaction: for instance, "Communication among all system elements should occur only using event notification."
Non-functional: for instance, "The system's reliability will be ensured by replicating modules."
A process to design a high-level solution – a process which, unfortunately, is not documented on Wikipedia; understanding it comes from experience, but it is supported by the two tools we will see in a moment
A product – the description of a system's architecture. It cannot be a single diagram. It is often several diagrams, sometimes several times the same one but with slightly varying perspectives, functional and non-functional specifications, technical documentation, etc.
Means – technical foundations, technical or functional libraries, middlewares, etc.
But it is above all a reality. A system's architecture is defined first and foremost by the running system,
and the architect is the person who builds that system, not the person who draws diagrams in their office
Enterprise Architecture vs Application Architecture
Enterprise architecture identifies how the different applications of an information system behave together, as opposed to how the different components behave within an application for application architecture.
The best image to understand this is to consider enterprise architecture as the plan of a city, while the architecture of an application would be the blueprint of a building
There are differences between these two jobs, such as the challenges to address, the scope and the topics covered
But there are also great similarities, such as the tools available to describe them and the questions to ask in order to identify the decision elements
…
Architecture is not quite design, and design is not quite architecture.
But the frontier between these two worlds is subtle and, above all, blurry. Moreover, this frontier depends on the perspective, on its interpretation within a team, etc.
Neal Ford: "Architecture is about stuff that's hard to change later"
That resonates with me. For me, architecture stops at the structuring decisions – both functional and non-functional – about the product to build or the information system as a whole. The elements that can be changed later, that can be refactored, are design, not architecture.
Logical View
…
- Functionalities and their decomposition => identify the functional blocks and their materialization.
Describe or materialize the relationships between functional blocks
To me, the logical view is intimately linked to the story map, even if the granularity and cardinality can vary
Process View
…
Concretely, we seek to identify how the technical-functional blocks interact with each other to deliver the expected functionalities.
To do so, we take into account functional constraints but also non-functional ones (performance, scalability, distribution, etc.)
Implementation View
…
The developer's view, where we want to see the packages and the stereotypes, but also address source-code management concerns.
From my point of view, this is the only view of Kruchten's model which, in the era of IntelliJ, git and maven, may no longer be entirely relevant; we will see an alternative approach in a moment
Physical View
…
This is really the system architecture … the one where we place the software and system components onto the machines on which the application is deployed
Scenario View
Show how all the elements from the previous views work together to deliver the functionalities
More and more, the scenario view is a derivation of the story map … or it is even dropped entirely in favour of a description of the user stories => I won't dwell on it further
=> A lot of documentation on Kruchten's views and the 4 + 1 View Model can be found online
=> Give a few examples of views and of the related design
…
- Group the functional components by business/backend or presentation/UI category
Use a colour code for each functional family
Show the most important associations
Show layers – it's a choice, not necessarily relevant for functional architecture
Also decide to show a few technical components, because they realize important functional elements
In the end, I decided to produce a diagram that allows me to
Present a functional decomposition of the software components
Communicate on how these components will carry the essential functionalities: edit an email, display an email, save an email, send an email, etc.
…
Kruchten takeaways.
- Kruchten's 4 + 1 views form a formalization of the perspectives to describe in software architecture.
An interesting tool that is still relevant (except perhaps the implementation view …)
My criticism would be:
Many people have striven to discuss the formalism while studying Kruchten
The formalism is of no interest … circles --- ASCII art …
A good tool for doing architecture must help you ask yourself the right questions
Propose another tool
I dislike the implementation view; architecture is an abstract formalism for communicating, not necessarily something that strives to describe a technical reality
Finally, the formalism of the 4 + 1 view model (based on UML) naturally tends to overflow from architecture into design (at the application level)
Consumerization: new information technologies emerge first in the consumer market and then spread into businesses
This is a change compared to the previous situation
Companies used to have better servers/desktops/applications/… than those employees could buy at home
Now, new solutions emerge every month: companies can't keep up
New trend: employees are hired with their devices and their applications
BYOD trend: employees are more comfortable and more efficient with their own devices
There is as much power in an iPad now as in a Cray a few years back
This consumerization can be found in infrastructures too and is an enabler for the consumer market
A direct consequence of the consumerization: employees use a mix of professional and personal tools (office suites, Gmail, Google+, Twitter, Facebook, Dropbox, Evernote, …)
Nowadays, several companies still block access to these tools for their employees (private banks). Tomorrow, that won't be possible anymore.
People are used to being connected all the time, with highly efficient devices on highly responsive services, everywhere and for all kinds of uses.
The revolution came from the web giants. They had to find technical answers to business challenges like:
GGL: index the whole web, and keep the response time to any query below one second – or how to keep search free for the user?
LINK: how to understand how millions of users use their website?
AMZ: how to build a product recommendation engine for millions of customers, over millions of products?
EBAY: how to search in eBay ads, even with misspellings?
From the time we started estimating and measuring the amount of data produced up until 2003, 5 exabytes (5 billion gigabytes) had been produced.
In 2011, that quantity was generated in 2 days (think of Facebook, Twitter, Google search logs, financial transaction logs, etc.)
In 2014, this quantity was generated in 10 minutes.
Not only do we generate more and more data
We have the means and the technology to analyze, exploit and mine it and to extract meaningful business insights
The data generated by the company's own systems can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc.
But also external data: Facebook, Twitter logs, etc.
Twitter story: the Uber car transportation system in Paris. A driver refused to carry a customer because the customer was gay. That customer tweeted his misadventure. The driver was excluded by Uber only a few hours later.
Instead of harming Uber's reputation, the story rather gave it credit.
Just an example of how a company can gain significant advantages by monitoring social network feeds
For a long time, the increasing volume of data to be handled was not an issue
The volume of data rises, the number of users rises
The processing abilities rise as well, sometimes even more
See Moore's law above
This model held for a very long time.
Costs go down, computing capacities rise; one simply needs to buy a new machine to absorb the load increase.
This is especially true in the mainframe world
There wasn't even any need to make the architecture of the systems (COBOL, etc.) evolve for 30 years
Even outside the mainframe world
The architecture patterns and styles we use in the operational IS world haven't really evolved in the last 15 years
Despite new technologies such as the Web, Web 2.0, Java, etc., of course
I'm just speaking about architecture and styles
The architecture of analytical systems hasn't evolved in the last 20 years
So everything’s fine ?
No !
As we’ll see, at least two problems emerged relatively recently
1st concern: the throughput
We are able to store more and more data, no problem
Yet we are less and less able to manipulate this data efficiently
Specifically, fetching all the data onto a computation machine to process it is becoming more and more difficult
One challenge: how to handle the massive computation needs / massive amounts of data?
-> New architectures and paradigms are required
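The key paradigm shift – moving the computation to the data rather than the data to the computation – can be sketched with a toy word count in the MapReduce style (data and names are invented): each partition is reduced where it lives, and only the small per-partition summaries cross the network.

```python
# Toy word count in the MapReduce style (data invented): each node
# reduces its local partition; only small summaries cross the network.

partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]  # data on 3 nodes

def map_local(partition):
    """Runs where the data lives; returns a small summary."""
    counts = {}
    for word in partition:
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_all(partials):
    """Merges the per-node summaries into the global result."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

totals = reduce_all(map_local(p) for p in partitions)
print(totals)  # {'a': 3, 'b': 2, 'c': 3}
```

Instead of shipping every record to one machine, only the three small `map_local` summaries move; that is what makes the approach scale with the number of nodes.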
3 ideas …
Availability
Availability (or lack thereof) is a property of the database cluster. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account
That doesn't mean the request is processed immediately; it may be put on hold. An available system will at a minimum acknowledge it
Practically speaking, availability is usually measured as a percentage. For instance, 99.99% availability means that the system is unavailable at most 0.01% of the time, that is, at most 53 min per year
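The 53-minute figure follows directly from the arithmetic:

```python
# Quick check of the availability arithmetic: 99.99% availability
# allows at most 0.01% of a year of downtime.
minutes_per_year = 365 * 24 * 60            # 525600
downtime = round(minutes_per_year * 0.0001)
print(downtime)  # 53 minutes of allowed downtime per year
```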
Partition tolerance
Partition tolerance is verified if a system made of several interconnected nodes can withstand a partition of the cluster, i.e. if it continues to operate when one or several nodes disappear. This happens when nodes crash or when a piece of network equipment is shut down, taking a whole portion of the cluster away
Partition tolerance is related to availability and consistency, but it is still different. It states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client
Consistency
When talking about distributed databases like NoSQL, consistency has a meaning that is somewhat more precise than in the relational context
It refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried
With many NoSQL databases, updates take a little time to propagate across the cluster. When an entity's value has just been created or modified, there is a short span during which the entity is not consistent. However, the cluster guarantees that it eventually will be, once replication has occurred. This is called eventual consistency
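Eventual consistency can be sketched with a toy replica set (all names are invented; real systems replicate asynchronously and often read from quorums): a write is acknowledged by one node, and until replication runs, other nodes still return the old value.

```python
# Toy replica set illustrating eventual consistency (names invented;
# real systems replicate asynchronously and often read from quorums).

replicas = {"node_a": {"k": 1}, "node_b": {"k": 1}, "node_c": {"k": 1}}

def write(node, key, value):
    replicas[node][key] = value        # acknowledged by one node only

def replicate(key):
    latest = replicas["node_a"][key]   # async propagation, simplified
    for store in replicas.values():
        store[key] = latest

write("node_a", "k", 2)
before = {n: s["k"] for n, s in replicas.items()}  # inconsistent window
replicate("k")
after = {n: s["k"] for n, s in replicas.items()}   # eventually consistent
print(before, after)
```

Between the `write` and the `replicate`, the answer a client gets depends on which node it queries; once replication has run, all replicas agree again.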
GFS – 2003 / MapReduce – 2004 / BigTable – 2006
E.g. data deposit / data reservoir, or data hub
The world of operational decisioning.
Potentially many rules, to evolve frequently.
Out of the question to send everything back to the dev team for 3 months:
we must go fast = a business-side analyst must be able to evolve them (= not development work)
be able to imagine new rules and simulate them on historical data (backtesting)