This document provides an overview of modern software architecture models and concepts. It begins with an introduction to software architecture and its definitions, then presents the Kruchten 4+1 view model for describing an architecture through multiple views. Additional topics include the OCTO matrix approach, example architecture diagrams for a sample application called RIA Organizer, and modern architectures such as big data, microservices and serverless computing.
Agenda
Part I – Software Architecture Models
1.1 Introduction to Software Architecture
1.2 Our illustration example
1.3 The Kruchten 4+1 View Model
1.4 The OCTO Matrix Approach
Part II - Modern Architectures
2.1 Big Data
2.2 The Death of Moore's Law
2.3 The CAP Theorem
2.4 NoSQL / NewSQL
2.5 Hadoop
2.6 Data Lake
2.7 Streaming Architecture
2.8 Lambda Architecture
2.9 Big Data 2.0 & Kubernetes
2.10 Microservices Architecture
Part III - Takeaways
Definitions 1/3
A software system's architecture is the set of principal design decisions about the system.
Software architecture is the blueprint for a system's construction and evolution.
Design decisions encompass the following aspects of the system under development:
- Structure
- Behaviour
- Interactions
- Non-functional properties
Taylor 2010
"Principal" implies a degree of importance that grants a design decision an "architectural status".
This implies that not all design decisions are architectural; those that are not do not necessarily impact the system's architecture.
How one defines "principal" depends on what the stakeholders define as the system goals.
Definitions 2/3
An architecture is:
- the set of significant decisions about the organization of a software system,
- the selection of the structural elements and their interfaces by which the system is composed,
- together with their behavior as specified in the collaborations among those elements,
- the composition of these structural and behavioral elements into progressively larger subsystems,
- and the architectural style that guides this organization: these elements and their interfaces, their collaborations, and their composition.
RUP – Rational Unified Process
Definitions 3/3
In most successful software projects, the expert developers working on that project have a shared understanding of the system design. This shared understanding is called 'architecture'. This understanding includes how the system is divided into components and how the components interact through interfaces.
Ralph Johnson
Architecture is about stuff that's hard to change later.
Neal Ford
Architecture is about the important stuff.
Martin Fowler
Sidenotes
Any organization that designs a system (defined broadly) will produce a design whose structure
is a copy of the organization's communication structure.
Melvin E. Conway (Conway's law)
... all models are approximations. Essentially, all models are wrong, but some are useful.
However, the approximate nature of the model must always be borne in mind...
George Box
Software Architecture is
A Process: designing a high-level solution
A Product: schemas, models, documentation, prototypes
Means: frameworks, libraries, middleware, etc. that ease the implementation of large systems
A Reality: the working software or Information System
My View
Different Kinds of Architecture
Enterprise Architecture
Enterprise Architecture defines the way the enterprise uses its many applications.
Metaphor: City Planning / City Map
Focus: Strategy / Business
Some Key Concerns:
- Uncover operational gaps
- Understand data dependencies across the IT landscape
- Understand interactions between solutions / applications
- Streamline the application landscape for optimal performance
- Decommission legacy solutions
- Eliminate redundancies
- Identify and avoid tech risks
Solution / Application Architecture
Application architecture defines the various pieces that compose an application.
Metaphor: Building / House Architecture
Focus: Technology / Functional
Some Key Concerns:
- Define a best-fit solution for identified problems
- Ensure the solution meets functional and non-functional requirements
- Understand how the application supports business capabilities
- Understand functional fit, technical fit and risks
- Implement technical processes for application development
Architecture or Design
Architecture, Design, Implementation: a spectrum running from abstraction down to fine granularity / reality.
Architecture
- Process of creating the high-level structures of a software system
- Converts the software characteristics into a high-level structure
- Microservices, serverless, streaming, lambda are some software architecture patterns
- Helps define the high-level structure of the software system
Design
- Process of creating a form of specification of a software artifact that helps implement the software
- Describes all units of a software system to support coding
- Creational, structural and behavioural are some types of software design patterns
- Helps implement the software
The 4+1 Kruchten View Model
Philippe Kruchten defined the 4+1 View Model in 1995, when he was working for Rational Software Corp., to capture the description of a software architecture in multiple complementary views.
The 4+1 View Model is an information organization framework; it consists of logical, process, development, and physical knowledge of an application, plus end-user perspective information.
A view is an aspect (subpart) of the information.
A notation is a way of representing information.
Philippe Kruchten, Architectural Blueprints — The "4+1" View Model of Software Architecture:
"The '4+1' view model is rather 'generic': other notations and tools can be used, other design methods can be used, especially for the logical and process decompositions, but we have indicated the ones we have used with success."
The views span two axes: from Conceptual / Logical to Physical / Operational, and from Functional to Non-functional.
Logical / Structural View
The logical view is concerned with the functionality that the system provides to end-users. UML diagrams used to represent the logical view include the Class diagram, Communication diagram and Sequence diagram.
Implementation / Development View
The development view illustrates a system from a programmer's perspective and is concerned with software management. This view is also known as the implementation view. It uses the UML Component diagram to describe system components. UML diagrams used to represent the development view also include the Package diagram.
Process / Behaviour View
The process view deals with the dynamic aspects of the system, explains the system processes and how they communicate, and focuses on the runtime behavior of the system. The process view addresses concurrency, distribution, integration, performance, scalability, etc. UML diagrams used to represent the process view include the Activity diagram.
Deployment / Physical View
The physical view depicts the system from a system engineer's point of view. It is concerned with the topology of software components on the physical layer, as well as communication between these components. This view is also known as the deployment view. UML diagrams used to represent the physical view include the Deployment diagram.
Use Case / Scenario View
The description of an architecture is illustrated using a small set of use cases, or scenarios, which become a fifth view. The scenarios describe sequences of interactions between objects and/or processes. They are used to identify architectural elements and to illustrate and validate the architecture design. They also serve as a starting point for tests of an architecture prototype. UML diagrams used to represent the scenario view include the Use Case diagram.
For each view: perspective, stage, focus, concerns and typical artifacts.
Logical / Structural View
Perspective: End Users, Business Analysts
Stage: Requirements Analysis
Focus: Components / Objects / Services model - decomposition
Concerns: Functionality
Artifacts:
- Functions schema
- Class / Object diagrams
- (Composite) Structure diagrams
- State Machine diagrams
Process / Behaviour View
Perspective: System Integrators
Stage: Design
Focus: Process decomposition
Concerns: Performance, Scalability, Throughput, Synchronization, Concurrency
Artifacts:
- Sequence diagrams / Activity diagrams
- Communication / Interaction diagrams
- State Machine diagrams
- Timing diagrams
Implementation / Development View
Perspective: Developers, Designers
Stage: Design
Focus: Subsystem decomposition
Concerns: Software / Configuration Management
Artifacts:
- Component diagrams
- Package diagrams
Deployment / Physical View
Perspective: System Engineers
Stage: Design
Focus: Software mapping to hardware (deployment)
Concerns: System topology, Delivery, Installation, Communication
Artifacts:
- Deployment diagrams
- Network / Cluster topology (not UML)
Use Case / Scenario View
Perspective: End User
Stage: Putting it all together
Focus: Understandability, usability
Concerns: Feature decomposition
Artifacts:
- Use Case diagrams
- User Stories (not UML)
- Story Maps (not UML)
The OCTO Architecture Matrix
In 2010, OCTO Technology designed a matrix that presents a 360° overview of most, if not all, of the questions, concerns and aspects that need to be answered and addressed when defining a software architecture.
The questions and concerns relate to different levels of architecture:
- Functional
- Application
- Technical
- System
They regroup different perspectives:
- Security
- Usage
- Services
- Data
- Exchanges
RIAO Functional Architecture
[Diagram: the functional architecture decomposes the RIA Organizer into a Global App (Login, User Management, Search) and three applications: Email, Calendar and Contact. Each application provides management, search and display/edition functions (e.g. Email Management, Folder Management, Email IO, Attachment Management, Appointment Management, Calendar Management, Contact Management), with Text/HTML/RTF composition and display components. Functions are layered into Business / Entry Points, User Interactions, and Services & Functions.]
RIAO Application Architecture
[Diagram: the application architecture is layered into Presentation (Login Page, Profile Edition, Main Page, Search Page; Folder, Email, Calendar, Appointment and Contact View/Edit and Composition views with Text/HTML/RTF composition and display; Email, Calendar, Contact, Search and User controllers; local storage), APIs and Process Orchestration (REST API), Business (Email, Calendar, Contact, Search and User services; Folder, Email, Calendar, Appointment, Contact, Attachment, Search and User management; Email Synchronization, Email IO, Appointment Mapping), Data / Exchanges (User, Folder, Email, Calendar, Appointment, Contact and Attachment models; email, appointment and contact documents plus attachment files stored in MongoDB / GridFS), and Integration (SMTP Server, POP3 Store; CRUD, Fetch and Send operations).]
RIAO Technical Architecture
[Diagram: the technical architecture comprises a User tier (web browser running the RIAO UI: views, forms, models and UI controllers built with JQuery, CKEditor and Bootstrap, plus local storage and session cookies, talking JAX-RS over HTTP/HTTPS through an Apache proxy with an SSL certificate), a Processing tier (RIAO Backend on a Java VM with Spring Boot / Tomcat 8 runtime, Spring Framework, Spring Security and Apache Commons; business services, business managers, DAOs and JSON / object mapping), and an Integration tier (MongoDB client, SMTP and POP3 clients talking to a Courier server on Debian).]
RIAO System Architecture
[Diagram: the user's computer runs a web browser (RIAO UI, local storage) and talks HTTPS over the Internet to an Apache proxy on the RIA Server (Debian Linux, FirewallD, SystemD). The proxy forwards HTTP to the RIAO Backend running on Tomcat (Spring Boot) over OpenJDK 11 / JVM. On the internal network, the backend reaches a Courier mail server (POP3 / SMTP, Courier on Debian) and a MongoDB replica set of three nodes deployed as Docker containers on a Kubernetes cluster, addressed through a K8s service locator. Tiers: Presentation, Processing / Business, Integration.]
Data deluge
5 exabytes of data (5 billion gigabytes) were generated between the first measurements and 2003.
In 2011, this quantity was generated in 2 days.
In 2018, this quantity was generated in 2 minutes.
Source: https://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
Our architectures are 30 years old!
[Diagram: the classic enterprise landscape splits into an Operational Information System and an Analytical Information System / Business Intelligence. On the operational side, online and batch business applications plus monitoring / operation applications work on corporate operational data (operational/live, audit/logs, archived data), exposed to an internal GUI space (desktop and web apps) and to an external GUI space (web and mobile apps) through a DMZ. On the analytical side, ETL jobs load operational and external data into a staging database, then into a data warehouse (cleaning / cleansing, enrichment, remapping, historization), feeding data marts for reporting, analytics and querying.]
Moore's law
"The number of transistors and resistors on a chip doubles every 24 months"
- Gordon Moore, 1965
Technical capacity evolution
For the past 40 years, the capabilities of IT components have grown exponentially: Moore's law!
Source: http://radar.oreilly.com/2011/08/building-data-startups.html
Storage cost evolution
While the unit cost keeps decreasing…
[Chart: cost per gigabyte for hard drives and RAM, log scale from $0.01 to $10,000,000, over 1975-2015; annotated points: 1982 at $5M/GB, 2012 at $5/GB.]
Source: http://www.mkomo.com/cost-per-gigabyte
Disk throughput evolution
Issue: throughput always evolves more slowly than capacity.
How to read/write more and more data through a relatively ever-narrower pipe?
[Chart annotations: overall gain x100,000; capacity gain x10,000 in 15 years; throughput gain x50 in 15 years.]
New architectures and paradigms
Key Idea #1: Since the data is too big to fit on one computer, distribute it among many computers (partitioning / sharding)!
Key Idea #2: Run transactions and computations in parallel on multiple (many!) nodes, and scale the grid of CPU, RAM and HDD at the multi-datacenter level.
Key Idea #3: Move the code to the data node, not the data to the computing node (data tier revolution).
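Key Idea #1 can be sketched in a few lines. This is an illustrative sketch only, not any product's implementation; the node names and the modulo-hash placement rule are assumptions:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster nodes

def shard_for(key: str, nodes=NODES) -> str:
    """Assign a record to a node by hashing its key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# Each record lands deterministically on one node; the dataset as a whole
# is spread across the cluster instead of living on a single server.
placement = {key: shard_for(key) for key in ["user:1", "user:2", "email:42"]}
```

Real systems use more elaborate schemes (consistent hashing, range partitioning) so that adding a node does not reshuffle every key.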
A bit of history…
The early days of digital data…
Before the 1960s, the data within a computer information system was mostly stored in rather flat files (sometimes indexed), manipulated by top-level software systems.
Directly using flat files was cumbersome and painful…
Various needs emerged at the time:
- Data isolation
- Access efficiency
- Data integrity
- Reducing the time required to develop brand new applications
Something else was required…
Enter the Relational Model…
The relational model has ruled for 40 years!
1969 / Edgar F. Codd - RDBMS
Entities as tables and associations.
The relational model reduces redundancy to optimize disk space usage. At the time of its creation:
- Disk storage was very expensive and limited
- The volume of data in information systems was rather small
Redundancy is avoided to optimize disk space usage, thanks to guarantees of:
- Structure: using normal design forms and modeling techniques
- Coherence: using transaction principles and mechanisms
E.g. an exam grade management app: to display the subjects of a student on their profile screen, one needs to
1. Extract the personal data from the "student" table
2. Fetch the subject ids from the relation table
3. Read the subject titles from the "subject" table
Why, oh why, separate these two kinds of information, since in 95% of the use cases around these data, both will always be used together?!
The origins of NoSQL
The mid and late 2000s were times of major change in the IT landscape:
- Hardware capabilities significantly increased
- eCommerce and internet trade in general exploded
- Some internet companies, the so-called "Web giants" (Yahoo!, Facebook, Google, Amazon, eBay, Twitter, …), pushed traditional databases to their limits. Those databases are by design hard to scale
- With relational DBMSes, the only way to improve performance is scaling up, i.e. getting bigger servers (more CPU, more RAM, more disk, …). One eventually hits a hard limit imposed by the current technology
[Diagram, scaling up: a single database server; investments buy "faster, more storage, more reliable", but from a certain point investments yield little improvement, up to a hard limit.]
The origins (cont'd)
By rethinking the architecture of databases, those companies were able to make them scale at will, by adding more servers to clusters instead of upgrading the servers.
The servers are not made of expensive, high-end hardware; they are qualified as commodity servers (or commodity hardware).
[Diagram, scaling out: a database cluster; power ("faster, more storage, more reliable") grows linearly with the number of servers (linear scalability).]
Data distribution
This is the essence of Big Data!
With most NoSQL databases, the data is not stored in one place (i.e. on one server). It is distributed among the nodes of the cluster. When created, an object A is assigned to a node in the cluster. This is called sharding; the amount of data assigned to a node is called a shard (also called a partition).
Having more cluster nodes implies a higher risk of having some nodes crash, or of a network outage splitting the cluster in two. For this reason, and to avoid data loss, objects are also replicated across the cluster.
The number of copies, called replicas, can be tuned. 3 replicas is a common figure.
[Diagram: four objects A, B, C and D, each stored as three replicas spread over the cluster nodes.]
The objects may move, as nodes crash or new nodes join the cluster, ready to take charge of some of the objects. Such events are usually handled automatically by the cluster; the operation of shuffling objects around to keep a fair repartition of data is called rebalancing.
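Shard-plus-replica placement can be sketched as follows (illustrative only: node names are made up, and the "next nodes around the ring" rule is a simplified stand-in for what real stores do with consistent hashing):

```python
import hashlib

NODES = ["n1", "n2", "n3", "n4", "n5"]  # hypothetical cluster
REPLICAS = 3                            # the common figure cited above

def replica_nodes(key: str, nodes=NODES, replicas=REPLICAS):
    """Pick the primary node by hashing the key, then take the next
    nodes around the ring as replica holders."""
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replicas)]

# Object "A" lives on 3 distinct nodes: losing any single node loses no data.
holders = replica_nodes("A")
```

Rebalancing then amounts to recomputing these placements when the node list changes and copying the affected shards.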
The CAP Theorem
Consistency
All clients see exactly the same data at the same time, even in the presence of an update (ACID properties). Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried.
[Diagram: a client sends an update; one replica still holds the old version while the other replicas already hold the new version.]
Availability
The system continues to operate, and all clients can see "a version" of a replica, even in the presence of node failure. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account. That doesn't mean that the request is processed immediately; it may be put on hold. An available system will at a minimum acknowledge it.
[Diagram: a client sends a request and waits for an acknowledgement.]
Partition tolerance
The system continues to operate even when the system is partitioned (some nodes are unavailable). Partition tolerance is verified if a cluster can stand a partition, i.e. if it continues to operate when one or several nodes disappear (node crash, network equipment down, etc.). Partition tolerance is related to availability and consistency, but it is still different: it states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client.
[Diagram: the CAP triangle; a system can sit on the AC, CP or AP edge, but not at all three corners at once.]
The CAP theorem
The previous 3 properties, Consistency, Availability and Partition tolerance, are not independent. The CAP theorem, or Brewer's theorem, states that a distributed system cannot guarantee all 3 properties at the same time.
This is a theorem. That means it is formally true, but in practice it is less severe than it seems:
- The system or a client can often choose CA, AP or CP according to the context, and "walk" along the chosen edge by appropriate tuning
- Partition splits happen, but they are (hopefully) rare events
Rule of thumb:
- Traditional relational DBMSes are CA or CP: consistency is a must; in case of a problem, either bring the cluster down or split it and expect heavy synchronization later
- Many NoSQL DBMSes are AP: availability is a must, and with big clusters failures happen, so better live with it. Consistency is only eventual
[Diagram: the CAP triangle again, with the AC, CP and AP edges; all three corners at once is not possible.]
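The "walk along the chosen edge by appropriate tuning" is often exposed as quorum settings. A minimal sketch of the usual quorum rule, under the assumption of Dynamo-style stores where a write waits for W replicas and a read asks R replicas out of N (the parameter names are generic, not any specific product's):

```python
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """With n replicas, a write acknowledged by w nodes and a read asking
    r nodes are guaranteed to overlap on at least one up-to-date replica
    (hence see the latest value) iff r + w > n."""
    return r + w > n

# N=3: writing to 2 and reading from 2 always overlap -> CP-leaning tuning.
cp_leaning = is_strongly_consistent(3, 2, 2)
# Writing to 1 and reading from 1 may miss each other -> AP-leaning tuning.
ap_leaning = is_strongly_consistent(3, 1, 1)
```

The same cluster can thus be tuned request by request toward consistency or toward availability.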
Eventual Consistency
This is essential!
Consistency refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried.
With many NoSQL databases, the preferred working mode is AP, and all-the-time consistency is sacrificed.
Favoring performance, updates take a little time to propagate across the cluster. When an entity's value has just been created or modified, there is a short span during which the entity is not consistent.
However, the cluster guarantees that it eventually will be, once replication has occurred. This is called eventual consistency.
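The propagation window described above can be simulated with a toy model (illustrative only: real replication is asynchronous and network-driven, not a function call):

```python
class Replica:
    def __init__(self):
        self.value = None

replicas = [Replica() for _ in range(3)]

def write(value):
    """The coordinator applies the write locally and acknowledges at once;
    the other replicas are still stale: this is the inconsistent window."""
    replicas[0].value = value
    return "ack"

def propagate():
    """Background replication: eventually every replica converges."""
    for r in replicas[1:]:
        r.value = replicas[0].value

write("v2")
inconsistent = len({r.value for r in replicas}) > 1   # stale reads possible here
propagate()
consistent = len({r.value for r in replicas}) == 1    # all replicas converged
```

A client reading a stale replica during the window sees the old value; after propagation, every node answers identically.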
NoSQL / NewSQL
A NoSQL database (originally referring to "non-SQL" or "non-relational") provides a mechanism for storage and retrieval of data that is modeled in means other than the tabular relations used in relational databases. Such databases have existed since the late 1960s, but the name "NoSQL" was only coined in the early 21st century, triggered by the needs of Web 2.0 companies.
NoSQL databases are increasingly used in Big Data and real-time web applications.
NoSQL systems are also sometimes called "Not only SQL" to emphasize that they may support SQL-like query languages or sit alongside SQL databases in polyglot-persistent architectures.
Wikipedia - https://en.wikipedia.org/wiki/NoSQL
The fundamental idea behind NoSQL is as follows:
- Because of the need to distribute data (Big Data), the Web giants have abandoned the whole idea of ACID transactions (only eventual consistency is possible)
- So if we drop ACID transactions, which we always deemed to be so fundamental, why wouldn't we challenge all the rest: the relational model and table structure?
The relational model is not always well suited
For data fundamentally structured as tabular data and of a manageable size, the relational model fits. For instance:
- Accounting data
- Customer information
But some other data is modeled in a much more complex way:
- Geospatial data
- Molecular models
Some underlying notions there are fundamentally not relational:
- Hierarchical data
- Several levels of interconnections
In addition, some data models have a high volatility and require flexibility over time:
- The information available at the time of the creation of the model is sometimes incomplete
- Or their inherent structure changes over time
The relational model is not well suited for data experiencing constant structural changes.
NoSQL Database Types
Key/Value pairs (e.g. Redis)
- One key has one (and only one) value
- The value type is not specified (object value); values may have different types
- Issue: it is difficult to fit a model into this modeling pattern
Column-family aka BigTable (e.g. Cassandra)
- A row is a set of columns; sorted vertical storage
- Each row in a table is uniquely identified by a key
- For a given row, the contents of a column can be seen as a hash table with arbitrary (key, value) pairs
- Operations: query by key or set of keys, queries on secondary indexes, selection of the resulting columns
- The column-family model looks a bit like the relational model
Document-oriented (e.g. MongoDB, ES)
- Documents are structured data in the form of hierarchical trees (sub-documents)
- Data can be of various types: strings, numbers, arrays
- Documents are self-supporting: they contain metadata about the structure and the corresponding values
- Several storage formats for documents: XML, JSON, BSON
- In this model, objects are documents, i.e. trees of values; each document has a root and attributes, and attribute values are scalars (integers, strings), lists or other objects
- Each object has a unique ID, a conventional property whose value serves as a key
- Objects are organized into collections; objects in the same collection don't need to have the same schema: there is no mandatory structure
Graph (e.g. Neo4J)
- Based on the interconnection of data (contrary to the other NoSQL solutions, which do not support relations)
- Data is attached not only to nodes but also to edges (property graph)
What is NewSQL?
NewSQL refers to relational databases that have adopted some of the NoSQL genes, thus exposing a relational data model and SQL interfaces on top of distributed, high-volume databases.
NewSQL, contrary to NoSQL, enables an application to keep:
- The relational view on the data
- The SQL query language
- Response times suited to transactional processing
Some were built from scratch (e.g. VoltDB); others are built on top of a NoSQL data store (e.g. SQLFire, backed by GemFire, a key/value store).
The current trend is for some proven NoSQL databases, like Cassandra, to offer a thin SQL interface, achieving the same purpose.
Generally speaking, the frontier between NoSQL and NewSQL is a bit blurry… SQL compliance is often sought, as the key to integrating legacy SQL software (ETL, reporting) with modern No/NewSQL databases.
Hadoop?
Hadoop is an open source platform providing:
- A distributed, scalable and fault-tolerant storage system organized as a grid
- Initially, a single parallelism paradigm: MapReduce, reusing the storage nodes as processing nodes
- Since Hadoop v2 and YARN, other parallelization paradigms can be implemented on Hadoop
- Schemaless storage, optimized for sequential write-once / read-many-times access
- Querying and processing DSLs (Hive, Pig)
Hadoop comes in different distributions:
- Apache Foundation
- Cloudera
- HortonWorks
- MapR
- IBM
- …
Hadoop's origins:
- Initiated by Doug Cutting, leader of Lucene
- Based on Google's publications about their indexing system (GFS / MapReduce / BigTable)
- Official Apache project since 2009
Hadoop was primarily intended for Big Data analytics. Nowadays Hadoop can be an infrastructure for much more:
- Microservices architectures (Hadoop v3)
- Real-time architectures
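The MapReduce paradigm mentioned above can be sketched in pure Python. This is a single-process illustration of the map / shuffle / reduce phases, not Hadoop's actual API; in Hadoop, the shuffle is done by the framework and the map and reduce phases run in parallel on the storage nodes:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word of every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values (runs per key, hence parallelizable)."""
    return {word: sum(counts) for word, counts in groups.items()}

counts = reduce_phase(shuffle(map_phase(["big data", "big cluster"])))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

This is exactly Key Idea #3 in action: the map function ships to the nodes holding the data, and only the small intermediate pairs travel over the network.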
Hadoop overview
[Diagram: a Hadoop distribution stacks distributed storage and the MapReduce processing engine / parallel computing framework (the core), with surrounding layers for querying, orchestration, machine learning / processing, reporting, IS integration, and supervision and management.]
Data lake
A data lake is a system or repository of data stored in its natural/raw format.
It is usually a single store of data, including raw copies of source system data, sensor data, social data, etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning.
It can include structured data from relational databases, semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
Wikipedia - https://en.wikipedia.org/wiki/Data_lake
Vision of a data lake
With the continued growth in scope and scale of analytics applications using Hadoop and other data sources, the vision of an enterprise data lake can become a reality.
In a practical sense, a data lake is characterized by three key attributes:
- Collect everything. A data lake contains all data, in big volumes: raw sources kept over extended periods of time, as well as any processed data.
- Dive in anywhere. A data lake enables users across multiple business units to refine, explore and enrich data on their own terms; the analytical structures are not known a priori.
- Flexible access. A data lake enables multiple data access patterns across a shared infrastructure: batch, interactive, online, search, in-memory and other processing engines.
As a result, a data lake delivers maximum scale and insight with the lowest possible friction and cost.
Data Lake Application Architecture
[Diagram: sources (databases, raw files, application logs, external data / open APIs, events / messages) feed an INGESTION layer into the DATA LAKE, which combines structured data storage (e.g. relational), semi-structured data storage (NoSQL) and unstructured data storage, with engines for interactive querying, analytics / processing, flow processing and machine learning. A PUBLICATION layer exposes results to the enterprise DWH, operational systems, query / reporting tools, APIs / services, and events / messages.]
Streaming Architectures
Definition: a real-time system is an event-driven system that is available, scalable and stable, able to take decisions (actions) with a latency defined as … below the frequency of events.
In a streaming architecture:
- Historical data is regularly and consistently updated with live data
- Live data is available to the end user
- The two types of data (historical and live) are not necessarily presented consistently to the end user
- Each set of data can have its own screens or even its own application
- A consistent view over both sets of data would be provided by the Lambda Architecture (next topic in this presentation)
Streaming Architecture
[Diagram: structured and unstructured events are captured and fed to a Complex Event Processing engine holding in-memory states and calculations (time windows, operators, rules; Event/Condition/Action; stream-based querying; multi-dimensional analysis), backed by a (distributed) cache with a latency around 100 ms and a rules edition GUI. Decisions / actions are pushed to transactional applications via BPM / ESB. The engine queries reference data, the DWH and services; events are also written to an event history store feeding a historical data GUI, while a real-time data GUI shows live data.]
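The time-window operators at the heart of such an engine can be sketched as a toy in-memory sliding window driving an Event/Condition/Action rule (illustrative only; the window size, the threshold and the event shapes are assumptions):

```python
from collections import deque

WINDOW_SECONDS = 60
THRESHOLD = 3  # hypothetical rule: 3 errors within the window trigger an action

window = deque()  # (timestamp, event) pairs inside the current time window
actions = []

def on_event(timestamp, event):
    """Event: slide the window, then evaluate the Condition, fire the Action."""
    window.append((timestamp, event))
    while window and window[0][0] <= timestamp - WINDOW_SECONDS:
        window.popleft()  # expire events older than the time window
    errors = sum(1 for _, e in window if e == "error")
    if errors >= THRESHOLD:
        actions.append(("alert", timestamp))  # decision / action

for t, e in [(0, "ok"), (10, "error"), (20, "error"), (30, "error"), (100, "ok")]:
    on_event(t, e)
```

A production engine adds exactly the stakes listed on the next slide to this core loop: bounded memory, replication of the in-memory state, and recovery of lost events.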
Streaming Architecture (cont'd)
[Same diagram as the previous slide, annotated with the stakes of each component:]
Stakes for the CEP engine:
- Latency management (< 100 ms)
- Throughput (10,000 msg/sec)
- Memory consumption
- Balancing and replication
- Fault tolerance
- State coherence
- What about lost events?
- Initialization from historical data
Stakes for the GUIs:
- Dynamic GUIs
- Data exploration along axes and criteria
- Real-time GUI: event-driven, "web-push" style
Stakes for the cache:
- High read performance with respect to latency
- Good cache management
Stakes for the event history store:
- High capacity
- High write performance
- High historical-data querying performance
- Flexible design abilities
Stakes for the rules edition GUI:
- "WYSIWYG" editor, usable by business users
- "Hot" updates of rules
- Backtesting
Stakes for the capture layer:
- Throughput (10,000 msg/sec)
- Fault tolerance: message retries?
Real-Time Analytics
What if I want real-time analytics?
- Most data analytics software are batch processing solutions!
- So what happens with updates occurring while a batch is running?
- What happens between two of its executions?
Objectives:
- Take all the data into account
- Be able to answer any kind of request
- Fault tolerance
- Robustness to evolutions and errors
- Scalability!
- Low latency for writing AND reading
[Diagram: a timeline contrasting the processed data (more or less a few minutes to a few hours of data) with the data that arrived after the start of the current batch (a few minutes to a few hours of data).]
λ (Lambda) Architecture
Towards real-time analytics with near-real-time background statistics and models.
[Diagram: the DATA STREAM feeds two layers. The BATCH LAYER runs consistent batch analytics on the comprehensive data and stores pre-computed results / views of the data. The SPEED LAYER runs real-time / streaming analytics on incremental data and stores incremental results / views of the data. A SERVING LAYER aggregates, merges and consolidates both outputs for the querying and reporting tool, with a final latency under 1 second.]
- The batch layer is responsible for consistency and for long-term data storage
- The speed layer only analyzes the required time window: the gap between the last batch execution and the latest real-time data, i.e. only the most recent data
- Both layers produce the same output (unlike usual streaming architectures)
- The serving layer provides a consolidated view over both results
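The serving layer's consolidation step can be sketched as follows (illustrative: the batch and speed views are plain dictionaries of per-key counts, and the key names are made up; the speed view covers only data newer than the last batch run):

```python
def serving_query(key, batch_view, speed_view):
    """Consolidate the pre-computed batch result with the incremental result."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch view: computed over the comprehensive data up to the last batch run.
batch_view = {"page:/home": 10_000, "page:/about": 1_200}
# Speed view: incremental counts for events that arrived since that run.
speed_view = {"page:/home": 42}

total = serving_query("page:/home", batch_view, speed_view)
```

Because both layers produce the same kind of output, the merge is a simple aggregation; when the next batch run completes, the speed view for the covered period is discarded and rebuilt from the new gap.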
λ (Lambda) Architecture
Many solutions exist for each component.
[Diagram: the same lambda layout (data stream, batch layer with pre-computed results / views, speed layer with incremental results / views, serving layer, querying and reporting tool), annotated with example technologies: the serving layer can be reached through Storm DRPC, a Java API or Flink; querying and reporting can be done with D3.js, HighCharts or Tableau.]
76. 76
κ (Kappa) Architecture
Recent stream-processing technologies make the batch layer less necessary
[Diagram: a DATA STREAM feeds a UNIFIED STREAMING LAYER (real-time/streaming analytics on incremental data, with storage of incremental results/views of the data and reload of previous results/views); a SERVING LAYER aggregates, merges and consolidates for a querying and reporting tool; final latency < 1 second]
Kappa architecture is a streaming-first architecture deployment pattern
With the most recent stream-processing technologies (Kafka Streams, Flink, etc.), the interest and relevance of the batch layer tend to diminish. The streaming layer matches the computation abilities of the batch layer (ML, statistics, etc.) and stores the data as it processes it.
A batch layer would only be needed to kick-start the system on historical data (Flink can do that)
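The reprocessing idea behind Kappa can be sketched as follows (names invented; a real deployment would replay a Kafka topic through a new job version rather than a Python list):

```python
# Sketch of the Kappa reprocessing idea (names invented): there is no
# batch layer; to change a computation, replay the immutable event log
# through a new version of the streaming job and swap the output view.

log = [("click", 1), ("view", 1), ("click", 1)]  # durable, replayable stream

def stream_job(events, weight):
    """One 'version' of the streaming computation."""
    view = {}
    for name, n in events:
        view[name] = view.get(name, 0) + n * weight
    return view

v1 = stream_job(log, weight=1)  # view produced by the current job version
v2 = stream_job(log, weight=2)  # "reprocessing": replay the log with new logic
print(v1["click"], v2["click"])
```

The durable log plays the role the batch master dataset plays in Lambda: correctness after a logic change comes from replaying it, not from a separate batch pipeline.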
78. 78
Big Data 2.0
Nowadays, in 2021:
With Hadoop 3, these 3 technologies tend to converge toward the same possibilities. Hadoop 3 supports deploying jobs as Docker containers, just as Mesos and K8s do
Mesos and Kubernetes can use alternatives to HDFS such as Ceph, GlusterFS, MinIO (and of course Amazon, Azure, …), etc.
However, Kubernetes (and/or technologies based on Kubernetes) emerges as a market standard for the Operational IS, just as Hadoop remains a market standard for the Analytical IS
79. 79
Kubernetes is an Open Source platform providing
Automated deployment, scaling, failover and management of software applications across a cluster of nodes
Management of application runtime components as Docker containers and of application units as Pods
Multiple common services required for service location, distributed volume management, etc. (pretty much everything one requires to deploy applications on a Big Data cluster)
Kubernetes
Kubernetes is emerging as a standard
Cloud Operating System
Many distributions
PKS (Pivotal Container Service)
Red Hat OpenShift
Canonical Kubernetes
Google / AWS / Azure …
…
Kubernetes origins
Based on Borg, (one of) Google's initial cluster management system(s)
Released as an open-source project by Google in 2014
First official release in 2015
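As a concrete illustration of the declarative model, a minimal Deployment manifest might look like the sketch below (the application name and image are hypothetical); the control plane then keeps the requested number of replicas running and replaces them on failure:

```yaml
# Minimal sketch of a Kubernetes Deployment (illustrative names):
# asks the cluster to keep 3 replicas of a containerized application
# running, with automated failover and rolling updates.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
      - name: demo-app
        image: demo/app:1.0   # hypothetical image
        ports:
        - containerPort: 8080
```

Applied with `kubectl apply -f`, this is the declarative counterpart of the "automated deployment, scaling, failover" bullet above: the operator states the desired state, and Kubernetes reconciles toward it.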
80. 80
Kubernetes Architecture
[Diagram: client applications reach the cluster through a load-balancing controller / port forwarding. A master node (with an optional secondary master node for HA) hosts the control plane: the API Server, the etcd key-value store and the Controller Manager, driven by kubectl. Each worker node runs a kubelet, a kube-proxy, cAdvisor and Docker, hosting application containers grouped into Pods, a KubeMQ message queue, and volumes replicated across nodes by distributed storage such as Ceph or GlusterFS]
82. 82
Microservice architecture – a variant of the Service-Oriented Architecture (SOA) structural style – arranges an application
as a collection of loosely coupled services. In a microservices architecture, services are fine-grained and the protocols are
lightweight. Its characteristics are as follows:
Services in a microservices architecture (MSA) are small in size, messaging-enabled, bounded by contexts,
autonomously developed, independently deployable, decentralized, and built and released with automated
processes.
Services are often processes that communicate over a network to fulfill a goal using technology-agnostic protocols such
as HTTP.
Services are organized around business capabilities.
Services can be implemented using different programming languages, databases, and hardware and software environments,
depending on what fits best.
Microservices Architecture
Origins of microservices:
As early as 2005, Peter Rodgers introduced the
term "Micro-Web-Services" during a presentation
at the Web Services Edge conference.
The name of the architectural style was really
adopted in 2012
Kubernetes democratized the architectural
approach
The two big players in this field are Spring
Cloud and Kubernetes
A microservices-based architecture has the following properties:
Lends itself to a continuous-delivery software development
process. A change to a small part of the application requires
rebuilding and redeploying only one or a small
number of services.
Adheres to principles such as fine-grained interfaces (to
independently deployable services) and business-driven
development (e.g. domain-driven design).
Wikipedia - https://en.wikipedia.org/wiki/Microservices
Martin Fowler
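As a toy illustration of such a service (names and data are hypothetical, and a real deployment would more likely use a framework such as Spring Cloud), a single business capability exposed over a technology-agnostic HTTP/JSON protocol might look like:

```python
# Toy sketch of a single-capability microservice over HTTP/JSON
# (names and data hypothetical; real services would use a framework).
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class ContactService(BaseHTTPRequestHandler):
    CONTACTS = {"1": {"name": "Alice"}}  # service-private data store

    def do_GET(self):
        key = self.path.rsplit("/", 1)[-1]
        body = json.dumps(self.CONTACTS.get(key, {})).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the example output quiet
        pass

server = HTTPServer(("127.0.0.1", 0), ContactService)  # ephemeral port
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/contacts/1"
with urllib.request.urlopen(url) as resp:
    data = json.load(resp)
server.shutdown()
print(data)  # {'name': 'Alice'}
```

The point is the shape, not the technology: the service owns its data, exposes one business capability, and any client speaking HTTP can consume it regardless of implementation language.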
83. 83
Microservices Architecture
[Diagram: client applications reach the platform through an API Gateway on a master node, which also hosts the service catalog / discovery, management / orchestration, and static content. Each node runs node management, an execution middleware, a service proxy and an MQ, hosting the microservices (Service A … Service E) along with distributed storage replicas (R1–R3)]
84. 84
Ask yourself: do you need microservices?
Microservices are NOT Big Data! [co-local processing]
You don't need microservices or Kubernetes to benefit from Docker
You're not scaling anything with synchronous calls
Don't do microservices unless:
You need independent service-level scalability (vs. storage / processing scalability – Big Data)
You need a strong SOA – Service-Oriented Architecture
You need independent service lifecycle management
Challenges
Distributed caching vs. reloading the world all over again
Not all applications are fit for asynchronous communications (WYCIWYG)
Identifying the proper granularity for services
The enterprise architecture view is too big
The application architecture view is too fine
RIA Organizer: good candidates would be EmailService, CalendarService, ContactService, SearchService
Data consistency without distributed transactions: applications need to be designed with this in mind
Weighing the overall memory and performance waste
A Spring Boot stack + JVM + Linux Docker base image for every single service?
HTTP calls between layers?
Microservices discussion
86. 86
The strong frontier between the Operational IS and the Analytical IS vanishes
NoSQL, streaming, Lambda and Kappa architectures increasingly spill over into the
Operational IS and as such provide a common ground for operational processes and
analytical processes.
Historically strong on the BI side, Hadoop (v3) nowadays fits the needs of the
Operational IS well, while Kubernetes can be useful on the Analytical IS
Kubernetes (also Mesos, etc.) is a cloud Operating System, but not only that (distribution,
scaling, run your cloud locally)
Don't do microservices unless you need microservices … otherwise just do services :-)
Final notes …
Motivation:
On one side: the operational IS with its 3-tier model, and the analytical IS with its D-1 push model
On the other: microservices used left and right
Take a step back and understand what the technology enables and brings
First, briefly walk through architecture description models and introduce a tool that has accompanied me for many years in my work as an architect
Agenda
Typically, the architectural design decisions relate to key aspects:
Structural: typically, "The architectural elements should be organized like this …"
Behavioural: for instance, "Data processing, storage and visualization will be performed in strict sequence."
Interaction: for instance, "Communication among all system elements should occur only using event notification."
Non-functional: for instance, "The system's reliability will be ensured by replicating modules."
A process to design a high-level solution – a process which, unfortunately, is not documented on Wikipedia; understanding it comes from experience, but it is supported by the two tools we will see in a moment
A product – the description of a system's architecture. It cannot be a single diagram. It is often several diagrams, sometimes several times the same one but with slightly varying perspectives, functional and non-functional specifications, technical documentation, etc.
Means – technical foundations, technical or functional libraries, middlewares, etc.
But it is above all a reality. A system's architecture is defined first and foremost by the running system,
and the architect is the person who builds that system, not the person who draws diagrams in their office
Enterprise Architecture vs Application Architecture
Enterprise architecture identifies how the different applications of an information system behave together, as opposed to how the different components behave within an application for application architecture.
The best image to understand this is to consider enterprise architecture as the plan of a city, while the architecture of an application would be the blueprint of a building
There are differences between these two jobs, such as the challenges to address, the scope and the topics covered
But there are also great similarities, such as the tools available to describe them and the questions to ask in order to identify the decision elements
…
Architecture is not quite design, and design is not quite architecture.
But the frontier between these two worlds is subtle and, above all, blurry. Moreover, this frontier depends on the perspective, on its interpretation within a team, etc.
Neal Ford: "Architecture is about stuff that's hard to change later"
That resonates with me. For me, architecture stops at the structuring decisions – both functional and non-functional – about the product to build or the information system as a whole. The elements that can be changed later, that can be refactored, are design, not architecture.
Logical View
…
- Functionalities and their decomposition => identify the functional blocks and their materialization.
Describe or materialize the relationships between functional blocks
To me, the logical view is intimately linked to the story map, even if the granularity and cardinality can vary
Process View
…
Concretely, we seek to identify how the technical-functional blocks interact with each other to deliver the expected functionalities.
To do so, we take into account functional constraints but also non-functional ones (performance, scalability, distribution, etc.)
Implementation View
…
The developer's view, where we want to see the packages and the stereotypes, but also address source-code management concerns.
From my point of view, this is the only view of Kruchten's model which, in the era of IntelliJ, git and maven, may no longer be entirely relevant; we will see an alternative approach in a moment
Physical View
…
This is really the system architecture … the one where we place the software and system components onto the machines on which the application is deployed
Scenario View
Show how all the elements from the previous views work together to deliver the functionalities
More and more, the scenario view is a derivation of the story map … or it is even dropped entirely in favour of a description of the user stories => I won't dwell on it further
=> A lot of documentation on Kruchten's views and the 4 + 1 View Model can be found online
=> Give a few examples of views and of the related design
…
- Group the functional components by business/backend or presentation/UI category
Use a colour code for each functional family
Show the most important associations
Show layers – it's a choice, not necessarily relevant for functional architecture
Also decide to show a few technical components, because they realize important functional elements
In the end, I decided to produce a diagram that allows me to
Present a functional decomposition of the software components
Communicate on how these components will carry the essential functionalities: edit an email, display an email, save an email, send an email, etc.
…
Kruchten takeaways.
- Kruchten's 4 + 1 views form a formalization of the perspectives to describe in software architecture.
An interesting tool that is still relevant (except perhaps the implementation view …)
My criticism would be:
Many people have striven to discuss the formalism while studying Kruchten
The formalism is of no interest … circles --- ASCII art …
A good tool for doing architecture must help you ask yourself the right questions
Propose another tool
I dislike the implementation view; architecture is an abstract formalism for communicating, not necessarily something that strives to describe a technical reality
Finally, the formalism of the 4 + 1 view model (based on UML) naturally tends to overflow from architecture into design (at the application level)
Consumerization: new information technologies emerge first in the consumer market and then spread into businesses
This is a change compared to the previous situation
Companies used to have better servers/desktops/applications/… than those employees could buy at home
Now, new solutions emerge every month: companies can't keep up
New trend: employees are hired with their devices and their applications
BYOD trend: employees are more comfortable and more efficient with their own devices
There is as much power in an iPad now as in a Cray a few years back
This consumerization can be found in infrastructures too and is an enabler for the consumer market
A direct consequence of the consumerization: employees use a mix of professional and personal tools (office suites, Gmail, Google+, Twitter, Facebook, Dropbox, Evernote, …)
Nowadays, several companies still block access to these tools for their employees (private banks). Tomorrow, that won't be possible anymore.
People are used to being connected all the time, with highly efficient devices on highly responsive services, everywhere and for all kinds of uses.
The revolution came from the web giants. They had to find technical answers to business challenges like:
GGL: index the whole web, and keep the response time to any query below one second – or how to keep search free for the user?
LINK: how to understand how millions of users use their website?
AMZ: how to build a product recommendation engine for millions of customers, over millions of products?
EBAY: how to search in eBay ads, even with misspellings?
From the time we started estimating and measuring the amount of data produced up until 2003, 5 exabytes (5 billion gigabytes) had been produced.
In 2011, that quantity was generated in 2 days (think of Facebook, Twitter, Google search logs, financial transaction logs, etc.)
In 2014, this quantity was generated in 10 minutes.
Not only do we generate more and more data
We have the means and the technology to analyze, exploit and mine it and to extract meaningful business insights
The data generated by the company's own systems can be a very interesting source of information regarding customer behaviours, profiles, trends, desires, etc.
But also external data: Facebook, Twitter logs, etc.
Twitter story: the Uber car transportation system in Paris. A driver refused to carry a customer because the customer was gay. That customer tweeted his misadventure. The driver was excluded by Uber only a few hours later.
Instead of harming Uber's reputation, the story rather gave it credit.
Just an example of how a company can gain significant advantages by monitoring social network feeds
For a long time, the increasing volume of data to be handled was not an issue
The volume of data rises, the number of users rises
The processing abilities rise as well, sometimes even more
See Moore's law above
This model held for a very long time.
Costs go down, computing capacities rise; one simply needs to buy a new machine to absorb the load increase.
This is especially true in the mainframe world
There wasn't even any need to make the architecture of the systems (COBOL, etc.) evolve for 30 years
Even outside the mainframe world
The architecture patterns and styles we use in the operational IS world haven't really evolved in the last 15 years
Despite new technologies such as the Web, Web 2.0, Java, etc., of course
I'm just speaking about architecture and styles
The architecture of analytical systems hasn't evolved in the last 20 years
So everything’s fine ?
No !
As we’ll see, at least two problems emerged relatively recently
1st concern: the throughput
We are able to store more and more data, no problem
Yet we are less and less able to manipulate this data efficiently
Specifically, fetching all the data onto a computation machine to process it is becoming more and more difficult
One challenge: how to handle the massive computation needs / massive amounts of data?
-> New architectures and paradigms are required
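The key paradigm shift – moving the computation to the data rather than the data to the computation – can be sketched with a toy word count in the MapReduce style (data and names are invented): each partition is reduced where it lives, and only the small per-partition summaries cross the network.

```python
# Toy word count in the MapReduce style (data invented): each node
# reduces its local partition; only small summaries cross the network.

partitions = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]  # data on 3 nodes

def map_local(partition):
    """Runs where the data lives; returns a small summary."""
    counts = {}
    for word in partition:
        counts[word] = counts.get(word, 0) + 1
    return counts

def reduce_all(partials):
    """Merges the per-node summaries into the global result."""
    total = {}
    for counts in partials:
        for word, n in counts.items():
            total[word] = total.get(word, 0) + n
    return total

totals = reduce_all(map_local(p) for p in partitions)
print(totals)  # {'a': 3, 'b': 2, 'c': 3}
```

Instead of shipping every record to one machine, only the three small `map_local` summaries move; that is what makes the approach scale with the number of nodes.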
3 ideas …
Availability
Availability (or lack thereof) is a property of the database cluster. The cluster is available if a request made by a client is always acknowledged by the system, i.e. it is guaranteed to be taken into account
That doesn't mean the request is processed immediately; it may be put on hold. An available system will at a minimum acknowledge it
Practically speaking, availability is usually measured as a percentage. For instance, 99.99% availability means that the system is unavailable at most 0.01% of the time, that is, at most 53 min per year
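The 53-minute figure follows directly from the arithmetic:

```python
# Quick check of the availability arithmetic: 99.99% availability
# allows at most 0.01% of a year of downtime.
minutes_per_year = 365 * 24 * 60            # 525600
downtime = round(minutes_per_year * 0.0001)
print(downtime)  # 53 minutes of allowed downtime per year
```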
Partition tolerance
Partition tolerance is verified if a system made of several interconnected nodes can withstand a partition of the cluster, i.e. if it continues to operate when one or several nodes disappear. This happens when nodes crash or when a piece of network equipment is shut down, taking a whole portion of the cluster away
Partition tolerance is related to availability and consistency, but it is still different. It states that the system continues to function internally (e.g. ensuring data distribution and replication), whatever its interactions with a client
Consistency
When talking about distributed databases like NoSQL, consistency has a meaning that is somewhat more precise than in the relational context
It refers to the fact that all replicas of an entity, identified by a key in the database, have the same value whatever the node queried
With many NoSQL databases, updates take a little time to propagate across the cluster. When an entity's value has just been created or modified, there is a short span during which the entity is not consistent. However, the cluster guarantees that it eventually will be, once replication has occurred. This is called eventual consistency
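Eventual consistency can be sketched with a toy replica set (all names are invented; real systems replicate asynchronously and often read from quorums): a write is acknowledged by one node, and until replication runs, other nodes still return the old value.

```python
# Toy replica set illustrating eventual consistency (names invented;
# real systems replicate asynchronously and often read from quorums).

replicas = {"node_a": {"k": 1}, "node_b": {"k": 1}, "node_c": {"k": 1}}

def write(node, key, value):
    replicas[node][key] = value        # acknowledged by one node only

def replicate(key):
    latest = replicas["node_a"][key]   # async propagation, simplified
    for store in replicas.values():
        store[key] = latest

write("node_a", "k", 2)
before = {n: s["k"] for n, s in replicas.items()}  # inconsistent window
replicate("k")
after = {n: s["k"] for n, s in replicas.items()}   # eventually consistent
print(before, after)
```

Between the `write` and the `replicate`, the answer a client gets depends on which node it queries; once replication has run, all replicas agree again.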
GFS – 2003 / MapReduce – 2004 / BigTable – 2006
E.g. data deposit / data reservoir, or data hub
The world of operational decisioning.
Potentially many rules, to evolve frequently.
Out of the question to send everything back to the dev team for 3 months:
we must go fast = a business-side analyst must be able to evolve them (= not development work)
be able to imagine new rules and simulate them on historical data (backtesting)