AVOIDING BAD DATABASE
SURPRISES
Simulation and Scalability
SOME HORROR STORIES
WEB APP DOESN’T SCALE
You’ve got a brilliant app
You’ve got a brilliant cloud deployment
EC2, ALB — all the right moving parts
It still doesn't scale
SLOW
ANALYTICS DON’T SCALE
You’ve studied the data
You’ve got a model that’s hugely important
It explains things
It predicts things
But. It’s SLOW
YOU BLAMED PYTHON
Web:
NGINX,
uWSGI,
the proxy server,
the coffee shop
Analytics:
Pandas,
Scikit Learn,
Jupyter Notebook,
open office layouts
And More…
STACK OVERFLOW SAYS “PROFILE”
So you profiled
and you profiled
And…
It turns out it’s the database
HORROR MOVIE TROPE
There’s a monster
And it’s in your base
And it’s killing your dudes
It’s Your Database
KILLING THE MONSTER
Hard work
Lots of stress
Many Techniques
Indexing
Denormalization
I/O Concurrency (i.e., more devices)
Compression
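As an illustration of the first technique, indexing, here is a minimal sketch using Python's built-in sqlite3 module. The table name card, its columns, and the index name ix_card_rank are hypothetical and chosen only for this example; other databases report their query plans differently.

```python
import sqlite3

# Hypothetical table and index names, used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE card (id INTEGER PRIMARY KEY, suit TEXT, rank INTEGER)")
conn.executemany(
    "INSERT INTO card (suit, rank) VALUES (?, ?)",
    [("HSDC"[i % 4], i % 13 + 1) for i in range(10_000)],
)

def query_plan(connection, sql):
    """Return SQLite's query plan details for sql as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each plan row.
    return "; ".join(row[-1] for row in rows)

QUERY = "SELECT id FROM card WHERE rank = 7"

plan_before = query_plan(conn, QUERY)   # no index yet: the table is scanned
conn.execute("CREATE INDEX ix_card_rank ON card (rank)")
plan_after = query_plan(conn, QUERY)    # the planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

Comparing the two plans shows whether the index is used at all, which is worth knowing before measuring how much time it saves.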
CAN WE PREVENT ALL THIS?
Spoiler Alert: Yes
TO AVOID BECOMING A HORROR STORY
Simulate
Early
Often
A/K/A SYNTHETIC DATA
Why You Don’t Necessarily Need Data for Data Science
https://medium.com/capital-one-tech
HORROR MOVIE TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
DATA IS THE ONLY THING
THAT MATTERS
Foundational Concept
SIDE-BAR ARGUMENT
UX is important, but secondary
But it’s never the kind of bottleneck app and DB servers are
You will also experiment with UX
I’m not saying ignore the UX
Data lasts forever. Data is converted and preserved.
UX is transient. Next release will have a better, more modern experience.
SIMULATE EARLY
Build your data models first
Build the nucleus of your application processing
Build a performance testing environment
With realistic volumes of data
SIMULATE OFTEN
When your data model changes
When you add features
When you acquire more sources of data
Rerun the nucleus of your application processing
With realistic volumes of data
PYTHON MAKES THIS EASY
WHAT YOU’LL NEED
A candidate data model
The nucleus of processing
RESTful API CRUD elements
Analytical Extract-Transform-Load-Map-Reduce steps
A data generator to populate the database
Measurements using time.perf_counter()
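A minimal sketch of taking measurements with time.perf_counter(). The timed() helper and the workload are hypothetical stand-ins; in practice the timed block would be the nucleus of the application processing.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

timings = {}
with timed("build 100k rows", timings):
    rows = [{"id": i, "value": i * 2} for i in range(100_000)]
with timed("sum values", timings):
    total = sum(row["value"] for row in rows)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s")
```

Keeping the timings in a dictionary makes it easy to log them and compare runs as the data model changes.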
DATA MODEL
SQLAlchemy or Django ORM (or others, SQLObject, etc.)
Data Classes
Plain Old Python Objects (POPO) and JSON serialization
If we use JSON Schema validation, we can do cool stuff
THE PROBLEM: FLEXIBILITY
SQL provides minimal type information (string, decimal, date, etc.)
No ranges, enumerated values or other domain details (e.g., name vs. address)
Does provide Primary Key and Foreign Key information
Data classes provide more detailed type information
Still doesn’t include ranges or other domain details
No PK/FK help at all
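A small sketch of the data class limitation. The class name CardRecord is hypothetical. The annotations document the intended types, but nothing checks ranges or enumerated values at runtime.

```python
from dataclasses import dataclass, fields

@dataclass
class CardRecord:
    suit: str   # intended: one of "H", "S", "D", "C"
    rank: int   # intended: 1 through 13

# The dataclass accepts values outside the intended domain without complaint.
bad = CardRecord(suit="hearts", rank=-12)
print(bad)

# The field metadata carries names and types, but no ranges or enums.
print([(f.name, f.type) for f in fields(CardRecord)])
```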
A SOLUTION
Narrow type specifications using JSON Schema
Examples to follow
class Card(Model):
    """
    title: Card
    description: "Simple Playing Cards"
    type: object
    properties:
      suit:
        type: string
        enum: ["H", "S", "D", "C"]
      rank:
        type: integer
        minimum: 1
        maximum: 13
    """
JSON Schema Definition
In YAML Notation
HOW DOES THIS WORK?
A metaclass parses the schema YAML and builds a validator
An abstract superclass provides __init__() to validate the document
import yaml
import json
import jsonschema

class SchemaMeta(type):
    """Metaclass that builds a JSON Schema validator from the class docstring."""
    def __new__(mcs, name, bases, namespace):
        # pylint: disable=protected-access
        result = type.__new__(mcs, name, bases, dict(namespace))
        # safe_load(): yaml.load() requires an explicit Loader in PyYAML 6
        result.SCHEMA = yaml.safe_load(result.__doc__)
        jsonschema.Draft4Validator.check_schema(result.SCHEMA)
        result._validator = jsonschema.Draft4Validator(result.SCHEMA)
        return result
Builds JSONSchema validator
from __doc__ string
class Model(dict, metaclass=SchemaMeta):
    """
    title: Model
    description: abstract superclass for Model
    """
    @classmethod
    def from_json(cls, document):
        # The YAML parser also reads typical JSON text
        return cls(yaml.safe_load(document))

    @property
    def json(self):
        return json.dumps(self)

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        # Collect every schema violation and report them together
        if not self._validator.is_valid(self):
            raise TypeError(list(self._validator.iter_errors(self)))
Validates object and raises TypeError
>>> h1 = Card.from_json('{"suit": "H", "rank": 1}')
>>> h1['suit']
'H'
>>> h1.json
'{"suit": "H", "rank": 1}'
Deserialize POPO from JSON text
Serialize POPO into JSON text
>>> d = Card.from_json('{"suit": "hearts", "rank": -12}')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 8, in from_json
File "<stdin>", line 15, in __init__
TypeError: [<ValidationError: "'hearts' is not one of
['H', 'S', 'D', 'C']">, <ValidationError: '-12 is less
than the minimum of 1'>]
Fail to deserialize invalid POPO from JSON text
WHY?
JSON Schema allows us to provide
Type (string, number, integer, boolean, array, or object)
Ranges for numerics
Enumerated values (for numbers or strings)
Format for strings (e.g., email, uri, date-time, etc.)
Text Patterns for strings (more general regular expression handling)
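One detail worth knowing about text patterns: a JSON Schema pattern is not implicitly anchored, so it matches if the expression is found anywhere in the string. The sketch below mimics that behavior with re.search(); the matches_pattern() helper is hypothetical and is not part of any JSON Schema library.

```python
import re

def matches_pattern(pattern, text):
    """Mimic JSON Schema's unanchored pattern semantics with re.search()."""
    return re.search(pattern, text) is not None

loose = r"\d{16}"      # 16 digits anywhere in the string
strict = r"^\d{16}$"   # exactly 16 digits and nothing else

print(matches_pattern(loose, "card 4111111111111111 on file"))   # True
print(matches_pattern(strict, "card 4111111111111111 on file"))  # False
print(matches_pattern(strict, "4111111111111111"))               # True
```

Anchoring with ^ and $ is usually what a field-level constraint intends.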
DATABASE SIMULATION
AHA
With JSON schema we can build simulated data
THERE ARE SIX SCHEMA TYPES
null — Always None
integer — Use const, enum, minimum, maximum constraints
number — Use const, enum, minimum, maximum constraints
string — Use const, enum, format, or pattern constraints
There are 17 defined formats to narrow the constraints
array — recursively expand items to build an array
object — recursively expand properties to build a document
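A sketch of how the integer and number types might be generated from their constraints. These are standalone functions, not the author's implementation; the _bounds() helper and its default span of 1000 for a missing bound are assumptions, and integer schemas are assumed to have integer bounds.

```python
import random

def _bounds(schema, span=1000):
    """Return (lo, hi); the span used for a missing bound is an arbitrary assumption."""
    lo, hi = schema.get("minimum"), schema.get("maximum")
    if lo is None and hi is None:
        return 0, span
    if lo is None:
        return hi - span, hi
    if hi is None:
        return lo, lo + span
    return lo, hi

def gen_integer(schema):
    """Pick an integer honoring const, enum, minimum, and maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.randint(lo, hi)       # inclusive on both ends

def gen_number(schema):
    """Pick a float honoring const, enum, minimum, and maximum."""
    if "const" in schema:
        return schema["const"]
    if "enum" in schema:
        return random.choice(schema["enum"])
    lo, hi = _bounds(schema)
    return random.uniform(lo, hi)

rank_schema = {"type": "integer", "minimum": 1, "maximum": 13}
ranks = [gen_integer(rank_schema) for _ in range(1000)]
print(min(ranks), max(ranks))
```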
import random

class Generator:
    def __init__(self, parent_schema, domains=None):
        self.schema = parent_schema

    def gen_null(self, schema):
        return None

    def gen_string(self, schema): …
    def gen_integer(self, schema): …
    def gen_number(self, schema): …

    def gen_array(self, schema):
        # Assumption: the slide leaves lo and hi undefined; here the array
        # length is drawn from the schema's minItems..maxItems.
        lo = schema.get('minItems', 0)
        hi = schema.get('maxItems', lo)
        doc = [self.generate(schema.get('items'))
               for _ in range(random.randint(lo, hi))]
        return doc

    def gen_object(self, schema):
        doc = {
            name: self.generate(subschema)
            for name, subschema in schema.get('properties', {}).items()
        }
        return doc

    def generate(self, schema=None):
        schema = schema or self.schema
        schema_type = schema.get('type', 'object')
        # Dispatch on the schema type name, e.g. 'string' -> gen_string()
        method = getattr(self, f"gen_{schema_type}")
        return method(schema)
Finds gen_* methods
def make_documents(model_class, count=100, domains=None):
    generator = Generator(model_class.SCHEMA, domains)
    docs_iter = (generator.generate() for i in range(count))
    for doc in docs_iter:
        print(model_class(**doc))
Or write to a file
Or load a database
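A sketch of the "load a database" option, using sqlite3 as a stand-in for the real database. The fake_card() function substitutes for a schema-driven generator, and the table name card_doc is hypothetical. Loading at several volumes shows how the load time grows.

```python
import json
import random
import sqlite3
import time

def fake_card():
    """Stand-in for a schema-driven generator; returns one synthetic document."""
    return {"suit": random.choice("HSDC"), "rank": random.randint(1, 13)}

def load_documents(connection, docs):
    """Bulk-insert documents as JSON text and return the elapsed seconds."""
    start = time.perf_counter()
    with connection:  # commits once when the block exits
        connection.executemany(
            "INSERT INTO card_doc (body) VALUES (?)",
            ((json.dumps(doc),) for doc in docs),
        )
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE card_doc (id INTEGER PRIMARY KEY, body TEXT)")

for volume in (1_000, 10_000, 100_000):
    elapsed = load_documents(conn, (fake_card() for _ in range(volume)))
    print(f"{volume:>7} rows loaded in {elapsed:.3f}s")

row_count = conn.execute("SELECT COUNT(*) FROM card_doc").fetchone()[0]
print("total rows:", row_count)
```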
NOW YOU CAN SIMULATE
Early Often
WHAT ABOUT?
More sophisticated data domains?
Name, Address, Comments, etc.
More than text. No simple format.
Primary Key and Foreign Key Relationships
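One possible way to handle primary key and foreign key relationships, sketched under assumptions: generate the parent rows first, then draw each child's foreign key from the parent keys that already exist. The customer and order entities and their field names are hypothetical.

```python
import random

def make_customers(count):
    """Generate parent rows with sequential surrogate primary keys."""
    return [
        {"customer_id": n, "name": f"Customer {n}"}
        for n in range(1, count + 1)
    ]

def make_orders(count, customers):
    """Generate child rows whose foreign keys reference existing parents."""
    parent_keys = [c["customer_id"] for c in customers]
    return [
        {
            "order_id": n,
            "customer_id": random.choice(parent_keys),  # always a valid FK
            "amount": round(random.uniform(1, 500), 2),
        }
        for n in range(1, count + 1)
    ]

customers = make_customers(100)
orders = make_orders(1_000, customers)
print(orders[0])
```

random.choice() spreads the children uniformly across parents. Real data is often skewed, so a weighted choice may give more realistic join and index behavior.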
HANDLING FORMATS
def gen_string(self, schema):
    if 'const' in schema:
        return schema['const']
    elif 'enum' in schema:
        return random.choice(schema['enum'])
    elif 'format' in schema:
        return FORMATS[schema['format']]()
    else:
        return "string"
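import datetime
import random

# random.randrange() needs integer bounds, so truncate the float timestamps
TS_RANGE = (int(datetime.datetime(1900, 1, 1).timestamp()),
            int(datetime.datetime(2100, 12, 31).timestamp()))
# Note: utcfromtimestamp() is deprecated as of Python 3.12
FORMATS = {
    'date-time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).isoformat()
    ),
    'date': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).date().isoformat()
    ),
    'time': (
        lambda: datetime.datetime.utcfromtimestamp(
            random.randrange(*TS_RANGE)
        ).time().isoformat()
    ),
    # any further formats are not shown on the slide
}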
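The same table-of-lambdas approach can extend to other JSON Schema formats. The sketch below covers hostname, email, and ipv4; the EXTRA_FORMATS name, the _label() helper, and the use of the reserved example.com domain are assumptions made for illustration, not part of the original slides.

```python
import random
import string

def _label(length=8):
    """Return a random lowercase label, e.g. for a host or mailbox name."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))

# Hypothetical additions, keyed by JSON Schema format name.
EXTRA_FORMATS = {
    "hostname": lambda: f"{_label()}.example.com",
    "email": lambda: f"{_label()}@{_label()}.example.com",
    "ipv4": lambda: ".".join(str(random.randint(0, 255)) for _ in range(4)),
}

sample = {name: make() for name, make in EXTRA_FORMATS.items()}
print(sample)
```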
DATA DOMAINS
String format (and enum) may not be enough to characterize data
Doing Text Search or Indexing? You want text-like data
Using Names or Addresses? Random strings may not be appropriate.
Credit card numbers? You want 16-digit strings
EXAMPLE DOMAIN: DIGITS
def digits(n):
    # Draw from all ten digits; the slide's '012345789' omitted the 6
    return ''.join(random.choice('0123456789') for _ in range(n))
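If the application validates card numbers, random 16-digit strings will mostly be rejected. A sketch of one refinement: append a Luhn check digit so the synthetic numbers pass a checksum test. The function names here are hypothetical, and the numbers are synthetic and not tied to any issuer.

```python
import random

def luhn_check_digit(partial):
    """Compute the Luhn check digit to append to a string of digits."""
    total = 0
    # Walk right to left, doubling every second digit starting with the
    # rightmost digit of partial (it sits next to the future check digit).
    for position, char in enumerate(reversed(partial)):
        digit = int(char)
        if position % 2 == 0:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return str((10 - total % 10) % 10)

def luhn_valid(number):
    """Return True if the full digit string passes the Luhn checksum."""
    total = 0
    for position, char in enumerate(reversed(number)):
        digit = int(char)
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0

def card_number(length=16):
    """Generate a synthetic, Luhn-valid digit string of the given length."""
    body = "".join(random.choice("0123456789") for _ in range(length - 1))
    return body + luhn_check_digit(body)

print(card_number())
```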
EXAMPLE DOMAIN: NAMES
class LoremIpsum:
    _phrases = [
        "Lorem ipsum dolor sit amet",
        "consectetur adipiscing elit",
        # …etc.…
        "mollis eleifend leo venenatis"
    ]

    @staticmethod
    def word():
        # Pick a random phrase, then a random word from that phrase
        return random.choice(random.choice(LoremIpsum._phrases).split())

    @staticmethod
    def name():
        return ' '.join(LoremIpsum.word() for _ in range(3)).title()
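The slides do not show how domain functions connect to the generator. One possible wiring, sketched under assumptions: a non-standard "domain" keyword in the schema names an entry in a registry of generator functions. The keyword, the DOMAINS registry, and the customer schema are all hypothetical.

```python
import random

def digits(n):
    """Return a string of n random decimal digits."""
    return "".join(random.choice("0123456789") for _ in range(n))

WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def name():
    """Return three random words in title case as a name-like string."""
    return " ".join(random.choice(WORDS) for _ in range(3)).title()

# Hypothetical registry: maps a custom "domain" keyword to a generator function.
DOMAINS = {"name": name, "card_number": lambda: digits(16)}

def gen_string(schema, domains):
    """Prefer a domain-specific generator; otherwise return a placeholder."""
    domain = schema.get("domain")   # assumed, non-standard schema keyword
    if domain in domains:
        return domains[domain]()
    return "string"

customer_schema = {
    "type": "object",
    "properties": {
        "full_name": {"type": "string", "domain": "name"},
        "card": {"type": "string", "domain": "card_number"},
        "note": {"type": "string"},
    },
}

customer = {
    prop: gen_string(subschema, DOMAINS)
    for prop, subschema in customer_schema["properties"].items()
}
print(customer)
```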
RECAP
HOW TO GET INTO TROUBLE
Faith
Have faith the best practices you read in a blog really work
Assume
Assume you understand best practices you read in a blog
Hope
Hope you will somehow avoid scaling problems
SOME TROPES
Look behind the door
Don’t run into the dark barn alone
Avoid the trackless forest after dark
Stay with your friends
Don’t dismiss the funny noises
TO DO CHECKLIST
Simulate Early and Often
Define Python Classes
Use JSON Schema to provide fine-grained definitions
With ranges, formats, enums
Build a generator to populate instances in bulk
Gather Performance Data
Profit
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
