Abstract: Unscientific guess: 90% of companies have neither the data volumes nor the real-time requirements that would justify maintaining a big data streaming infrastructure. Still, these companies need to integrate data in order to improve their products and processes. Some of them nevertheless use Spark to handle a few GB of data, but for the vast majority, running SQL scripts in a simple relational database does the trick. In this talk, I will give recommendations and best practices for setting up data integration infrastructure with open-source technologies. I will explain why PostgreSQL is a perfect fit for building data warehouses with up to a few TB of data. And I will argue that Airflow is probably not the best tool for orchestrating the execution of SQL scripts.
Presented at the Data Council Meetup Kickoff in Berlin
3. !3
You are not Google
@martin_loetzsch
You are also not Amazon, LinkedIn, Uber etc.
https://www.slideshare.net/zhenxiao/real-time-analytics-at-uber-strata-data-2019
https://blog.bradfieldcs.com/you-are-not-google-84912cf44afb
5. All the data of the company in one place
Data is
the single source of truth
easy to access
documented
embedded into the organisation
Integration of different domains
Main challenges
Consistency & correctness
Changeability
Complexity
Transparency
!5
Data warehouse = integrated data
@martin_loetzsch
Nowadays required for running a business
[Diagram: data sources such as application databases, events, CSV files and APIs feed the DWH (orders, users, products, price histories, emails, clicks, operational events, …), which in turn serves reporting, CRM, marketing, search, pricing, …]
8. Warehouse adoption from 2016 to today
Based on three years of segment.io customer adoption
(https://twitter.com/segment/status/1125891660800413697)
BigQuery: when you have terabytes of stuff; for raw event storage & processing
Snowflake: when you are not able to run a database yourself; when you can’t write efficient queries
Redshift: can’t recommend for ETL; when OLAP performance becomes a problem
ClickHouse: for very fast OLAP
!8
If in doubt, use PostgreSQL
@martin_loetzsch
Boring, battle-tested work horse
10. Data pipelines as code
SQL files, python & shell scripts
Structure & content of data warehouse are result of running code
Easy to debug & inspect
Develop locally, test on staging system, then deploy to production
Functional data engineering
Reproducibility, idempotency, (immutability).
Running a pipeline on the same inputs will always produce the same results. No side effects.
https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a
!10
Make changing and testing things easy
@martin_loetzsch
Apply standard software engineering best practices
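The next slide shows such a pipeline as a plain Python script. The execute_sql_file / execute_sql_statement helpers it imports are not a library but a few lines of custom code; a minimal sketch of what they could look like, assuming psycopg2 and a DWH_DSN environment variable (both assumptions, not shown in the deck):

import os
import psycopg2

def execute_sql_statement(sql_statement, echo_queries=True):
    """Runs a single SQL statement (or a semicolon-separated batch) against the DWH"""
    if echo_queries:
        print(sql_statement)
    # connection string e.g. 'host=localhost dbname=dwh user=etl'
    with psycopg2.connect(os.environ['DWH_DSN']) as connection:
        with connection.cursor() as cursor:
            cursor.execute(sql_statement)

def execute_sql_file(file_name, echo_queries=True):
    """Reads a SQL file and runs its content"""
    with open(file_name) as f:
        execute_sql_statement(f.read(), echo_queries=echo_queries)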
11. from execute_sql import execute_sql_file, execute_sql_statement

# create utility functions in PostgreSQL
execute_sql_statement('DROP SCHEMA IF EXISTS util CASCADE; CREATE SCHEMA util;')
execute_sql_statement('CREATE EXTENSION IF NOT EXISTS pg_trgm;')
execute_sql_file('utils/indexes_and_constraints.sql', echo_queries=False)
execute_sql_file('utils/schema_switching.sql', echo_queries=False)
execute_sql_file('utils/consistency_checks.sql', echo_queries=False)

# create tmp and dim_next schema
execute_sql_statement('DROP SCHEMA IF EXISTS c_temp CASCADE; CREATE SCHEMA c_temp;')
execute_sql_statement('DROP SCHEMA IF EXISTS dim_next CASCADE; CREATE SCHEMA dim_next;')

# preprocess / combine entities
execute_sql_file('contacts/preprocess_contacts_dates.sql')
execute_sql_file('contacts/preprocess_contacts.sql')
execute_sql_file('organisations/preprocess_organizations.sql')
execute_sql_file('products/preprocess_products.sql')
execute_sql_file('deals/preprocess_dealflow.sql')
execute_sql_file('deals/preprocess_deal.sql')
execute_sql_file('marketing/preprocess_website_visitors.sql')
execute_sql_file('marketing/create_contacts_performance_attribution.sql')

# create reporting schema, establish foreign keys, transfer aggregates
execute_sql_file('contacts/transform_contacts.sql')
execute_sql_file('organisations/transform_organizations.sql')
execute_sql_file('organisations/create_org_data_set.sql')
execute_sql_file('deals/transform_deal.sql')
execute_sql_file('deals/flatten_deal_fact.sql')
execute_sql_file('deals/create_deal_data_set.sql')
execute_sql_file('marketing/transform_marketing_performance.sql')
execute_sql_file('targets/preprocess_target.sql')
execute_sql_file('targets/transform_target.sql')
execute_sql_file('constrain_tables.sql')

# consistency checks
execute_sql_file('consistency_checks.sql')

# replace the current version of the reporting schema with the next
execute_sql_statement("SELECT util.replace_schema('dim', 'dim_next')")
!11
Simple scripts
@martin_loetzsch
SQL, python & bash
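The final util.replace_schema('dim', 'dim_next') call swaps the freshly built dim_next schema into place, so downstream tools keep querying a schema called dim. The deck does not show the function itself; under the hood it presumably boils down to something like this sketch (an assumption, not the author's implementation):

# drop the currently served schema and promote the freshly built one;
# psycopg2 runs both statements in a single transaction
execute_sql_statement(
    'DROP SCHEMA IF EXISTS dim CASCADE; ALTER SCHEMA dim_next RENAME TO dim;')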
12. !12
Schedule via Jenkins, Rundeck, etc.
@martin_loetzsch
Such a setup is usually enough for a lot of companies
13. Works well
For simple incremental transformations on large partitioned data sets
For moving data
Does not work so well
When you have complex business logic / a lot of pipeline branching
When you can’t partition your data well
Weird workarounds needed
For finding out when something went wrong
For running something again (need to manually delete task instances)
For deploying things while pipelines are running
For dynamic DAGs (deciding at run time what to run)
Usually: BashOperator for everything
High DevOps complexity
!13
When you have actual big data: Airflow
@martin_loetzsch
Caveat: I know only very few teams that are very productive with it
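For context, the "BashOperator for everything" pattern mentioned above typically looks roughly like this minimal sketch (Airflow 1.10-style imports; the DAG id, schedule and commands are made up for illustration):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(dag_id='dwh', start_date=datetime(2019, 1, 1),
          schedule_interval='0 3 * * *', catchup=False)

load_orders = BashOperator(task_id='load_orders',
                           bash_command='python /opt/etl/load_orders.py',
                           dag=dag)

transform_orders = BashOperator(task_id='transform_orders',
                                bash_command='psql dwh -f /opt/etl/transform_orders.sql',
                                dag=dag)

load_orders >> transform_orders  # dependency: load before transform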
14. Jinja-templated queries, DAGs created from dependencies
with source as (
select * from {{var('ticket_tags_table')}}
),
renamed as (
select
--ids
{{dbt_utils.surrogate_key(
'_sdc_level_0_id', '_sdc_source_key_id')}}
as tag_id,
_sdc_source_key_id as ticket_id,
--fields
nullif(lower(value), '') as ticket_tag_value
from source
)
select * from renamed
Additional meta data in YAML files
version: 2
models:
  - name: zendesk_organizations
    columns:
      - name: organization_id
        tests:
          - unique
          - not_null
  - name: zendesk_tickets
    columns:
      - name: ticket_id
        tests:
          - unique
          - not_null
  - name: zendesk_users
    columns:
      - name: user_id
        tests:
          - unique
          - not_null
!14
Decent: dbt (data build tool)
@martin_loetzsch
Prerequisite: Data already in DB, everything can be done in SQL
18. Example pipeline
pipeline = Pipeline(
    id="pypi",
    description="Builds a PyPI downloads cube using the public ..")

# ..

pipeline.add(
    Task(id="transform_python_version", description='..',
         commands=[
             ExecuteSQL(sql_file_name="transform_python_version.sql")
         ]),
    upstreams=['read_download_counts'])

pipeline.add(
    ParallelExecuteSQL(
        id="transform_download_counts", description="..",
        sql_statement="SELECT pypi_tmp.insert_download_counts(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download_counts.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer",
               "transform_python_version"])
!18
ETL pipelines as code
@martin_loetzsch
Pipeline = list of tasks with dependencies between them. Task = list of commands
19. Target of computation
CREATE TABLE m_dim_next.region (
region_id SMALLINT PRIMARY KEY,
region_name TEXT NOT NULL UNIQUE,
country_id SMALLINT NOT NULL,
country_name TEXT NOT NULL,
_region_name TEXT NOT NULL
);
Do computation and store result in table
WITH raw_region
AS (SELECT DISTINCT
country,
region
FROM m_data.ga_session
ORDER BY country, region)
INSERT INTO m_dim_next.region
SELECT
row_number()
OVER (ORDER BY country, region ) AS region_id,
CASE WHEN (SELECT count(DISTINCT country)
FROM raw_region r2
WHERE r2.region = r1.region) > 1
THEN region || ' / ' || country
ELSE region END AS region_name,
dense_rank() OVER (ORDER BY country) AS country_id,
country AS country_name,
region AS _region_name
FROM raw_region r1;
INSERT INTO m_dim_next.region
VALUES (-1, 'Unknown', -1, 'Unknown', 'Unknown');
Speed up subsequent transformations
SELECT util.add_index(
  'm_dim_next', 'region',
  column_names := ARRAY ['_region_name', 'country_name', 'region_id']);
SELECT util.add_index(
'm_dim_next', 'region',
column_names := ARRAY ['country_id', 'region_id']);
ANALYZE m_dim_next.region;
!19
PostgreSQL as a data processing engine
@martin_loetzsch
Leave data in DB, Tables as (intermediate) results of processing steps
20. Execute query
ExecuteSQL(sql_file_name="preprocess-ad.sql")
cat app/data_integration/pipelines/facebook/preprocess-ad.sql
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
Read file
ReadFile(file_name="country_iso_code.csv",
compression=Compression.NONE,
target_table="os_data.country_iso_code",
mapper_script_file_name="read-country-iso-codes.py",
delimiter_char=";")
cat "dwh-data/country_iso_code.csv"
| .venv/bin/python3.6 "app/data_integration/pipelines/load_data/
read-country-iso-codes.py"
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
--command="COPY os_data.country_iso_code FROM STDIN WITH CSV
DELIMITER AS ';'"
Copy from other databases
Copy(sql_file_name="pdm/load-product.sql", source_db_alias="pdm",
target_table="os_data.product",
replace={"@@db@@": "K24Pdm", "@@dbschema@@": "ps",
"@@client@@": "kfzteile24 GmbH"})
cat app/data_integration/pipelines/load_data/pdm/load-product.sql
| sed "s/@@db@@/K24Pdm/g;s/@@dbschema@@/ps/g;s/@@client@@/
kfzteile24 GmbH/g"
| sed 's/$/$/g;s/$/$/g' | (cat && echo ';')
| (cat && echo ';
go')
| sqsh -U ***** -P ******* -S ******* -D K24Pdm -m csv
| PGTZ=Europe/Berlin PGOPTIONS=--client-min-messages=warning
psql --username=mloetzsch --host=localhost --echo-all
--no-psqlrc --set ON_ERROR_STOP=on kfz_dwh_etl
--command="COPY os_data.product FROM STDIN WITH CSV HEADER"
!20
Shell commands as interface to data & DBs
@martin_loetzsch
Nothing is faster than a unix pipe
21. Read a set of files
pipeline.add(
    ParallelReadFile(
        id="read_download",
        description="Loads PyPI downloads from pre_downloaded csv files",
        file_pattern="*/*/*/pypi/downloads-v1.csv.gz",
        read_mode=ReadMode.ONLY_NEW,
        compression=Compression.GZIP,
        target_table="pypi_data.download",
        delimiter_char="\t", skip_header=True, csv_format=True,
        file_dependencies=read_download_file_dependencies,
        date_regex="^(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/",
        partition_target_table_by_day_id=True,
        timezone="UTC",
        commands_before=[
            ExecuteSQL(
                sql_file_name="create_download_data_table.sql",
                file_dependencies=read_download_file_dependencies)
        ]))
Split large joins into chunks
pipeline.add(
    ParallelExecuteSQL(
        id="transform_download",
        description="Maps downloads to their dimensions",
        sql_statement="SELECT pypi_tmp.insert_download(@chunk@::SMALLINT);",
        parameter_function=etl_tools.utils.chunk_parameter_function,
        parameter_placeholders=["@chunk@"],
        commands_before=[
            ExecuteSQL(sql_file_name="transform_download.sql")
        ]),
    upstreams=["preprocess_project_version", "transform_installer"])
!21
Incremental & parallel processing
@martin_loetzsch
You can’t join all clicks with all customers at once
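Without the pipeline framework, the chunking idea boils down to running the same parameterized SQL statement once per chunk, in parallel. A minimal sketch with multiprocessing and psycopg2 (chunk count, connection string and the insert_download function follow the example above; everything else is an assumption):

from multiprocessing import Pool
import psycopg2

NUMBER_OF_CHUNKS = 10  # e.g. rows are assigned to chunks via some_id % 10

def process_chunk(chunk):
    """Processes one chunk of the large join in its own connection/transaction"""
    with psycopg2.connect('dbname=dwh') as connection:
        with connection.cursor() as cursor:
            cursor.execute('SELECT pypi_tmp.insert_download(%s::SMALLINT);', (chunk,))

if __name__ == '__main__':
    with Pool(5) as pool:  # 5 chunks at a time
        pool.map(process_chunk, range(NUMBER_OF_CHUNKS))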
23. Always: At least two out of
VPN
IP restriction
SSH tunnel
Don’t rely on passwords, they tend to be shared and are not
changed when someone leaves the company
SSH keys
Single sign-on
!23
Please don’t put your data on the internet
@martin_loetzsch
Also: GDPR
24. Authenticates each incoming request against an auth provider. We run it in front of all web interfaces
Including Tableau, Jenkins, Metabase
Many auth providers: Google, Azure AD, etc.
Much better than passwords!
!24
SSO with oauth2_proxy
@martin_loetzsch
https://github.com/pusher/oauth2_proxy
27. Ad Blockers
Blind Spots
Not all user interactions happen online (returns, call center requests)
Some data cannot be leaked to the pixel: segments, prices, backend logic
Solution: server-side tracking
Collect events on the server rather than on the client
Use cases & advantages
Ground truth: correct metrics recorded in marketing tools
Price: provide a cheaper alternative to GA Premium / Segment etc.
GDPR compliance: own data, avoid black boxes
Product analytics: understand bottlenecks in the user journey
Unified user journey: combine events from multiple touchpoints
Site speed: Front-ends are not slowed down by analytics pixels
SEO: Measure site indexing by search engines
!27
Pixel based tracking is dead
@martin_loetzsch
Server side tracking: surprisingly easy
28. !28
Works: Kinesis + AWS Lambda + S3 + BigQuery
@martin_loetzsch
Technologies don’t matter: would also work with Google cloud & Azure
[Diagram: user → Project A website → AWS Kinesis Firehose → AWS Lambda → Amazon S3 & Google BigQuery]
29. Everything the backend knows about the user, the context, the content
{
"visitor_id": "10d1fa9c9fd39cff44c88bd551b1ab4dfe92b3da",
"session_id": "9tv1phlqkl5kchajmb9k2j2434",
"timestamp": "2018-12-16T16:03:04+00:00",
"ip": "92.195.48.163",
"url": "https://www.project-a.com/en/career/jobs/data-engineer-data-scientist-m-f-d-4072020002?gh_jid=4082751002&gh_src=9fcd30182&&utm_medium=social&u
"host": "www.project-a.com",
"path": [
"en",
"career",
"jobs",
"data-engineer-data-scientist-m-f-d-4072020002"
],
"query": {
"gh_jid": "4082751002",
"gh_src": "9fcd30182",
"utm_medium": "social",
"utm_source": "linkedin",
"utm_campaign": "buffer"
},
"referrer": null,
"language": "en",
"ua": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36"
}
!29
Backend collects user interaction data
@martin_loetzsch
And sends it asynchronously to a queue
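The next slide shows this for a PHP (Kirby) site; in a Python backend the same idea could look roughly like this sketch with boto3 (stream name and event payload are placeholders, and in a real request handler you would fire this off the critical path, e.g. from a background thread):

import json
import boto3

firehose = boto3.client('firehose')

def track_event(event):
    """Pushes one backend-collected event into the Kinesis Firehose delivery stream"""
    firehose.put_record(
        DeliveryStreamName='kinesis-firehose-stream-123',
        Record={'Data': (json.dumps(event) + '\n').encode('utf-8')})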
30. Kirby tracking plugin for Project A web site
<?php
require 'vendor/autoload.php';
// this cookie is set when not present
$cookieName = 'visitor';
// retrieve visitor id from cookie
$visitorId = array_key_exists($cookieName, $_COOKIE) ?
$_COOKIE[$cookieName] : null;
if (!$visitorId) {
// visitor cookie not set. Use session id as visitor ID
$visitorId = sha1(uniqid(s::id(), true));
setcookie($cookieName, $visitorId,
time() + (2 * 365 * 24 * 60 * 60), '/',
'project-a.com', false, true);
}
$request = kirby()->request();
// the payload to log
$event = [
'visitor_id' => $visitorId,
'session_id' => s::id(),
'timestamp' => date('c'),
'ip' => $request->__debuginfo()['ip'],
'url' => $request->url(),
'host' => $_SERVER['SERVER_NAME'],
'path' => $request->path()->toArray(),
'query' => $request->query()->toArray(),
'referrer' => visitor::referer(),
'language' => visitor::acceptedLanguageCode(),
'ua' => visitor::userAgent()
];
$firehoseClient = new Aws\Firehose\FirehoseClient([
// secrets
]);
// publish message to firehose delivery stream
$promise = $firehoseClient->putRecordAsync([
'DeliveryStreamName' => 'kinesis-firehose-stream-123',
'Record' => ['Data' => json_encode($event)]
]);
register_shutdown_function(function () use ($promise) {
$promise->wait();
});
!30
Custom implementation for each web site
@martin_loetzsch
Good news: there is usually already code that populates the data layer
32. Lambda function part I
import base64
import functools
import json

import geoip2.database
from google.cloud import bigquery
from ua_parser import user_agent_parser

@functools.lru_cache(maxsize=None)
def get_geo_db():
    return geoip2.database.Reader(
        './GeoLite2-City_20181002/GeoLite2-City.mmdb')

def extract_geo_data(ip):
    """Does a geo lookup for an IP address"""
    response = get_geo_db().city(ip)
    return {
        'country_iso_code': response.country.iso_code,
        'country_name': response.country.name,
        'subdivisions_iso_code': response.subdivisions.most_specific.iso_code,
        'subdivisions_name': response.subdivisions.most_specific.name,
        'city_name': response.city.name
    }

def parse_user_agent(user_agent_string):
    """Extracts browser, OS and device information from a user agent"""
    result = user_agent_parser.Parse(user_agent_string)
    return {
        'browser_family': result['user_agent']['family'],
        'browser_version': result['user_agent']['major'],
        'os_family': result['os']['family'],
        'os_version': result['os']['major'],
        'device_brand': result['device']['brand'],
        'device_model': result['device']['model']
    }
!32
Geo Lookup & device detection
@martin_loetzsch
Solved problem, very good open source libraries exist
33. Lambda function part II
def lambda_handler(event, context):
    lambda_output_records = []
    rows_for_bigquery = []
    bq_client = bigquery.Client.from_service_account_json(
        'big-query-credentials.json')
    for record in event['records']:
        message = json.loads(base64.b64decode(record['data']))
        # extract browser, device, os
        if message['ua']:
            message.update(parse_user_agent(message['ua']))
        del message['ua']
        # geo lookup for ip address
        message.update(extract_geo_data(message['ip']))
        del message['ip']
        # convert get parameters into a list of param/value records
        if message['query']:
            message['query'] = [
                {'param': param, 'value': value}
                for param, value in message['query'].items()
            ]
        rows_for_bigquery.append(message)
        lambda_output_records.append({
            'recordId': record['recordId'],
            'result': 'Ok',
            'data': base64.b64encode(
                json.dumps(message).encode('utf-8')).decode('utf-8')
        })
    errors = bq_client.insert_rows(
        bq_client.get_table(
            bq_client.dataset('server_side_tracking')
            .table('project_a_website_events')),
        rows_for_bigquery)
    if errors != []:
        raise Exception(json.dumps(errors))
    return {
        "statusCode": 200,
        "records": lambda_output_records
    }
!33
Send events to BigQuery & S3
@martin_loetzsch
Also possible: send to Google Analytics, Mixpanel, Segment, Heap etc.
36. It’s easy to make mistakes during ETL
DROP SCHEMA IF EXISTS s CASCADE; CREATE SCHEMA s;
CREATE TABLE s.city (
city_id SMALLINT,
city_name TEXT,
country_name TEXT
);
INSERT INTO s.city VALUES
(1, 'Berlin', 'Germany'),
(2, 'Budapest', 'Hungary');
CREATE TABLE s.customer (
customer_id BIGINT,
city_fk SMALLINT
);
INSERT INTO s.customer VALUES
(1, 1),
(1, 2),
(2, 3);
Customers per country?
SELECT
country_name,
count(*) AS number_of_customers
FROM s.customer JOIN s.city
ON customer.city_fk = s.city.city_id
GROUP BY country_name;
Back up all assumptions about data by constraints
ALTER TABLE s.city ADD PRIMARY KEY (city_id);
ALTER TABLE s.city ADD UNIQUE (city_name);
ALTER TABLE s.city ADD UNIQUE (city_name, country_name);
ALTER TABLE s.customer ADD PRIMARY KEY (customer_id);
[23505] ERROR: could not create unique index "customer_pkey"
Detail: Key (customer_id)=(1) is duplicated.
ALTER TABLE s.customer ADD FOREIGN KEY (city_fk)
REFERENCES s.city (city_id);
[23503] ERROR: insert or update on table "customer" violates
foreign key constraint "customer_city_fk_fkey"
Detail: Key (city_fk)=(3) is not present in table "city"
!36
Referential consistency
@martin_loetzsch
Only very little overhead, will save your ass
37. [Schema diagram: customer (customer_id, first_order_fk, favourite_product_fk, lifetime_revenue), product (product_id, revenue_last_6_months), order (order_id, processed_order_id, customer_fk, product_fk, revenue)]
Never repeat “business logic”
SELECT sum(total_price) AS revenue
FROM os_data.order
WHERE status IN ('pending', 'accepted', 'completed',
'proposal_for_change');
SELECT CASE WHEN (status <> 'started'
AND payment_status = 'authorised'
AND order_type <> 'backend')
THEN o.order_id END AS processed_order_fk
FROM os_data.order;
SELECT (last_status = 'pending') :: INTEGER AS is_unprocessed
FROM os_data.order;
Refactor pipeline
Create separate task that computes everything we know about an order
Usually difficult in real life
Load → preprocess → transform → flatten-fact
!37
Computational consistency
@martin_loetzsch
Requires discipline
load-product load-order load-customer
preprocess-product preprocess-order preprocess-customer
transform-product transform-order transform-customer
flatten-product-fact flatten-order-fact flatten-customer-fact
38. CREATE FUNCTION m_tmp.normalize_utm_source(TEXT)
RETURNS TEXT AS $$
SELECT
CASE
WHEN $1 LIKE '%.%' THEN lower($1)
WHEN $1 = '(direct)' THEN 'Direct'
WHEN $1 LIKE 'Untracked%' OR $1 LIKE '(%)'
THEN $1
ELSE initcap($1)
END;
$$ LANGUAGE SQL IMMUTABLE;
CREATE FUNCTION util.norm_phone_number(phone_number TEXT)
RETURNS TEXT AS $$
BEGIN
phone_number := TRIM(phone_number);
phone_number := regexp_replace(phone_number, '\(0\)', '');
phone_number
:= regexp_replace(phone_number, '[^[:digit:]]', '', 'g');
phone_number
:= regexp_replace(phone_number, '^(\+49|0049|49)', '0');
phone_number := regexp_replace(phone_number, '^(00)', '');
phone_number := COALESCE(phone_number, '');
RETURN phone_number;
END;
$$ LANGUAGE PLPGSQL IMMUTABLE;
CREATE FUNCTION m_tmp.compute_ad_id(id BIGINT, api m_tmp.API)
RETURNS BIGINT AS $$
-- creates a collision free ad id from an id in a source system
SELECT ((CASE api
WHEN 'adwords' THEN 1
WHEN 'bing' THEN 2
WHEN 'criteo' THEN 3
WHEN 'facebook' THEN 4
WHEN 'backend' THEN 5
END) * 10 ^ 18) :: BIGINT + id
$$ LANGUAGE SQL IMMUTABLE;
CREATE FUNCTION pv.date_to_supplier_period_start(INTEGER)
RETURNS INTEGER AS $$
-- this maps all dates to either an integer which is included
-- in lieferantenrabatt.period_start or
-- null (meaning we don't have a lieferantenrabatt for it)
SELECT
CASE
WHEN $1 >= 20170501 THEN 20170501
WHEN $1 >= 20151231 THEN 20151231
ELSE 20151231
END;
$$ LANGUAGE SQL IMMUTABLE;
!38
When not possible: use functions
@martin_loetzsch
Almost no performance overhead
39. Check for “lost” rows
SELECT util.assert_equal(
'The order items fact table should contain all order items',
'SELECT count(*) FROM os_dim.order_item',
'SELECT count(*) FROM os_dim.order_items_fact');
Check consistency across cubes / domains
SELECT util.assert_almost_equal(
'The number of first orders should be the same in '
|| 'orders and marketing touchpoints cube',
'SELECT count(net_order_id)
FROM os_dim.order
WHERE _net_order_rank = 1;',
'SELECT (SELECT sum(number_of_first_net_orders)
FROM m_dim.acquisition_performance)
/ (SELECT count(*)
FROM m_dim.performance_attribution_model)',
1.0
);
Check completeness of source data
SELECT util.assert_not_found(
'Each adwords campaign must have the attribute "Channel"',
'SELECT DISTINCT campaign_name, account_name
FROM aw_tmp.ad
JOIN aw_dim.ad_performance ON ad_fk = ad_id
WHERE attributes->>''Channel'' IS NULL
AND impressions > 0
AND _date > now() - INTERVAL ''30 days''');
Check correctness of redistribution transformations
SELECT util.assert_almost_equal_relative(
'The cost of non-converting touchpoints must match the '
|| 'redistributed customer acquisition and reactivation cost',
'SELECT sum(cost)
FROM m_tmp.cost_of_non_converting_touchpoints;',
'SELECT
(SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_acquisition_cost)
+ (SELECT sum(cost_per_touchpoint * number_of_touchpoints)
FROM m_tmp.redistributed_customer_reactivation_cost);',
0.00001);
!39
Data consistency checks
@martin_loetzsch
Makes changing things easy
40. Execute queries and compare results
CREATE FUNCTION util.assert(description TEXT, query TEXT)
RETURNS BOOLEAN AS $$
DECLARE
succeeded BOOLEAN;
BEGIN
EXECUTE query INTO succeeded;
IF NOT succeeded THEN RAISE EXCEPTION 'assertion failed:
# % #
%', description, query;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
CREATE FUNCTION util.assert_almost_equal_relative(
description TEXT, query1 TEXT,
query2 TEXT, percentage DECIMAL)
RETURNS BOOLEAN AS $$
DECLARE
result1 NUMERIC;
result2 NUMERIC;
succeeded BOOLEAN;
BEGIN
EXECUTE query1 INTO result1;
EXECUTE query2 INTO result2;
EXECUTE 'SELECT abs(' || result2 || ' - ' || result1 || ') / '
|| result1 || ' < ' || percentage INTO succeeded;
IF NOT succeeded THEN RAISE WARNING '%
assertion failed: abs(% - %) / % < %
%: (%)
%: (%)', description, result2, result1, result1, percentage,
result1, query1, result2, query2;
END IF;
RETURN succeeded;
END
$$ LANGUAGE 'plpgsql';
!40
Consistency check functions
@martin_loetzsch
Also: assert_not_found, assert_equal_table, assert_smaller_than_or_equal