Slides for the webinar by Adam Brown, Mux and Robert Hodges, Altinity
Mux.com enables content providers to stream video to vast audiences while maintaining pinpoint control over performance. Join us as Adam Brown, a co-founder of Mux and video expert, explains the role that ClickHouse plays in content delivery. We'll start with an overview of the Mux platform and the importance of real-time feedback. We'll then discuss key features of ClickHouse that enable it to ingest live session data and provide real-time dashboards to Mux users. Finally, we'll talk a bit about the Mux journey to ClickHouse and lessons learned along the way about how data enables content delivery networks.
Bios:
Adam Brown is a co-founder and Head of Technology and Architecture at Mux. He has worked extensively with video, including experience at Zencoder where he and other Mux co-founders pioneered cloud-based video encoding for users like Amazon and the NFL.
Robert Hodges is CEO of Altinity with experience in databases dating back to 1983. ClickHouse is the 20th DBMS he has worked on. His previous database company exited successfully to VMware.
1. Big Data and Beautiful Video: How ClickHouse enables Mux to Deliver Content at Scale
2. Presenter Bios
Adam Brown - Head of Technology and Architecture
Co-founder of Mux with extensive experience in video encoding and delivery going back to Zencoder
Robert Hodges - Altinity CEO
30+ years on DBMS plus virtualization and security. ClickHouse is DBMS #20
3. Company Intros
www.altinity.com
Leading software and services provider for ClickHouse
Major committer and community sponsor in US and Western Europe
mux.com
Mux Video is an API-first platform, powered by data and designed by video experts to make beautiful video possible for every development team.
17. Introducing the MergeTree table engine
CREATE TABLE ontime (
    Year UInt16,
    Quarter UInt8,
    Month UInt8,
    ...
) ENGINE = MergeTree()               -- Table engine type
PARTITION BY toYYYYMM(FlightDate)    -- How to break data into parts
ORDER BY (Carrier, FlightDate)       -- How to index and sort data in each part
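The roles of PARTITION BY and ORDER BY on this slide can be modeled in a few lines of Python. This is only a sketch of the idea, not ClickHouse's storage code: the sample rows and the helper to_yyyymm are hypothetical, and it assumes one part per partition, whereas a real partition can hold several parts.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows: (FlightDate, Carrier)
rows = [
    (date(2017, 7, 3), "UA"),
    (date(2017, 8, 1), "AA"),
    (date(2017, 7, 1), "EV"),
    (date(2017, 7, 2), "AA"),
]


def to_yyyymm(d):
    """Mimic toYYYYMM(FlightDate): 2017-07-03 becomes 201707."""
    return d.year * 100 + d.month


# PARTITION BY toYYYYMM(FlightDate): rows with the same key are stored together.
partitions = defaultdict(list)
for flight_date, carrier in rows:
    partitions[to_yyyymm(flight_date)].append((flight_date, carrier))

# ORDER BY (Carrier, FlightDate): rows are kept sorted inside each part.
for key in partitions:
    partitions[key].sort(key=lambda r: (r[1], r[0]))

for key in sorted(partitions):
    print(key, [(c, d.isoformat()) for d, c in partitions[key]])
```

A query that filters on FlightDate can ignore whole partitions, and a filter on Carrier can use the sort order inside each part.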
18. MergeTree layout within a single part
[Diagram: part directory 20170701_20170731_355_355_2/ under /var/lib/clickhouse/data/airline/ontime_reordered. primary.idx holds index entries on (FlightDate, Carrier...): 2017-01-01 AA, 2017-01-01 EV, 2018-01-01 UA, 2018-01-02 AA, ... Each column (ActualElapsedTime, Airline, AirlineID...) has a .mrk2 file and a .bin file. Labels: Granule, Compressed Block, Mark.]
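How a sparse primary index narrows a lookup to a few granules can be sketched as follows. This is an illustration under assumptions, not the actual file format: the integer keys, the tiny GRANULE_ROWS value, and the function candidate_granules are hypothetical (ClickHouse's default index_granularity is 8192 rows).

```python
import bisect

GRANULE_ROWS = 4  # tiny for illustration only

# Hypothetical sorted key column (stand-in for the ORDER BY key).
keys = [1, 1, 2, 3, 3, 3, 4, 5, 6, 7, 8, 9]

# The sparse index stores only the first key of every granule.
marks = keys[::GRANULE_ROWS]  # [1, 3, 6]


def candidate_granules(marks, target):
    """Return granule numbers that might contain target.

    Granule i covers keys from marks[i] up to marks[i + 1] inclusive,
    so the granule just before the first mark >= target must be kept too.
    """
    lo = max(bisect.bisect_left(marks, target) - 1, 0)
    hi = bisect.bisect_right(marks, target)
    return list(range(lo, hi))


for target in (3, 7, 0):
    print(target, candidate_granules(marks, target))
```

Only the candidate granules need to be read and decompressed from the .bin files; the marks tell the reader where each granule starts.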
19. Compression and codecs are configurable
CREATE TABLE test_codecs (
    a_lz4 String CODEC(LZ4),
    a_zstd String DEFAULT a_lz4 CODEC(ZSTD),
    a_lc_lz4 LowCardinality(String) DEFAULT a_lz4 CODEC(LZ4),
    a_lc_zstd LowCardinality(String) DEFAULT a_lz4 CODEC(ZSTD)
)
ENGINE = MergeTree
PARTITION BY tuple() ORDER BY tuple();
20. Effect on storage size is dramatic
[Chart: storage size by column and codec. Values shown: 20.84%, 12.28%, 10.61%, 10.65%, 7.89%]
21. How do you get those nice numbers?
SELECT name AS col,
    sum(data_uncompressed_bytes) AS uncompressed,
    sum(data_compressed_bytes) AS compressed,
    round((compressed / uncompressed) * 100., 2) AS pct,
    bar(pct, 0., 100., 20) AS bar_graph
FROM system.columns
WHERE (database = currentDatabase()) AND (table = 'test_codecs')
GROUP BY name ORDER BY pct DESC, name ASC
┌─col───────┬─uncompressed─┬─compressed─┬────pct─┬─bar_graph────────────┐
│ a_lc_lz4  │    200446439 │  201154656 │ 100.35 │ ████████████████████ │
│ a_lc_zstd │    200446439 │  148996404 │  74.33 │ ██████████████▋      │
│ a_lz4     │   1889003550 │  393713202 │  20.84 │ ████                 │
│ a_zstd    │   1889003550 │  231975292 │  12.28 │ ██▍                  │
└───────────┴──────────────┴────────────┴────────┴──────────────────────┘
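The pct column is plain arithmetic on the two byte counts. The sketch below repeats the calculation in Python using the figures from the query output; the dictionary and the function name pct are just illustrative.

```python
# (uncompressed_bytes, compressed_bytes) copied from the query output.
sizes = {
    "a_lc_lz4": (200446439, 201154656),
    "a_lc_zstd": (200446439, 148996404),
    "a_lz4": (1889003550, 393713202),
    "a_zstd": (1889003550, 231975292),
}


def pct(uncompressed, compressed):
    """Compressed size as a percentage of uncompressed size, as in the query."""
    return round(compressed / uncompressed * 100.0, 2)


report = {col: pct(u, c) for col, (u, c) in sizes.items()}

# LowCardinality + ZSTD measured against the plain String column's raw size.
lc_zstd_vs_plain = pct(sizes["a_lz4"][0], sizes["a_lc_zstd"][1])

for col, value in sorted(report.items(), key=lambda kv: -kv[1]):
    print(col, value)
print("a_lc_zstd vs plain String:", lc_zstd_vs_plain)
```

Note that the LowCardinality columns report a high pct because their uncompressed size is already about a tenth of the plain String column. Measured against the plain column's raw size, a_lc_zstd comes to roughly 7.89%, which appears to correspond to the smallest figure on slide 20.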
22. Materialized views restructure/reduce data
[Diagram: Ingest -> cpu (Table; all CPU measurements) -> cpu_last_point_mv (Materialized View; trigger) -> cpu_last_point_agg (SummingMergeTree; last measurement by CPU; Compressed: ~0.0009%, Uncompressed: ~0.002%)]
CREATE MATERIALIZED VIEW cpu_last_point_mv
TO cpu_last_point_agg
AS SELECT
    cpu_id,
    maxState(created_at) AS max_created_at,
    . . .
    argMaxState(usage_idle, created_at) AS usage_idle
FROM cpu GROUP BY cpu_id
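The idea behind maxState and argMaxState is that each inserted block produces a small partial state per cpu_id, and partial states can be merged later without rereading the source rows. A minimal Python model of that behavior, with hypothetical sample data and function names (it does not reproduce ClickHouse's binary state format):

```python
def arg_max_state(rows):
    """Build a partial state per cpu_id: (max created_at, usage_idle at that time)."""
    state = {}
    for cpu_id, created_at, usage_idle in rows:
        if cpu_id not in state or created_at > state[cpu_id][0]:
            state[cpu_id] = (created_at, usage_idle)
    return state


def merge_states(a, b):
    """Merge two partial states, keeping the newest measurement per cpu_id."""
    merged = dict(a)
    for cpu_id, (created_at, usage_idle) in b.items():
        if cpu_id not in merged or created_at > merged[cpu_id][0]:
            merged[cpu_id] = (created_at, usage_idle)
    return merged


# Two hypothetical insert blocks: (cpu_id, created_at, usage_idle)
block1 = [(0, 100, 91.5), (1, 100, 80.0), (0, 110, 93.0)]
block2 = [(1, 120, 75.5), (0, 105, 50.0)]

# The view fires once per inserted block; the target table merges the states.
last_point = merge_states(arg_max_state(block1), arg_max_state(block2))
print(last_point)
```

The target table ends up with one row per CPU regardless of how many measurements were ingested, which is why it is so much smaller than the source table.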
23. Pattern: TTLs + downsampled views
CREATE TABLE web_visits (
    time DateTime CODEC(DoubleDelta, ZSTD),
    session_id UInt64,
    visitor_id UInt64, . . .
) ENGINE = MergeTree
PARTITION BY toDate(time)
ORDER BY (session_id, time)
TTL time + INTERVAL 7 DAY
[Diagram: Ingest -> web_visits (ephemeral source data) -> hourly_uniq_visits, hourly_sessions (long-term aggregates)]
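The pattern on this slide, where raw rows expire after 7 days but the hourly aggregates remain, can be sketched like this. The sample visits and the fixed value of now are hypothetical, and the exact Python set stands in for whatever aggregate state the real views keep.

```python
from datetime import datetime, timedelta

TTL = timedelta(days=7)  # mirrors: TTL time + INTERVAL 7 DAY

# Hypothetical raw rows: (time, visitor_id)
visits = [
    (datetime(2020, 1, 1, 10, 5), 1),
    (datetime(2020, 1, 1, 10, 40), 2),
    (datetime(2020, 1, 1, 10, 50), 1),
    (datetime(2020, 1, 9, 9, 0), 3),
]

# Downsampled view: unique visitors per hour, filled at insert time.
hourly_uniq = {}
for t, visitor in visits:
    hour = t.replace(minute=0, second=0, microsecond=0)
    hourly_uniq.setdefault(hour, set()).add(visitor)

# Later, the TTL removes raw rows older than 7 days.
now = datetime(2020, 1, 10)
live_visits = [(t, v) for t, v in visits if t + TTL > now]

print(len(live_visits), "raw rows left")
print({h.isoformat(): len(s) for h, s in hourly_uniq.items()})
```

After the TTL runs, only one raw row is left, yet the hourly counts for 1 January are still available from the aggregate.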
24. Skip indexes cut down on I/O
SET allow_experimental_data_skipping_indices=1;
ALTER TABLE ontime ADD INDEX
    dest_name Dest TYPE ngrambf_v1(3, 512, 2, 0) GRANULARITY 1
ALTER TABLE ontime ADD INDEX
    cname Carrier TYPE set(0) GRANULARITY 1
OPTIMIZE TABLE ontime FINAL
-- OR, in current releases
ALTER TABLE ontime
    UPDATE Dest=Dest, Carrier=Carrier
    WHERE 1=1
(Callout: Default value)
25. Effectiveness depends on data distribution
SELECT
    Year, count(*) AS flights,
    sum(Cancelled) / flights AS cancelled,
    sum(DepDel15) / flights AS delayed_15
FROM airline.ontime WHERE [Column] = [Value] GROUP BY Year
Column   Value  Index       Count       Rows Processed  Query Response
Dest     PPG    ngrambf_v1  525         4.30M           0.053
Dest     ATL    ngrambf_v1  9,360,581   166.81M         0.622
Carrier  ML     set         70,622      3.39M           0.090
Carrier  WN     set         25,918,402  166.24M         0.566
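Why a rare value benefits from a skip index while a common one does not can be shown with a toy model of a set index. Everything here is a hypothetical simplification: the granule size, the column contents, and the function names are made up, and the real index stores its data per block of granules as set by GRANULARITY.

```python
GRANULE_ROWS = 3  # tiny for illustration only


def build_set_index(values):
    """One set of distinct values per granule, like a set-type skip index."""
    return [
        set(values[i:i + GRANULE_ROWS])
        for i in range(0, len(values), GRANULE_ROWS)
    ]


def granules_to_read(index, value):
    """Granules whose set contains the value; all others are skipped."""
    return [i for i, members in enumerate(index) if value in members]


# Hypothetical Carrier column: 'ML' is rare, 'WN' is everywhere.
carrier = ["WN", "WN", "WN", "WN", "ML", "WN", "WN", "WN", "WN"]
index = build_set_index(carrier)

print("ML ->", granules_to_read(index, "ML"))
print("WN ->", granules_to_read(index, "WN"))
```

The rare value touches a single granule, while the common value appears in every granule, so the index cannot skip anything for it. That matches the shape of the table: PPG and ML process a few million rows, ATL and WN process nearly the whole table.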
26. More table engines: CollapsingMergeTree
CREATE TABLE collapse (user_id UInt64, views UInt64, sign Int8)
ENGINE = CollapsingMergeTree(sign)
PARTITION BY tuple() ORDER BY user_id;
INSERT INTO collapse VALUES (32, 55, 1);
INSERT INTO collapse VALUES (32, 55, -1);
INSERT INTO collapse VALUES (32, 98, 1);
SELECT * FROM collapse FINAL
┌─user_id─┬─views─┬─sign─┐
│      32 │    98 │    1 │
└─────────┴───────┴──────┘
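The result above follows from the collapsing rules as described in the ClickHouse documentation for CollapsingMergeTree. The sketch below applies those rules per sorting key in Python; the function collapse is a hypothetical illustration, not the engine's merge code.

```python
def collapse(rows):
    """Collapse rows of (user_id, views, sign), given in insertion order.

    Per sorting key (user_id here):
      more state rows (+1) than cancel rows (-1) -> keep the last state row
      more cancel rows than state rows           -> keep the first cancel row
      equal counts and the last row is a state   -> keep first cancel + last state
      otherwise                                  -> keep nothing
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)

    result = []
    for key in sorted(groups):
        group = groups[key]
        states = [r for r in group if r[2] == 1]
        cancels = [r for r in group if r[2] == -1]
        if len(states) > len(cancels):
            result.append(states[-1])
        elif len(cancels) > len(states):
            result.append(cancels[0])
        elif states and group[-1][2] == 1:
            result.extend([cancels[0], states[-1]])
    return result


inserted = [(32, 55, 1), (32, 55, -1), (32, 98, 1)]
print(collapse(inserted))
```

The first two rows cancel each other, leaving only the row with 98 views, as in the SELECT ... FINAL output.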
37. Updates
CREATE TABLE views (
    view_id UUID,
    customer_id UUID,
    user_id UUID,
    view_time DateTime,
    ...
    rebuffer_count UInt64,
    -- other metrics
    operating_system Nullable(String),
    -- other filter dimensions
)
ENGINE = ReplicatedCollapsingMergeTree(..., Sign)
PARTITION BY toYYYYMMDD(view_time)
ORDER BY (customer_id, view_time, view_id)
● “Resumed” views
● Low percentage of rows get updated
● FINAL keyword sometimes
○ Yes when listing views
○ No on larger metrics queries
○ Metrics workarounds
■ SUM(metric * Sign) / SUM(Sign)
● Nightly OPTIMIZE
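The SUM(metric * Sign) / SUM(Sign) workaround gives the correct average even when cancelled rows have not been merged away yet, because each cancel row subtracts exactly what its matching state row added. A small worked example with hypothetical rebuffer counts:

```python
# Hypothetical uncollapsed rows: (rebuffer_count, Sign)
rows = [
    (2, 1),   # original row for a view
    (2, -1),  # cancel row written when the view was "resumed"
    (5, 1),   # updated row for the same view
    (1, 1),   # a different view, never updated
]

# Naive average counts the stale and cancel rows as if they were real views.
naive_avg = sum(metric for metric, _ in rows) / len(rows)

# Sign-weighted average: stale rows cancel out in numerator and denominator.
weighted_avg = (
    sum(metric * sign for metric, sign in rows)
    / sum(sign for _, sign in rows)
)

print(naive_avg, weighted_avg)
```

The weighted result equals the average over the two live rows (5 and 1), without needing FINAL on a large metrics query.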
38. Record Lookup
SELECT * FROM video_views
WHERE
    view_time IN (
        SELECT view_time FROM user_id_index
        WHERE customer_id=0 AND user_id='mux'
    )
    AND customer_id=0 AND user_id='mux'
CREATE MATERIALIZED VIEW video_views_index_user_id
ON CLUSTER metrics
ENGINE = ReplicatedCollapsingMergeTree(..., Sign)
PARTITION BY toYYYYMMDD(view_end)
PRIMARY KEY (property_id, user_id)
ORDER BY (property_id, user_id, view_end)
POPULATE AS
SELECT
    property_id,
    user_id,
    view_end,
    Sign
FROM video_views
https://mux.com/blog/from-russia-with-love-how-clickhouse-saved-our-data/
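The record lookup on this slide is a two-step pattern: a small table sorted by user returns the view times, and those times are then used to filter the main table along its own sort key. A Python sketch of the pattern, with hypothetical rows and names (a plain list scan cannot show the I/O savings, only the logic):

```python
# Hypothetical main table rows: (customer_id, view_time, view_id, user_id)
video_views = [
    (0, 100, "a", "mux"),
    (0, 101, "b", "other"),
    (0, 102, "c", "mux"),
    (1, 100, "d", "mux"),
]

# Index "view": (customer_id, user_id) -> set of view times.
user_index = {}
for customer_id, view_time, _view_id, user_id in video_views:
    user_index.setdefault((customer_id, user_id), set()).add(view_time)


def lookup(customer_id, user_id):
    """Step 1: fetch view times from the index. Step 2: filter the main table."""
    times = user_index.get((customer_id, user_id), set())
    return [
        row for row in video_views
        if row[0] == customer_id and row[1] in times and row[3] == user_id
    ]


print(lookup(0, "mux"))
```

In ClickHouse the filter on customer_id and view_time lines up with the main table's sort key, so only the matching ranges have to be read; the final user_id condition removes other users' views that share a timestamp.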
43. Takeaways
● ClickHouse performance feels like magic
● Operational simplicity, especially around scaling
● ClickHouse has become our default for statistical data
44. Thank you!
We are both hiring!
Mux: https://mux.com
Altinity: https://www.altinity.com