This talk was given by Xiangdong Huang, a PPMC member of the Apache IoTDB (incubating) project, at the Apache Event at Tsinghua University in China.
About the Event:
The open source ecosystem plays an increasingly important role in the world. Open source software is widely used in operating systems, cloud computing, big data, artificial intelligence, and the industrial Internet. Many companies have gradually increased their participation in the open source community, and developers with open source experience are increasingly valued by large enterprises. The Apache Software Foundation is one of the most important open source communities, contributing a large amount of valuable open source software to the world.
The invited guests of this lecture are all from the ASF community, including the chairman of the Apache Software Foundation, three Apache members, top-5 Apache code committers (according to the Apache annual report), the first committer on the Hadoop project in China, several Apache project mentors or VPs, and many Apache committers. They will explain what open source culture is, how to join the Apache open source community, and what the Apache Way is.
From a Student to an Apache Committer: Practice of Apache IoTDB
1. Apache Open Source Practice of University Students, as Seen from Apache IoTDB
Developing Apache IoTDB:
Practice Experience from Young Students
Xiangdong Huang
Tsinghua University, Beijing, China
2019.11.09
4. Who am I
• Xiangdong Huang (sainthxd@gmail.com)
• Was a PhD student and PostDoc at Tsinghua University
• One of the initial committers of Apache IoTDB (incubating)
6. The Start
• The story that follows started when I knocked on the door of my supervisor’s office in 2011…
My supervisor
(Jianmin Wang)
me
7. The Start
My supervisor (Jianmin Wang): "Xiangdong, why do you want to be a PhD student at the School of Software?"
Me: "I want to develop something that is used by millions of people!"
Come on!
Do some cool software that can be used by many, many people.
9. As an Individual Developer
• Wrote a lot of small "tools"
• But no maintenance
• Just for fun/self-use
10. Developer as a Student
• Many courses
• Do not need to write too much code (in some homework).
• Good for improving skills, but hard to get full marks (because some assignments are really hard!).
Data Mining Modern Database
100 lines? innovation
11. Developer as a Student
The figure is from the Internet… (the image is unrelated to the text)
Homework magic
weapons:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done
12. Developer as a Student
To use the
demo, we can
Step 1, click..
Step 2, click..
…
student
reviews
13. Developer as a Student
What if I click
here first.
14. Developer as a Student
STOP!
YOU
CANNOT!
15. We are writing demo and demo and demo…
• Complex project management?
• Makefile? POM? Gradle?
• Agile? Scrum? Sprint?
• CI? CD?
A POM file example from Apache PLC4X
16. At the same time, Big Data + Apache ..
• Hadoop
• HBase
• Cassandra
Please
implement some
functions
Ah, Hadoop + Hive
can do that!
Let me download it
17. At the same time, Big Data + Apache ..
• ~200 k lines of code
Oops, an
exception!
18. At the same time, Big Data + Apache ..
• 2.2.0, 2.2.1, …2.2.5;
Why can Cassandra release updates so frequently?
19. At the same time, Big Data + Apache ..
• Patch
Wow, someone shared a patch file to fix a bug!
Yes, you are growing! You now know JIRA, etc.
20. • When can I stop writing demos and build real software like Apache Cassandra, Hadoop, etc.?
22. A New Hope
• Be active in an existing open source community
• Hadoop, Cassandra, Spark etc..
• Be active in a new open source community
• IoTDB etc..
24. A good DB can improve the whole process
(Diagram labels: save data locally, Network, MQ, Database with insertion and query, Network, analysis)
25. And no good software
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries
  • Lacks time filtering
  • Lacks value filtering
  • Lacks multiple time series alignment
TSDB
• LSM based
  • Efficient file structure
  • More query functions
  • Not optimized for some application scenarios
• Based on PG
  • Auto sharding
  • Query optimization
  • Performance degrades sharply after writing data for a long time
• HBase/Cassandra based
  • Partition by TS-UID and time range
  • Storage inefficiency
  • Limited queries
27. 1. Teamwork
• Git with a team of 10+ people
• Commit log
• Conflict, merge, squash…
• Branches…(dev, release, stable…)
Grow your software to >= 100K lines.
28. 2. Learn skills
• Project structure
Make your software powerful.
29. 3. Stability/Agile
• CI/CD
• Jenkins, Travis CI
Make your software truly usable.
30. 4. Open your mind
• Issue -> PR -> Release
Open your minds.
Improve your communication skills.
31. 5. Research and Project
• User requirements -> Implementation -> IoTDB -> User
• Idea -> Implementation -> IoTDB -> Evaluation -> Paper -> User
• Paper -> Implementation -> IoTDB -> Evaluation -> User
32. OK….
• Past
• I can write a demo
• I like to write something
• I like to write something used by myself
• Now
• I/We know how to write complex software
• I/We know how to write software used by people
33. Do it ourselves
• Know a lot about how Apache projects are developed!
• How is the website of an Apache project built?
• Who can be a committer of an Apache project?
• How are projects released?
• Who decides the new features of an Apache project?
• Etc..
34. Time Series DB for Industrial Internet
“清华数为” time series database --> Apache IoTDB (incubating)
• Apache IoTDB (incubating) is a highly efficient database for managing time series data, especially in Industrial Internet applications.
• A young community. Donated by Tsinghua University; entered the incubator on 2018-11-18.
• Devoted to building the best time series database (in the IoT area) in the world.
• Apache IoTDB v0.8.1 is released! v0.9.0 is coming!
36. Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• Is represented as a path that begins with root, like “root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
Cadillac XT5
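A minimal sketch of the "Time Series = Device + Measurement" idea, in Python. It assumes the last path segment is the measurement and everything before it is the device path. The helper name split_series_path is hypothetical and is not part of any IoTDB API.

```python
from typing import Tuple


def split_series_path(path: str) -> Tuple[str, str]:
    """Split a full time series path into (device path, measurement name).

    Assumption for illustration: the measurement is the last dot-separated
    segment, and the path must begin with "root".
    """
    segments = path.split(".")
    if segments[0] != "root" or len(segments) < 3:
        raise ValueError("expected a path like root.<device...>.<measurement>")
    device, _, measurement = path.rpartition(".")
    return device, measurement


device, measurement = split_series_path(
    "root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain"
)
print(device)       # root.Cadillac_XT5.USA.CA.7BTC409
print(measurement)  # fuelRemain
```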
38. SQL in IoTDB
Set storage group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) from root.ln.wf01.wt01;
SELECT count(status) from root.ln.wf01.wt01 group by(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
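To illustrate what a "group by time interval" query with count() computes, here is a rough Python sketch that counts points per fixed-width bucket. It is not IoTDB code. The function name count_by_interval is hypothetical, and treating the range as half-open [start, end) is an assumption, since the slide does not state the boundary semantics.

```python
from collections import Counter
from datetime import datetime, timedelta


def count_by_interval(points, start, end, step):
    """Count (timestamp, value) points per bucket of width `step`.

    Assumption: only points with start <= timestamp < end are counted.
    Buckets with no points are omitted from the result.
    """
    counts = Counter()
    for ts, _value in points:
        if start <= ts < end:
            # Floor the timestamp to the beginning of its bucket.
            bucket = start + step * ((ts - start) // step)
            counts[bucket] += 1
    return dict(sorted(counts.items()))


start = datetime(2017, 11, 3, 0, 0, 0)
end = datetime(2017, 11, 3, 23, 0, 0)
points = [
    (datetime(2017, 11, 3, 0, 10), True),
    (datetime(2017, 11, 3, 0, 50), False),
    (datetime(2017, 11, 3, 1, 5), True),
    (datetime(2017, 11, 3, 23, 30), True),  # outside the queried range
]
for bucket, count in count_by_interval(points, start, end, timedelta(hours=1)).items():
    print(bucket.isoformat(), count)
```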
39. Supported data type
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management
41. TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info
- Natively Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
42. TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• Regular series: 128, 136, 144, 152, 160, …
  • 1st difference: 8, 8, 8, 8 (constant)
  • 2nd difference: 0, 0, 0 (1-bit storage needed!)
• Irregular series: 128, 135, 143, 154, 163, …
  • 1st difference: 7, 8, 11, 9 (not constant)
  • 2nd difference: 1, 3, -2 (2-bit storage needed!)
• Unified support of fixed-frequency or irregular-frequency time series
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with variable length encoding scheme
RLE encoding – repeated Int or Long
• For repeated sequence: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Compression Algorithm
• Snappy
• Gzip (TODO)
• LZO (TODO)
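The difference examples on this slide can be reproduced with a few lines of Python. This is only a sketch of the underlying idea (second-order differences, plus run-length encoding of repeated values), not the TsFile implementation. All function names are hypothetical, and no bit-level layout is modelled.

```python
from itertools import groupby


def differences(values):
    """First-order differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def second_differences(values):
    """Second-order differences (differences of the differences)."""
    return differences(differences(values))


def reconstruct(first_value, first_delta, second_deltas):
    """Rebuild the series from its first value, first delta and second deltas."""
    values = [first_value, first_value + first_delta]
    delta = first_delta
    for dd in second_deltas:
        delta += dd
        values.append(values[-1] + delta)
    return values


def run_length_encode(values):
    """Store each run of repeated values as a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]

print(differences(regular), second_differences(regular))
print(differences(irregular), second_differences(irregular))
print(run_length_encode(differences(regular)))
print(reconstruct(128, 7, second_differences(irregular)))
```

Because the series can be rebuilt exactly from the first value, the first delta and the second-order deltas, only those small numbers need to be stored.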
43. TsFile: Encoding and Compression
Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Stores second-order delta values and a boolean flag array.
44. Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
• Split the pattern and data stream into equal-length fragments
• Extract features to reduce the dimension
• Accelerate the search by using features
• Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
• Indexing data using Key-Value form
• Scenarios:
  • Outlier detection
  • Historical data analysis
  • …
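The pattern matching steps on this slide (equal-length fragments, feature extraction, feature-based filtering) could look roughly like the Python sketch below. The talk does not say which features are used, so the per-fragment mean, the point-wise tolerance test and the function names are all assumptions made for illustration. This is not the planned IoTDB operator.

```python
def fragment_means(series, fragment_len):
    """Feature extraction: the mean of each full, equal-length fragment."""
    return [
        sum(series[i:i + fragment_len]) / fragment_len
        for i in range(0, len(series) - fragment_len + 1, fragment_len)
    ]


def find_matches(stream, pattern, fragment_len, tolerance):
    """Return start offsets where every point is within `tolerance` of the pattern.

    The fragment means act as a cheap filter: if all points differ by at most
    `tolerance`, the fragment means also differ by at most `tolerance`, so
    skipping windows that fail the feature test never drops a true match.
    """
    pattern_features = fragment_means(pattern, fragment_len)
    matches = []
    for start in range(len(stream) - len(pattern) + 1):
        window = stream[start:start + len(pattern)]
        window_features = fragment_means(window, fragment_len)
        if any(abs(w - p) > tolerance
               for w, p in zip(window_features, pattern_features)):
            continue  # rejected using the low-dimensional features only
        if all(abs(w - p) <= tolerance for w, p in zip(window, pattern)):
            matches.append(start)
    return matches


stream = [7.0, 7.2, 7.1, 20.3, 20.1, 6.0, 6.1, 7.0]
pattern = [7.2, 7.1, 20.3, 20.1, 6.0, 6.1]
print(find_matches(stream, pattern, fragment_len=2, tolerance=0.5))
```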
45. From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on time series across multiple TsFiles, including CRUD and advanced queries such as max, min, avg and temporal alignment.
• Scene: embedded equipment, on-site industrial computers, general servers, etc.
3rd Systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scene: cloud data center
46. A Process to Manage Time Series Data
• Data source -> IoTDB (JDBC / Session API)
• Grafana-Adaptor: Visualization (manual data exploration)
• Spark-TsFile-Adaptor: Analysis with Big Data Framework (big data set)
• JDBC: Analysis with Matlab (small data set)
https://github.com/jixuan1989/iotdb-tutorial
47. Latest version v0.8 (0.9.0-snapshot)
Apache IoTDB-incubating v0.9.0-SNAPSHOT
Hardware: Xeon E5v4, 256G Mem, HDD Disk
Insertion
#Client | #Storage Group | #Device | #Measurement per Device | DataType | Encoding | Compression | BatchSize | #Point per Time Series
10 | 50 | 1000 | 100 | Float | RLE | Snappy | 100 | 100000
Query
#Client | #Storage Group | #Device | #Measurement per Device | DataType | Encoding | Compression | BatchSize | #Point per Time Series
50 | 1 | 1 | 10 | Float | RLE | Snappy | 100 | 100000000
49. Write Performance: points/s(single node)
Hardware: Xeon E5v4, 256G Mem, HDD Disk
Apache IoTDB-incubating v0.9.0-SNAPSHOT
* In this experiment, we do not use IoTDB’s JDBC API or SQL interface. Instead, we use a raw API, like Cassandra’s raw Thrift API.
50. Query Performance: aggregation count()
InfluxDB failed to return any answers in the 100,000,000 setting.
Hardware: Xeon E5v4, 256G Mem, HDD Disk
Apache IoTDB-incubating v0.9.0-SNAPSHOT
51. Shanghai METRO Monitoring
Before the upgrade:
• 144 trains
• 3200 points/500 ms/train
• 9 KairosDB + Cassandra
After the upgrade:
• 300 trains
• 3200 points/200 ms/train
• 14 KDB-compatible Restful services, just to avoid modifying current programs
• ONE IoTDB instance
• 414 billion data points per day, using just ONE IoTDB instance
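The 414 billion figure on this slide can be checked with simple arithmetic. The Python sketch below assumes every train reports continuously for the full 24 hours. The function name points_per_day is hypothetical.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def points_per_day(trains, points_per_sample, interval_ms):
    """Data points per day if every train reports `points_per_sample`
    points once every `interval_ms` milliseconds, around the clock."""
    samples_per_day = MS_PER_DAY // interval_ms
    return trains * points_per_sample * samples_per_day


# 300 trains, 3200 points every 200 ms.
print(points_per_day(300, 3200, 200))  # 414720000000, i.e. about 414 billion
# 144 trains, 3200 points every 500 ms.
print(points_per_day(144, 3200, 500))  # 79626240000, i.e. about 80 billion
```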