Developing Apache IoTDB:
Practice Experience from Young Students
(Chinese title, translated: Apache Open-Source Practice for University Students, as Seen Through Apache IoTDB)
Xiangdong Huang
Tsinghua University, Beijing, China
2019.11.09
Outline
• Who am I
• The Start
• Dream Disillusion
• A New Hope
Who am I
• Xiangdong Huang (sainthxd@gmail.com)
• Was a PhD student and postdoc at Tsinghua University
• One of the initial committers of Apache IoTDB (incubating)
The Start
• The story began in 2011, when I knocked on the door of my supervisor’s office…
(Figure: my supervisor, Jianmin Wang, and me)
Supervisor: “Xiangdong, why do you want to be a PhD student at the School of Software?”
Me: “I want to develop something that will be used by millions of people!”
“Come on! Build some cool software that can be used by many, many people.”
Dream Disillusion
As an Individual Developer
• Wrote a lot of small “tools”
• But never maintained them
• Just for fun/self-use
Developer as a Student
• Many courses
• No need to write much code (in some homework)..
• Good for improving skills, but hard to get full marks (because some assignments are really hard!).
(Course examples: Data Mining, 100 lines?; Modern Database, innovation)
Developer as a Student
(The figure is from the Internet and is unrelated to the text.)
Homework “magic weapons”:
- Bootstrap
- Django
- MySQL
A beautiful web DEMO is done.
Student: “To use the demo, we can: Step 1, click.. Step 2, click.. …”
Reviewer: “What if I click here first?”
Student: “STOP! YOU CANNOT!”
We keep writing demo after demo…
• Complex project management?
• Makefile? POM? Gradle?
• Agile? Scrum? Sprint?
• CI? CD?
(Figure: an example POM file from Apache PLC4X)
At the same time, Big Data + Apache…
• Hadoop
• HBase
• Cassandra
• ~200K lines of code
• 2.2.0, 2.2.1, … 2.2.5
• Patch
“Please implement some functions.”
“Ah, Hadoop + Hive can do that! Let me download it.”
“Oops, an exception!”
“Why can Cassandra release updates so frequently?”
“Wow, someone shared a patch file to fix a bug!”
Yes, you are growing! You now know JIRA, etc.
• When can I stop writing demos and build some nice software like Apache Cassandra, Hadoop, etc.?
A New Hope
• Be active in an existing open source community
• Hadoop, Cassandra, Spark etc..
• Be active in a new open source community
• IoTDB etc..
Time series data is everywhere
Wearable devices, autonomous driving
A good DB can improve the whole process
(Figure labels: save data locally, network, MQ, database, insertion, query, analysis)
But there was no good software
RDB
• Limited number of columns (1600 columns in a table)
• Limited number of rows (<=10M rows is better)
• Manual sharding
KVDB
• Supports big data
• Limited queries
  • Lacks time filtering
  • Lacks value filtering
  • Lacks alignment of multiple time series
TSDB
• LSM based
  • Efficient file structure
  • More query functions
  • Not optimized for some application scenarios
• Based on PG
  • Auto sharding
  • Query optimization
  • Performance degrades sharply after writing data for a long time
• HBase/Cassandra based
  • Partitioned by TS-UID and time range
  • Storage inefficiency
  • Limited queries
Do it ourselves
Supervisor: “Let’s develop a time series DB!”
Students: “Can we?”
Supervisor: “You can! And you can do it in an open source way.”
And then learn a lot…
1. Teamwork
• Git with a team of 10+ people
• Commit log
• Conflict, merge, squash…
• Branches… (dev, release, stable…)
Let your software grow to >= 100K lines.
2. Learn skills
• Everything above, plus:
• Project structure
Make your software powerful.
3. Stability/Agile
• Everything above, plus:
• CI/CD
• Jenkins, Travis CI
Make your software truly usable.
4. Open your mind
• Everything above, plus:
• Issue -> PR -> Release
Open your minds.
Improve your communication skills.
5. Research and Project
• User requirements -> Implementation -> IoTDB -> User
• Idea -> Implementation -> IoTDB -> Evaluation -> Paper -> User
• Paper -> Implementation -> IoTDB -> Evaluation -> User
OK….
• Past
• I can write a demo
• I like to write something
• I like to write something for my own use
• Now
• I/We know how to write complex software
• I/We know how to write software used by other people
Do it ourselves
• We learned a lot about how Apache projects are developed!
• How is the website of an Apache project built?
• Who can be a committer of an Apache project?
• How are projects released?
• Who decides the new features of an Apache project?
• Etc..
Time Series DB for Industrial Internet
“清华数为” time series database --> Apache IoTDB (incubating)
• Apache IoTDB (incubating) is a highly efficient database for managing time series data, especially in Industrial Internet applications.
• A young community. Donated by Tsinghua University; entered the Apache Incubator on 2018-11-18.
• Devoted to building the best time series database (in the IoT area) in the world.
• Apache IoTDB v0.8.1 is released! v0.9.0 is coming!
Developers and Users
Concepts in IoTDB (The Schema)
Device (i.e., Data source)
• A machine instance
Measurement (e.g., sensor)
• A device can have many measurements
Time Series
• Device + Measurement
• is represented as a path that begins with root, like
“root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain”
Storage Group (SG)
• A storage group can have many devices
• Storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
Cadillac XT5
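The path convention above can be sketched in a few lines of Python. This is only an illustration of the naming scheme, not IoTDB code: the helper names `split_series_path` and `belongs_to`, and the choice of `root.Cadillac_XT5` as the storage group, are assumptions made for the example.

```python
from typing import NamedTuple


class SeriesPath(NamedTuple):
    device: str       # everything before the last dot
    measurement: str  # the last path segment (e.g. a sensor name)


def split_series_path(path: str) -> SeriesPath:
    """Split a time series path into its device and measurement parts."""
    parts = path.split(".")
    if len(parts) < 3 or parts[0] != "root":
        raise ValueError(f"not a valid time series path: {path!r}")
    return SeriesPath(".".join(parts[:-1]), parts[-1])


def belongs_to(path: str, storage_group: str) -> bool:
    """True if the path lies under the given storage group prefix."""
    return path == storage_group or path.startswith(storage_group + ".")


series = split_series_path("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain")
print(series.device)       # root.Cadillac_XT5.USA.CA.7BTC409
print(series.measurement)  # fuelRemain
# Hypothetical storage group chosen for this example only.
print(belongs_to("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "root.Cadillac_XT5"))
```

One device therefore contributes one time series per measurement, and all series under the same storage group prefix share that group's resources.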
The schema mapping
root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain
root.Cadillac_XT5.USA.CA.7BTC409.speed
root.Cadillac_XT5.USA.NV.6BAC321.speed
Table Name: Cadillac_XT5
country  state  device name  timestamp  fuelRemain  speed
USA      CA     7BTC409      t1         5.0         120
USA      CA     7BTC409      t2         4.9         109
USA      NV     6BAC321      t1         NULL        50
USA      NV     6BAC321      t3         NULL        65
(Tags and fields in InfluxDB, KairosDB, OpenTSDB…; the table name is called a measurement in InfluxDB.)
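To make the mapping concrete, here is a minimal Python sketch that pivots (path, timestamp, value) points into table-like rows. It assumes every path has exactly the six segments used on this slide (root, table, country, state, device, measurement); the function name `to_rows` and the sample points are illustrative, not part of IoTDB.

```python
from collections import defaultdict


def to_rows(points):
    """Group (path, timestamp, value) points into rows keyed by
    (table, country, state, device, timestamp)."""
    rows = defaultdict(dict)
    for path, timestamp, value in points:
        # Assumed fixed layout: root.<table>.<country>.<state>.<device>.<measurement>
        _root, table, country, state, device, measurement = path.split(".")
        rows[(table, country, state, device, timestamp)][measurement] = value
    return dict(rows)


points = [
    ("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain", "t1", 5.0),
    ("root.Cadillac_XT5.USA.CA.7BTC409.speed", "t1", 120),
    ("root.Cadillac_XT5.USA.NV.6BAC321.speed", "t1", 50),
]
for key, columns in to_rows(points).items():
    # A measurement with no point at this timestamp shows up as NULL.
    print(key, columns.get("fuelRemain", "NULL"), columns.get("speed", "NULL"))
```

The path segments play the role of the tag columns, and the last segment selects the value column.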
SQL in IoTDB
Set Storage Group
SET STORAGE GROUP TO root.laptop.d1.s1;
Create Timeseries
CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE;
Insert Data
INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234);
Delete Data
DELETE FROM d1.s1 WHERE time < 1000;
Update Data
UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000;
Query Data (Filter, Aggregation, Group by time interval)
SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5;
SELECT count(status), max_value(temperature) from root.ln.wf01.wt01;
SELECT count(status) from root.ln.wf01.wt01 group by(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
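What a "group by time interval" count computes can be sketched in plain Python. This is not IoTDB's implementation and does not call any IoTDB API; the function name `count_group_by` and the half-open interval [start, end) are assumptions made for the illustration.

```python
from collections import Counter


def count_group_by(points, interval, start, end):
    """Count points per time bucket of width `interval` within [start, end).

    `points` is an iterable of (timestamp, value) pairs; timestamps,
    interval, start and end share one integer unit (seconds here).
    """
    counts = Counter()
    for timestamp, _value in points:
        if start <= timestamp < end:
            # Align the timestamp to the start of its bucket.
            bucket = start + (timestamp - start) // interval * interval
            counts[bucket] += 1
    return dict(sorted(counts.items()))


# Three points fall inside two one-hour buckets; the last one is out of range.
status = [(0, True), (10, True), (3600, False), (7300, True)]
print(count_group_by(status, interval=3600, start=0, end=7200))
```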
Supported data type
• Boolean
• Int
• Long
• Float
• Double
• String
• GPS (TODO) -> for trajectory data management
• Array (TODO) -> for unstructured data management
TsFile: Zip File Born for Time Series Data
Columnar Store
- Reduce Disk I/O
- Improve Compression
Compression & Encoding
- Improve Compression Greatly
- 15% Better than InfluxDB in Real Applications
Time-domain Statistics Info Natively
- Support Fast Query in
  - Time Domain
  - Value Domain
  - Freq Domain (TODO)
Detailed specification:
http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3
https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
TsFile: Encoding and Compression
TS2Diff encoding – Int or Long (timestamps)
• 128, 136, 144, 152, 160, …
  • 1st difference: 8, 8, 8, 8 (constant)
  • 2nd difference: 0, 0, 0 (1-bit storage needed!)
• 128, 135, 143, 154, 163, …
  • 1st difference: 7, 8, 11, 9 (not constant)
  • 2nd difference: 1, 3, -2 (2-bit storage needed!)
• Unified support of fixed-frequency and irregular-frequency time series
Adaptive Delta encoding – Int or Long (TODO)
• An adaptive enhancement of TS2Diff.
• See next page.
Gorilla encoding – Float or Double
• XOR consecutive data points
• Store with variable length encoding scheme
RLE encoding – repeated Int or Long
• For repeated sequence: store a value and its count
Bit-Packing encoding – Int or Long
• Store data in compact form
• Squeeze out wasteful bits
Compression Algorithm
• Snappy
• Gzip (TODO)
• LZO (TODO)
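The second-order differencing and run-length ideas on this slide can be illustrated with a short Python sketch. It shows only the arithmetic, under the assumption that the encoder keeps the first value, the first delta and the list of second-order deltas; it is not the TsFile on-disk format, and all function names are hypothetical.

```python
def second_order_deltas(values):
    """Return (first differences, second differences) of a sequence."""
    first = [b - a for a, b in zip(values, values[1:])]
    second = [b - a for a, b in zip(first, first[1:])]
    return first, second


def decode(first_value, first_delta, second):
    """Rebuild the sequence from its first value, first delta and 2nd differences."""
    values = [first_value, first_value + first_delta]
    delta = first_delta
    for dd in second:
        delta += dd
        values.append(values[-1] + delta)
    return values


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(run) for run in runs]


regular = [128, 136, 144, 152, 160]
irregular = [128, 135, 143, 154, 163]
print(second_order_deltas(regular))    # ([8, 8, 8, 8], [0, 0, 0])
print(second_order_deltas(irregular))  # ([7, 8, 11, 9], [1, 3, -2])
print(decode(128, 7, [1, 3, -2]))      # [128, 135, 143, 154, 163]
print(run_length_encode([0, 0, 0]))    # [(0, 3)]
```

For fixed-frequency timestamps the second differences are all zero, so they collapse into a single run, which is why such series need very few bits per point.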
TsFile: Encoding and Compression (continued)
Adaptive TS2Diff encoding – Int or Long (TODO)
• For time series with outliers or missing points
• Stores second-order delta values and a boolean flag array.
Time Series Specific Operations (TODO)
Pattern Matching for Streaming Time Series Data
• Split the pattern and data stream into equal length fragments
• Extract features to reduce the dimension
• Accelerate the search by using features
Scenario: fault alarm in real time
SELECT wind_3s FROM china.farm1.tb2
WHERE time > t1 AND time < t2
AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0)
Similarity Search of Sub-series
• Indexing data using Key-Value form
Scenarios:
• Outlier detection
• Historical data analysis
• …
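The three pattern-matching steps (fragment, extract features, use features to speed up the search) can be sketched as follows. The slide marks this feature as TODO and does not say which features are used, so this sketch assumes the simplest choice, the mean of each fragment, and Euclidean distance; the function names are hypothetical.

```python
def fragment_means(series, frag_len):
    """Split a series into equal-length fragments and return each fragment's mean."""
    return [
        sum(series[i:i + frag_len]) / frag_len
        for i in range(0, len(series) - frag_len + 1, frag_len)
    ]


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def find_matches(stream, pattern, frag_len, threshold):
    """Return start offsets where the stream is within `threshold` of the pattern."""
    n = len(pattern)
    pattern_features = fragment_means(pattern, frag_len)
    matches = []
    for start in range(len(stream) - n + 1):
        window = stream[start:start + n]
        # Cheap check on the reduced features first. Scaled by sqrt(frag_len),
        # the feature distance never exceeds the true distance, so windows
        # rejected here cannot be real matches.
        bound = euclidean(fragment_means(window, frag_len), pattern_features) * frag_len ** 0.5
        if bound > threshold:
            continue
        # Full comparison only for the surviving candidates.
        if euclidean(window, pattern) <= threshold:
            matches.append(start)
    return matches


stream = [0.0, 0.0, 7.2, 20.3, 6.0, 0.0, 0.0]
print(find_matches(stream, [7.2, 20.3, 6.0], frag_len=3, threshold=0.5))
```

In this toy run only the window starting at offset 2 survives the feature check, so the full distance is computed once instead of five times.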
From Edge to Cloud: Run IoTDB Everywhere
TsFile (a component of IoTDB): a zip file of time series
• Time series data files: high-performance writes, high compression ratio, support for simple queries. Simply put, TsFile is a zip file for time series data.
• Suitable for embedded devices, general servers, data centers, etc.
IoTDB: a database of time series
• Freely operate on the time series of multiple TsFiles, including CRUD and advanced queries such as max, min, avg and temporal alignment.
• Scenarios: embedded equipment, on-site industrial computers, general servers, etc.
3rd-party systems: a data warehouse of time series
• Easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning)
• Scenario: cloud data center
A Process to Manage Time Series Data
(Figure: a data source writes data through the JDBC / Session API. Downstream, the Grafana adaptor supports visualization (manual data exploration), the Spark-TsFile adaptor supports analysis with big data frameworks (big data sets), and JDBC supports analysis with Matlab (small data sets).)
Tutorial: https://github.com/jixuan1989/iotdb-tutorial
Latest version v0.8 (0.9.0-snapshot)
Test environment: Apache IoTDB-incubating v0.9.0-SNAPSHOT; Xeon E5v4, 256G Mem, HDD Disk
Parameter                  Insertion   Query
#Client                    10          50
#Storage Group             50          1
#Device                    1000        1
#Measurement per Device    100         10
DataType                   Float       Float
Encoding                   RLE         RLE
Compression                Snappy      Snappy
BatchSize                  100         100
#Point per Time Series     100000      100000000
Compression
Raw data:
- 12 bytes per point
- 112 GB in total
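The raw data size can be cross-checked with a few lines of arithmetic. The sketch below assumes that the compression test uses the insertion workload from the benchmark settings (1000 devices, 100 measurements per device, 100000 points per time series) and that "GB" is counted in units of 2^30 bytes; neither assumption is stated on the slide.

```python
# Insertion workload parameters (assumed to apply to the compression test).
devices = 1000
measurements_per_device = 100
points_per_series = 100_000
bytes_per_point = 12  # from the slide

total_points = devices * measurements_per_device * points_per_series
raw_bytes = total_points * bytes_per_point
raw_gib = raw_bytes / 2**30

print(f"{total_points:,} points")   # 10,000,000,000 points
print(f"{raw_gib:.1f} GiB raw")     # 111.8 GiB raw
```

Under these assumptions the result rounds to the 112 GB quoted above.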
Write Performance: points/s (single node)
* In this experiment, we do not use IoTDB’s JDBC API or SQL interface. Instead, we use a raw API, similar to Cassandra’s raw Thrift API.
Query Performance: aggregation count()
InfluxDB failed to return any answers in the 100,000,000 setting.
Shanghai METRO Monitoring
Before the upgrade:
• 144 trains
• 3200 points/500 ms/train
• 9 KairosDB + Cassandra
After the upgrade:
• 300 trains
• 3200 points/200 ms/train
• 414 billion data points per day
• Just using ONE IoTDB instance
• 14 KDB-compatible RESTful services, just to avoid modifying current programs
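The daily volume follows directly from the per-train rates. The sketch below reads "3200 points/200 ms/train" as each train emitting 3200 points every 200 ms, which is an interpretation of the slide's notation; the helper name `points_per_second` is made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_second(trains, points_per_batch, period_ms):
    """Aggregate ingest rate when every train sends one batch per period."""
    return trains * points_per_batch * 1000 // period_ms


before = points_per_second(trains=144, points_per_batch=3200, period_ms=500)
after = points_per_second(trains=300, points_per_batch=3200, period_ms=200)

print(f"before: {before:,} points/s")                      # before: 921,600 points/s
print(f"after:  {after:,} points/s")                       # after:  4,800,000 points/s
print(f"after:  {after * SECONDS_PER_DAY:,} points/day")   # after:  414,720,000,000 points/day
```

The second configuration works out to roughly 414 billion points per day, matching the figure on the slide.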
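Join Us
• Mail list:
• subscribe: dev-subscribe@iotdb.incubator.apache.org
• discussion: dev@iotdb.apache.org
• Both Chinese and English are welcome (English recommended)
• bug report: https://s.apache.org/iotdb-issues
• Both Chinese and English are welcome (English recommended)
• Website: https://iotdb.apache.org
(QR codes: DingTalk user group; official website)
Building the IoTDB community:
• We invite more developers, users and students to join the community and grow together
• One of the best choices for undergraduate graduation projects and graduate internships!
• Students and developers from outside Beijing are welcome (invited to attend at least one Beijing meetup)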

From a student to an apache committer practice of apache io tdb

  • 1. 从 Apache IoTDB 看高校学生的 Apache 开源实践 Developing Apache IoTDB: Practice Experience from Young Students Xiangdong Huang Tsinghua University, Beijing, China 2019.11.09
  • 2. Outline • Who am I • The Start • Dream Disillusion • A New Hope
  • 3. Outline • Who am I • The Start • Dream Disillusion • A New Hope
  • 4. Who am I • Xiangdong Huang (sainthxd@gmail.com) • Was a PhD student and PostDoc in Tsinghua University • One of the initial committers of Apache IoTDB (incubating)
  • 5. • Was a PhD student and PostDoc in Tsinghua University
  • 6. The Start • Was a PhD student and PostDoc in Tsinghua University • it was the start of the following story when I knocked the door of my supervisor’s office in 2011… My supervisor (Jianmin Wang) me My supervisor (Jianmin Wang) me
  • 7. The Start My supervisor (Jianmin Wang) me Xiangdong, Why do you want to be a PhD at School of Software? I want to develop something that be used by millions of people! Come on! Do some cool softwares that can be used by many many people.
  • 8. Outline • Who am I • The Start • Dream Disillusion • A New Hope
  • 9. As an Individual Developer • Write a lot small “tools“ • But no maintaining • Just for fun/self-use
  • 10. Developer as a Student • Many courses • Do not need to write to much codes (in some home works).. • Good for improve skill, and hard to get the full score (because some are really hard!). Data Mining Modern Database 100 lines? innovation
  • 11. Developer as a Student The figure is from the Internet… 图文无关。。。 Homework magic weapons: - Bootstrap - Django - MySQL A beautiful web DEMO is done
  • 12. Developer as a Student The figure is from the Internet… 图文无关。。。 Homework magic weapons: - Bootstrap - Django - MySQL A beautiful web DEMO is done To use the demo, we can Step 1, click.. Step 2, click.. … student reviews
  • 13. Developer as a Student The figure is from the Internet… 图文无关。。。 Homework magic weapons: - Bootstrap - Django - MySQL A beautiful web DEMO is done To use the demo, we can Step 1, click.. Step 2, click.. … What if I click here first.
  • 14. Developer as a Student The figure is from the Internet… 图文无关。。。 Homework magic weapons: - Bootstrap - Django - MySQL A beautiful web DEMO is done To use the demo, we can Step 1, click.. Step 2, click.. … STOP! YOU CANNOT! What if I click here first.
  • 15. We are writing demo and demo and demo… • Complex project management? • Makefile? POM? Gradle? • Agile? Scrum? Sprint? • CI? CD? A pom file example From Apache PLC4x
  • 16. At the same time, Big Data + Apache .. • Hadoop • HBase • Cassandra Please implement some functions Ah, Hadoop + Hive can do that! Let me download it
  • 17. At the same time, Big Data + Apache .. • Hadoop • HBase • Cassandra • ~200 k lines of codes Please implement some functions Ah, Hadoop + Hive can do that! Let me download it Oops, an exception!
  • 18. At the same time, Big Data + Apache .. • Hadoop • HBase • Cassandra • ~200 k lines of codes • 2.2.0, 2.2.1, …2.2.5; Please implement some functions Ah, Hadoop + Hive can do that! Let me download it Oops, an exception! Why Cassandra can update so frequent?
  • 19. At the same time, Big Data + Apache .. • Hadoop • HBase • Cassandra • ~200 k lines of codes • 2.2.0, 2.2.1, …2.2.5; • Patch Please implement some functions Ah, Hadoop + Hive can do that! Let me download it Oops, an exception! Why Cassandra can update so frequent? Wow, someone share a patch file to fix a bug! Yes, you are growing! You have known JIRA, etc..
  • 20. • When can I get rid of writing demo, and do some nice software like Apache Cassandra, Hadoop, etc..
  • 21. Outline • Who am I • The Start • Dream Disillusion • A New Hope
  • 22. A New Hope • Be active in an existing open source community • Hadoop, Cassandra, Spark etc.. • Be active in a new open source community • IoTDB etc..
  • 23. Time series data is everywhere 穿戴设备无人驾驶
  • 24. A good DB can improve the whole process Network MQ Database queryinsertion save data locally Network analysis
  • 25. And no good software RDB KVDB LSM based •Efficient file structure •More query functions Not optimize for some application scenarios TSDB Limited number of columns 1600 Columns in a table Limited number of rows <=10M rows is better Manual Sharding • Support big data • Limited Queries • Lack time filtering • Lack value filtering • Lack multiple time series alignment Based on PG •Auto sharding •Query optimization Performance degrades sharply after writing data for a long time Hbase/Cassandra based •Partition by TS-UID and time range • Storage inefficiency • Limit of queries
  • 26. Do it ourselves supervisor students Let’s develop a time series DB! Can we? You can! And you can do it in an open source way. And then learn a lot…
  • 27. 1. Teamwork • Git with a team of 10+ people • Commit log • Conflict, merge, squash… • Branches… (dev, release, stable…) • Let your software grow to >= 100K lines.
  • 28. 2. Learn skills • Git with a team of 10+ people • Conflict, merge, squash… • Branches… (dev, release, stable…) • Project structure • Make your software powerful.
  • 29. 3. Stability/Agile • Git with a team of 10+ people • Conflict, merge, squash… • Branches… (dev, release, stable…) • Project structure • CI/CD • Jenkins, Travis CI • Make your software something people can really use.
  • 30. 4. Open your mind • Git with a team of 10+ people • Conflict, merge, squash… • Branches… (dev, release, stable…) • Project structure • CI/CD • Jenkins, Travis CI • Issue -> PR -> Release • Open your mind. Improve your communication skills.
  • 31. 5. Research and Project • User requirements -> Implementation -> IoTDB -> User • Idea -> Implementation -> IoTDB -> Evaluation -> Paper -> User • Paper -> Implementation -> IoTDB -> Evaluation -> User
  • 32. OK… • Past • I can write a demo • I like to write something • I like to write something that I use myself • Now • I/We know how to write complex software • I/We know how to write software that other people use
  • 33. Do it ourselves • Learn a lot about how Apache projects are developed! • How is the website of an Apache project built? • Who can become a committer of an Apache project? • How are projects released? • Who decides the new features of an Apache project? • Etc.
  • 34. Time Series DB for Industrial Internet • “清华数为” (Tsinghua Shuwei) time series database --> Apache IoTDB (incubating) • Apache IoTDB (incubating) is a highly efficient database for managing time series data, especially in Industrial Internet applications. • A young community. Donated by Tsinghua University; entered the incubator on 2018-11-18. • Devoted to building the best time series database (in the IoT area) in the world. • Apache IoTDB v0.8.1 is released! v0.9.0 is coming!
  • 36. Concepts in IoTDB (The Schema) • Device (i.e., data source): a machine instance, e.g., a Cadillac XT5 • Measurement (e.g., sensor): a device can have many measurements • Time Series: Device + Measurement; represented as a path that begins with root, like “root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain” • Storage Group (SG): a storage group can have many devices; storage groups have independent resources (threads and files) to increase parallelism and reduce lock contention.
  • 37. The schema mapping • Paths: root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain, root.Cadillac_XT5.USA.CA.7BTC409.speed, root.Cadillac_XT5.USA.NV.6BAC321.speed • Equivalent table (table name: Cadillac_XT5, called a Measurement in InfluxDB; the columns correspond to Tags and Fields in InfluxDB, KairosDB, OpenTSDB…):
      country | state | device name | timestamp | fuelRemain | speed
      USA     | CA    | 7BTC409     | t1        | 5.0        | 120
      USA     | CA    | 7BTC409     | t2        | 4.9        | 109
      USA     | NV    | 6BAC321     | t1        | NULL       | 50
      USA     | NV    | 6BAC321     | t3        | NULL       | 65
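The mapping on this slide can be sketched in a few lines of Python. This is only an illustration, not IoTDB code: the function name `path_to_columns` and the fixed six-level layout (root, table, country, state, device, measurement) are assumptions taken from the example paths on the slide.

```python
def path_to_columns(path: str) -> dict:
    """Split a series path from the slide into relational-style columns.

    Assumes the six-level layout used in the slide's examples:
    root.<table>.<country>.<state>.<device>.<measurement>
    """
    parts = path.split(".")
    if len(parts) != 6 or parts[0] != "root":
        raise ValueError(f"unexpected path layout: {path}")
    _, table, country, state, device, measurement = parts
    return {
        "table": table,              # "Measurement" in InfluxDB terms
        "country": country,          # tag-like column
        "state": state,              # tag-like column
        "device": device,            # tag-like column
        "measurement": measurement,  # field-like column
    }


if __name__ == "__main__":
    row = path_to_columns("root.Cadillac_XT5.USA.CA.7BTC409.fuelRemain")
    print(row)
```

Real IoTDB paths are not limited to six levels; the fixed layout here only mirrors the slide's example.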
  • 38. SQL in IoTDB • Set storage group: SET STORAGE GROUP TO root.laptop.d1.s1; • Create time series: CREATE TIMESERIES root.laptop.d1.s1 WITH DATATYPE=INT32, ENCODING=RLE; • Insert data: INSERT INTO (d1.s1,d1.s2,time) VALUES (1000,2000,14735235234); • Delete data: DELETE FROM d1.s1 WHERE time < 1000; • Update data: UPDATE d1.s1 SET VALUE = 2000 WHERE time < 2000 and time > 1000; • Query data (filter, aggregation, group by time interval): SELECT d1.s1,d2.* FROM BJ.WF1 WHERE d1.s1 < 2000 and d2.s2 > 1000 and freq(d2.s3) > 0.5; SELECT count(status), max_value(temperature) from root.ln.wf01.wt01; SELECT count(status) from root.ln.wf01.wt01 group by(1h, [2017-11-03T00:00:00, 2017-11-03T23:00:00]);
  • 39. Supported data type • Boolean • Int • Long • Float • Double • String • GPS (TODO) -> for trajectory data management • Array (TODO) -> for unstructured data management
  • 41. TsFile: Zip File Born for Time Series Data • Columnar store: reduces disk I/O; improves compression • Compression & encoding: improves compression greatly; 15% better than InfluxDB in real applications • Native time-domain statistics info: supports fast queries in the time domain, value domain, and frequency domain (TODO) • Detailed specification: http://iotdb.apache.org/#/Documents/0.8.0/chap7/sec3 https://cwiki.apache.org/confluence/display/IOTDB/TsFile+Format
  • 42. TsFile: Encoding and Compression • TS2Diff encoding (Int or Long, timestamps): unified support of fixed-frequency and irregular-frequency time series. Example 1: 128, 136, 144, 152, 160, … has 1st differences 8, 8, 8, 8 (constant) and 2nd differences 0, 0, 0, so 1-bit storage is needed. Example 2: 128, 135, 143, 154, 163, … has 1st differences 7, 8, 11, 9 (not constant) and 2nd differences 1, 3, -2, so 2-bit storage is needed. • Adaptive Delta encoding (Int or Long, TODO): an adaptive enhancement of TS2Diff; see next page. • RLE encoding (repeated Int or Long): for a repeated sequence, store a value and its count. • Bit-Packing encoding (Int or Long): store data in compact form; squeeze out wasteful bits. • Gorilla encoding (Float or Double): XOR consecutive data points; store with a variable-length encoding scheme. • Compression algorithms: Snappy, Gzip (TODO), LZO (TODO)
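The two numeric examples on this slide can be reproduced with a short Python sketch. It only illustrates the first-difference and second-difference idea; it is not the TsFile implementation, and the function names `differences`, `second_order`, and `decode` are hypothetical.

```python
def differences(values):
    """Return the differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def second_order(values):
    """Return (first differences, second differences) of a sequence."""
    first = differences(values)
    return first, differences(first)


def decode(first_value, first_delta, second_diffs):
    """Rebuild the original sequence from its second differences."""
    values = [first_value, first_value + first_delta]
    delta = first_delta
    for dd in second_diffs:
        delta += dd                     # recover the next first difference
        values.append(values[-1] + delta)
    return values


if __name__ == "__main__":
    regular = [128, 136, 144, 152, 160]
    irregular = [128, 135, 143, 154, 163]
    print(second_order(regular))    # ([8, 8, 8, 8], [0, 0, 0])
    print(second_order(irregular))  # ([7, 8, 11, 9], [1, 3, -2])
    first, second = second_order(irregular)
    print(decode(irregular[0], first[0], second) == irregular)  # True
```

For evenly spaced timestamps the second differences are all zero, which is why such a column compresses so well; the actual bit-level layout used by TsFile is outside the scope of this sketch.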
  • 43. TsFile: Encoding and Compression • Adaptive TS2Diff encoding (Int or Long, TODO) • For time series with outliers or missing points • Stores second-order delta values and a boolean flag array.
  • 44. Time Series Specific Operations (TODO) • Pattern Matching for Streaming Time Series Data: split the pattern and data stream into equal-length fragments; extract features to reduce the dimension; accelerate the search by using features. Scenario: fault alarm in real time. Example: SELECT wind_3s FROM china.farm1.tb2 WHERE time > t1 AND time < t2 AND wind_3s LIKE PATTERN(7.2,..,20.3,..,6.0) • Similarity Search of Sub-series: index data using Key-Value form. Scenarios: outlier detection, historical data analysis, …
  • 45. From Edge to Cloud: Run IoTDB Everywhere • TsFile (a component of IoTDB), a zip file of time series: time series data files with high-performance writes, a high compression ratio, and support for simple queries. Simply put, TsFile is a zip file for time series data. Suitable for embedded devices, general servers, data centers, etc. • IoTDB, a database of time series: freely operates on time series across multiple TsFiles, including CRUD and advanced queries like max, min, avg, and temporal alignment. Scenario: embedded equipment, on-site industrial computers, general servers, etc. • 3rd-party systems, a data warehouse of time series: easy to use and integrate for complex analysis (data fusion, collaborative recommendation, machine learning). Scenario: cloud data center.
  • 46. A Process to Manage Time Series Data • Diagram labels: data source or JDBC / Session API; JDBC / Session API; Grafana-Adaptor: Visualization (manual data exploration); Spark-TsFile-Adaptor: Analysis with Big Data Framework (big data set); JDBC: Analysis with Matlab (small data set) • https://github.com/jixuan1989/iotdb-tutorial
  • 47. Latest version v0.8 (0.9.0-snapshot) • Apache IoTDB-incubating v0.9.0-SNAPSHOT • Xeon E5v4, 256G Mem, HDD Disk • Insertion setting: #Client = 10, #Storage Group = 50, #Device = 1000, #Measurement per Device = 100, DataType = Float, Encoding = RLE, Compression = Snappy, BatchSize = 100, #Point per Time Series = 100000 • Query setting: #Client = 50, #Storage Group = 1, #Device = 1, #Measurement per Device = 10, DataType = Float, Encoding = RLE, Compression = Snappy, BatchSize = 100, #Point per Time Series = 100000000
  • 48. Compression • Apache IoTDB-incubating v0.9.0-SNAPSHOT • Xeon E5v4, 256G Mem, HDD Disk • Raw data: 12 bytes per point, 112 GB in total
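The raw data size can be sanity-checked with a little arithmetic. The sketch below assumes, and the slides do not state this explicitly, that the raw data comes from the insertion setting of slide 47 (1000 devices in total, 100 measurements per device, 100000 points per time series) and that "GB" is counted in units of 2^30 bytes. The 12 bytes per point would be consistent with an 8-byte timestamp plus a 4-byte Float value, which is also an assumption.

```python
# Assumed workload: the insertion setting from the benchmark slide.
DEVICES = 1000
MEASUREMENTS_PER_DEVICE = 100
POINTS_PER_SERIES = 100_000
BYTES_PER_POINT = 12  # stated on the slide


def raw_size_gib(devices, measurements, points, bytes_per_point):
    """Return the uncompressed data size in GiB (2**30 bytes)."""
    total_bytes = devices * measurements * points * bytes_per_point
    return total_bytes / 2**30


if __name__ == "__main__":
    size = raw_size_gib(DEVICES, MEASUREMENTS_PER_DEVICE,
                        POINTS_PER_SERIES, BYTES_PER_POINT)
    print(round(size, 1))  # 111.8
```

Under these assumptions the workload holds 10^10 points, or about 111.8 GiB, which agrees with the 112 GB quoted on the slide.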
  • 49. Write Performance: points/s (single node) • Apache IoTDB-incubating v0.9.0-SNAPSHOT • Xeon E5v4, 256G Mem, HDD Disk • * In this experiment, we do not use IoTDB’s JDBC API or SQL interface. Instead, we use a raw API similar to Cassandra’s raw Thrift API.
  • 50. Query Performance: aggregation count() • Apache IoTDB-incubating v0.9.0-SNAPSHOT • Xeon E5v4, 256G Mem, HDD Disk • InfluxDB failed to return any answers in the 100,000,000 setting.
  • 51. Shanghai METRO Monitoring • Before: 144 trains; 3200 points/500 ms/train; 9 KairosDB + Cassandra; 14 RESTful services • After the upgrade: 300 trains; 3200 points/200 ms/train; KDB-compatible RESTful services (kept only to avoid modifying current programs) in front of ONE IoTDB instance • 414 billion data points per day using just ONE IoTDB instance
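The daily volume on this slide follows directly from the stated rates. A minimal Python check, assuming every train reports continuously for 24 hours (the function name `points_per_day` is hypothetical):

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 ms


def points_per_day(trains, points_per_batch, interval_ms):
    """Points written per day if every train sends one batch per interval."""
    batches_per_day = MS_PER_DAY // interval_ms
    return trains * points_per_batch * batches_per_day


if __name__ == "__main__":
    after = points_per_day(trains=300, points_per_batch=3200, interval_ms=200)
    before = points_per_day(trains=144, points_per_batch=3200, interval_ms=500)
    print(after)   # 414720000000, i.e. about 414 billion
    print(before)  # 79626240000, i.e. about 80 billion
```

The upgraded deployment therefore ingests roughly five times as many points per day as the earlier one, under the continuous-reporting assumption.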
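  • 52. Join Us • Mailing list: • subscribe: dev-subscribe@iotdb.incubator.apache.org • discussion: dev@iotdb.apache.org (Chinese and English are both welcome; English is recommended) • Bug report: https://s.apache.org/iotdb-issues (Chinese and English are both welcome; English is recommended) • Website: https://iotdb.apache.org • QR codes: DingTalk user group; official website • IoTDB community building: • Invite more developers/users/students to join the community and grow together • One of the best choices for undergraduate graduation projects and graduate internships! • Students/developers from outside Beijing are welcome (invited to attend >= 1 Beijing meetup)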