Enhancing Environmental Data Forecasting Performance by Utilizing Multi-region Data with Hard-parameter Sharing
Although deep neural network models can learn complex non-linear relationships between input and target data, they require a large amount of well-balanced data to reach a high level of performance. Unfortunately, such abundance is rare in practice: in environmental data forecasting, for instance, datasets are not only severely imbalanced but also scarce. Hence, this paper presents a multi-headed deep neural network model that can effectively learn from multi-region datasets, mitigating data imbalance and insufficiency. The proposed architecture learns common features from multiple regions in addition to region-specific features of the target region. The experimental studies show that the proposed network uses additional multi-region data more effectively than a baseline model trained on the same multi-region data, improving prediction performance.
Electronics and Telecommunications Research Institute
Jang-Ho Choi, Miyoung Jang, Taewhi Lee, Jongho Won and Jiyong Kim
janghochoi@etri.re.kr
3. Time-series Forecasting
• “Time series analysis comprises methods for analyzing time
series data in order to extract meaningful statistics and other
characteristics of the data.”
• “Time series forecasting is the use of a model to predict
future values based on previously observed values.”
• “Forecasting involves taking models fit on historical data and using
them to predict future observations.”
https://en.wikipedia.org/wiki/Time_series
https://machinelearningmastery.com/time-series-forecasting/
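The idea of predicting future values from previously observed values can be sketched as a sliding-window transformation, which turns a series into (inputs, target) pairs that a model can be fit on. This is a minimal illustration only: the function name `make_windows`, its parameters, and the sample readings are hypothetical and do not come from the slides.

```python
def make_windows(series, n_lags, horizon=1):
    """Turn a series into (inputs, target) pairs for supervised forecasting.

    Each pair uses n_lags consecutive past values as inputs and the value
    `horizon` steps after the last input as the target.
    """
    pairs = []
    last_start = len(series) - n_lags - horizon
    for start in range(last_start + 1):
        inputs = series[start:start + n_lags]
        target = series[start + n_lags + horizon - 1]
        pairs.append((inputs, target))
    return pairs


# Hypothetical readings, for illustration only.
readings = [3.0, 4.0, 6.0, 5.0, 7.0, 9.0]
pairs = make_windows(readings, n_lags=3)
for inputs, target in pairs:
    print(inputs, "->", target)
```

With three lags and a one-step horizon, six readings yield three training pairs; a longer horizon leaves fewer pairs, which is one reason short environmental records become scarce quickly.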
4. Environmental Data
• Environmental data, such as air quality and water quality data, are
generally multivariate time-series data.
• They are generally collected from multiple sites.
<Water/Air Quality Monitoring Stations>
5. Challenges
• Datasets are not only scarce and expensive, but also often
severely imbalanced in practice.
• Insufficient and imbalanced datasets can lead to biased models,
resulting in poor prediction performance.
<Chlorophyll-a Concentration in Daecheong Lake (2012)>
<Histogram of Chlorophyll-a Concentration (mg/m3, Binwidth=1); axes: Chl-a (mg/m3), # Chl-a concentrations>
8. Hard-parameter Sharing
• In order to preserve region-specific characteristics, we adopt hard-parameter sharing from multi-task learning [7].
• Hard-parameter sharing is a technique that places common hidden layers across multiple tasks while keeping task-specific layers.
• Hard-parameter sharing is known to reduce overfitting, but it is generally restricted to tasks that are closely related to one another [8].
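The layout described above, common hidden layers plus task-specific layers, can be sketched as a forward pass in NumPy. This is a minimal sketch under stated assumptions, not the authors' network: the layer sizes, the `tanh` activation, the task names, and the `forward` helper are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8                  # assumption: illustrative input width
N_HIDDEN = 16                   # assumption: illustrative hidden width
TASKS = ["task_a", "task_b"]    # hypothetical task names

# Common hidden layer: a single set of weights used by every task.
shared = {
    "W": rng.normal(0.0, 0.1, (N_FEATURES, N_HIDDEN)),
    "b": np.zeros(N_HIDDEN),
}

# Task-specific output layers: one independent set of weights per task.
heads = {
    task: {"W": rng.normal(0.0, 0.1, (N_HIDDEN, 1)), "b": np.zeros(1)}
    for task in TASKS
}


def forward(x, task):
    """Run inputs through the common layer, then through the head of `task`."""
    hidden = np.tanh(x @ shared["W"] + shared["b"])
    return hidden @ heads[task]["W"] + heads[task]["b"]


x = rng.normal(size=(4, N_FEATURES))    # four hypothetical input rows
out_a = forward(x, "task_a")
out_b = forward(x, "task_b")
print(out_a.shape, out_b.shape)
```

Both calls reuse the same hidden representation; only the final layer differs, which is what distinguishes hard-parameter sharing from training one fully separate model per task.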
9. Multi-headed Networks
• Common layers fit on all data from multiple regions
• Region-specific layers fit on their own regional data
<Iterative Training Process: input data (R1 to R4 Data), Common Layers, region-specific layers (R1 to R4 Layers)>
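One way to realize the two bullets above is to cycle through the regions so that every update touches the common layers while only the current region's layers change. The slides do not spell out the schedule, so the round-robin order below is an assumption, as are the synthetic data, layer sizes, learning rate, and all names (`common`, `heads`, `train_step`). It is a sketch of the idea, not the authors' training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
REGIONS = ["R1", "R2", "R3", "R4"]
N_FEATURES, N_HIDDEN, LR = 3, 8, 0.02   # assumptions: illustrative sizes and step

# Synthetic data: a signal shared by all regions plus a region-specific offset.
data = {}
for i, region in enumerate(REGIONS):
    x = rng.normal(size=(64, N_FEATURES))
    y = np.tanh(x.sum(axis=1, keepdims=True)) + 0.5 * i
    data[region] = (x, y)

# Common layer (updated by every region) and one head per region.
common = {"W": rng.normal(0.0, 0.3, (N_FEATURES, N_HIDDEN)), "b": np.zeros(N_HIDDEN)}
heads = {
    r: {"W": rng.normal(0.0, 0.3, (N_HIDDEN, 1)), "b": np.zeros(1)}
    for r in REGIONS
}


def predict(x, region):
    """Return the hidden activations and the prediction of one region's head."""
    hidden = np.tanh(x @ common["W"] + common["b"])
    return hidden, hidden @ heads[region]["W"] + heads[region]["b"]


def mse(region):
    x, y = data[region]
    return float(np.mean((predict(x, region)[1] - y) ** 2))


def train_step(region):
    """One gradient step on one region: updates common layer and that head only."""
    x, y = data[region]
    hidden, pred = predict(x, region)
    head = heads[region]
    grad_pred = 2.0 * (pred - y) / len(x)           # d(MSE)/d(pred)
    grad_hidden = grad_pred @ head["W"].T           # backprop into the common layer
    grad_pre = grad_hidden * (1.0 - hidden ** 2)    # derivative of tanh
    head["W"] -= LR * (hidden.T @ grad_pred)
    head["b"] -= LR * grad_pred.sum(axis=0)
    common["W"] -= LR * (x.T @ grad_pre)
    common["b"] -= LR * grad_pre.sum(axis=0)


loss_before = {r: mse(r) for r in REGIONS}
for epoch in range(300):
    for region in REGIONS:      # assumption: simple round-robin schedule
        train_step(region)
loss_after = {r: mse(r) for r in REGIONS}
print(loss_before)
print(loss_after)
```

Because the common layer receives gradients from all four regions, a region with few samples still benefits from features learned elsewhere, while its own head absorbs what is specific to it (here, the offset).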
11. Algal Bloom
• “An algal bloom or algae bloom is a rapid increase or accumulation in the population of algae in freshwater or marine water systems” (Wikipedia)
• For the last few decades, algal blooms have been a serious problem, leading to an imbalance of organisms in the water.
• Severe algal blooms directly affect water quality, as some algae such as cyanobacteria produce toxins that are harmful to other species, including humans.
<Image sources: YTN, Joongang>
12. Chlorophyll-a
• “Chlorophyll-a is a specific form of chlorophyll used in oxygenic photosynthesis. It absorbs most energy from wavelengths of violet-blue and orange-red light. It also reflects green-yellow light, and as such contributes to the observed green color of most plants.” (Wikipedia)
• The Korean government has been operating the algal warning system since 1998, based on chlorophyll-a concentration and the number of cyanophyceae cells.
• Although the use of chlorophyll-a concentration as the primary indicator is debated, it is still widely used because it can be measured at low cost, whereas counting cyanophyceae cells requires considerable human labor.
• Unfortunately, water quality datasets are not only scarce, but also
skewed and imbalanced due to their nature.
13. Datasets
• Water Quality Monitoring Station
(http://www.koreawqi.go.kr/)
• Daecheong, Janggye, Chungam, Namgang
• 2012/7/1 ~ 2018/3/31
• 8 Water Quality Index Variables:
• Chlorophyll-a
• Water Temperature, pH
• Total Electrical Conductivity
• Dissolved Oxygen, Total Organic Carbon
• Total Nitrogen, Total Phosphorus
Region       Class 0       Class 1        Class 2         Class 3
Chl-a        0~15 mg/m3    15~25 mg/m3    25~100 mg/m3    100+ mg/m3
JangGye      1217          255            142             0
DaeCheong    1451          164            70              2
CheongAm     794           523            535             10
NamGang      1478          156            365             5
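The class ranges and counts in the table above can be turned into a small sketch that assigns a Chl-a reading to a class and reports how each region's counts are distributed. The thresholds and counts are taken from the table; the function name `chl_a_class` and the handling of values that fall exactly on a boundary (assigned to the higher class) are assumptions, since the slides do not specify them.

```python
# Upper edges (mg/m3) of Classes 0, 1 and 2; Class 3 is everything above.
BOUNDS = [15.0, 25.0, 100.0]


def chl_a_class(concentration):
    """Map a Chl-a concentration in mg/m3 to its class label (0 to 3)."""
    # Assumption: a value exactly on a boundary goes to the higher class.
    for label, upper in enumerate(BOUNDS):
        if concentration < upper:
            return label
    return len(BOUNDS)


# Counts per class (Class 0 to Class 3), copied from the table.
counts = {
    "JangGye": [1217, 255, 142, 0],
    "DaeCheong": [1451, 164, 70, 2],
    "CheongAm": [794, 523, 535, 10],
    "NamGang": [1478, 156, 365, 5],
}

shares = {}
for region, row in counts.items():
    total = sum(row)
    shares[region] = [round(100.0 * n / total, 1) for n in row]
    print(region, total, shares[region])
```

Class 0 makes up roughly 75% of the JangGye counts and 86% of the DaeCheong counts, while Class 3 is absent or nearly absent everywhere, which illustrates the imbalance the slides describe.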
15. Experiment Results
Model                              R2        MSE      Micro Acc.   Macro Acc.
Baseline (Single Region)           0.5297    36.16    73.98%       66.72%
Baseline (Multiple Regions)        0.4207    44.55    68.65%       49.04%
Multi-headed (Multiple Regions)    0.5633    32.70    75.36%       70.04%
<Performance Comparison (R2, MSE, Micro and Macro Accuracy)>

<Chlorophyll-A Prediction Results (JangGye): Baseline Model with Single Region (upper) vs. Multi-head Neural Network Model (lower)>

Model                              Class 0       Class 1        Class 2
Chl-a                              0~15 mg/m3    15~25 mg/m3    25~100 mg/m3
Baseline (Single Region)           83.00%        59.26%         57.89%
Baseline (Multiple Regions)        91.50%        34.57%         21.05%
Multi-headed (Multiple Regions)    83.72%        58.89%         67.50%
<Performance Comparison (Class Accuracy)>
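The reported macro accuracy figures are consistent with an unweighted mean of the per-class accuracies, which the short check below reproduces from the class-accuracy values. Micro accuracy instead weights each class by its number of test samples, which the slides do not list, so the counts passed to `micro_accuracy` here are purely hypothetical and only show how the two measures differ. Both function names are illustrative.

```python
# Per-class accuracies (%) for Classes 0, 1 and 2, copied from the results.
class_accuracy = {
    "Baseline (Single Region)": [83.00, 59.26, 57.89],
    "Baseline (Multiple Regions)": [91.50, 34.57, 21.05],
    "Multi-headed (Multiple Regions)": [83.72, 58.89, 67.50],
}


def macro_accuracy(per_class):
    """Unweighted mean of per-class accuracies: every class counts equally."""
    return sum(per_class) / len(per_class)


def micro_accuracy(correct_per_class, total_per_class):
    """Share of all samples classified correctly: large classes dominate."""
    return 100.0 * sum(correct_per_class) / sum(total_per_class)


macro = {name: round(macro_accuracy(acc), 2) for name, acc in class_accuracy.items()}
for name, value in macro.items():
    print(name, value)

# Hypothetical counts, for illustration only: a dominant, well-predicted class
# keeps micro accuracy high even when a rare class is predicted poorly.
example_micro = micro_accuracy(correct_per_class=[90, 2], total_per_class=[100, 10])
example_macro = macro_accuracy([90.0, 20.0])
print(round(example_micro, 2), example_macro)
```

This is why the multi-region baseline looks acceptable on micro accuracy (68.65%) but poor on macro accuracy (49.04%): its gains on Class 0 mask large losses on the rarer Classes 1 and 2.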
16. Acknowledgement
• This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00219, Space-time complex artificial intelligence blue-green algae prediction technology based on direct-readable water quality complex sensor and hyperspectral image).
17. References
1) S. Agrawal, L. Barrington, C. Bromberg, J. Burge, C. Gazen, and J. Hickey, "Machine Learning for Precipitation Nowcasting from Radar Images," arXiv:cs.CV/1912.12132, 2019.
2) K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, Vol. 4, Issue 2, pp. 251-257, 1991.
3) J. Choi, J. Park, H. Park, and O. Min, "DART: Fast and Efficient Distributed Stream Processing Framework for Internet of Things," ETRI Journal, Vol. 39, pp. 202-212, 2017. doi:10.4218/etrij.17.2816.0109
4) S. H. Khan et al., "Cost-Sensitive Learning of Deep Feature Representations From Imbalanced Data," IEEE Transactions on Neural Networks and Learning Systems, Vol. 29, Issue 8, 2015.
5) N. V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique," Journal of Artificial Intelligence Research, Vol. 16, 2002.
6) J. Choi, J. Kim, J. Won, and O. Min, "Modelling Chlorophyll-a Concentration using Deep Neural Networks considering Extreme Data Imbalance and Skewness," 2019 21st International Conference on Advanced Communication Technology (ICACT). doi:10.23919/ICACT.2019.8702027
7) R. Caruana, "Multitask learning: A knowledge-based source of inductive bias," Proceedings of the Tenth International Conference on Machine Learning, 1993.
8) S. Ruder, "An Overview of Multi-Task Learning in Deep Neural Networks," arXiv:cs.LG/1706.05098, 2017.