The talk was delivered by Ian Rolewicz at the International Workshop on Cloud for High Performance Computing 2011 (C4HPC'11), co-located with the 2011 International Conference on Computational Science and its Applications (ICCSA 2011).
Publication: http://bit.ly/GRBkC2
Abstract:
This document introduces the TimeCloud Front End, a web-based interface for the TimeCloud platform, which manages large-scale time series in the cloud. While the Back End is built upon scalable, fault-tolerant distributed systems such as Hadoop and HBase and takes novel approaches to facilitating data analysis over massive time series, the Front End was built as a simple and intuitive interface for viewing the data present in the cloud, both as a simple tabular display and with the help of various visualizations. In addition, the Front End implements model-based views and on-demand data fetching to reduce the amount of work performed at the Back End.
1. Building a Front End Interface for a Sensor Data Cloud
Ian Rolewicz
Semester Project, FALL 2010
Supervised by Hoyoung Jeung, Michele Catasta & Zoltán Miklós
5. The Front End
• Web-based interface
• Main Goals:
– Display the Data
– Be user-friendly (preferably)
– Reduce the work performed at the Back End
• Implemented in Python using the Django framework and the YUI 2 library
• Visualizations implemented with Protovis
7. Full Precision vs. Model-Based
• Full Precision
– Real Data
– Whole Data taken from the Back End
– Only display at the Front End
• Model-Based Approximations
– Reconstructed Data from Parameters
– Less Data retrieved from the Back End
– Reconstruction and display of the values at the Front End
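The reconstruction step for a model-based view can be illustrated with a minimal Python sketch. It assumes each model is a linear segment stored as (start time, end time, slope, intercept); this layout and the names `Segment` and `reconstruct` are hypothetical and not taken from TimeCloud itself.

```python
from typing import List, Tuple

# Hypothetical segment layout (an assumption, not TimeCloud's schema):
# (start_time, end_time, slope, intercept), where intercept is the value at start_time.
Segment = Tuple[int, int, float, float]


def reconstruct(segments: List[Segment], timestamps: List[int]) -> List[Tuple[int, float]]:
    """Rebuild approximate readings for every timestamp covered by a segment."""
    values = []
    for t in timestamps:
        for start, end, slope, intercept in segments:
            if start <= t <= end:
                # Evaluate the line of the covering segment at time t.
                values.append((t, slope * (t - start) + intercept))
                break  # segments are assumed not to overlap
    return values


# Two segments stand in for the parameters fetched from the Back End.
segments = [(0, 4, 0.5, 20.0), (5, 9, -1.0, 22.0)]
points = reconstruct(segments, [0, 2, 4, 5, 7, 12])
```

Here only eight numbers describe the two segments, and timestamp 12 is skipped because no segment covers it.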
8. The Data Model
• NULLs not stored in HBase → better for sparse data
• Column families stored in separate files
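The two properties above can be mimicked in plain Python as a rough mental model. This sketch does not use the HBase API; the class `SparseTable` and the family names `raw` and `model` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class SparseTable:
    """Toy stand-in for a column-family store: one dict per family, no NULL cells."""

    def __init__(self):
        # family name -> {(row_key, qualifier): value}; each family is kept apart,
        # loosely mirroring the separate files used per column family.
        self.families = defaultdict(dict)

    def put(self, row_key, family, qualifier, value):
        if value is None:
            return  # a missing reading occupies no space at all
        self.families[family][(row_key, qualifier)] = value

    def get(self, row_key, family, qualifier):
        # Absent cells simply read back as None.
        return self.families[family].get((row_key, qualifier))

    def cell_count(self):
        return sum(len(cells) for cells in self.families.values())


table = SparseTable()
table.put("t0001", "raw", "temperature", 21.5)
table.put("t0001", "raw", "humidity", None)   # sensor did not report: nothing stored
table.put("t0001", "model", "slope", 0.5)
```

After these three calls only two cells exist, which is why sparse sensor data fits this kind of layout well.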
9. Performance Measures
• Testbed on a cluster of 13 Amazon EC2 servers, each having:
– 15 GB Memory
– 8 EC2 Compute Units
– 1.7 TB Storage
– 64-bit platform
• One of them: HBase Master + Front End
• 12 others: HBase Region Servers
10. Data Used for Measures
• « Worst-case » for TimeCloud
• Compress no more than 1/5 of original data when linearly approximated
• Linear regression → in GSN, usually 99% compression
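To make the idea of compressing a series by linear approximation concrete, here is a small Python sketch. It uses a simple greedy segmentation, which is a stand-in and not necessarily the method used in TimeCloud or GSN; the error bound `eps`, the function names, and the count of two stored parameters per segment are all assumptions.

```python
def _fits(values, start, end, eps):
    """True if the line through the two endpoints stays within eps of every point."""
    slope = (values[end] - values[start]) / (end - start)
    return all(
        abs(values[start] + slope * (i - start) - values[i]) <= eps
        for i in range(start, end + 1)
    )


def segment_linear(values, eps):
    """Greedy segmentation: extend each segment for as long as it still fits."""
    segments = []
    start, n = 0, len(values)
    while start < n:
        end = start
        while end + 1 < n and _fits(values, start, end + 1, eps):
            end += 1
        segments.append((start, end))
        start = end + 1
    return segments


def stored_fraction(values, eps):
    """Stored parameters divided by raw values, assuming 2 parameters per segment."""
    return 2 * len(segment_linear(values, eps)) / len(values)


smooth = [float(i) for i in range(100)]   # perfectly linear series
zigzag = [0.0, 5.0] * 50                  # alternating series, hard to approximate
smooth_fraction = stored_fraction(smooth, eps=0.1)
zigzag_fraction = stored_fraction(zigzag, eps=0.1)
```

The smooth series collapses into a single segment, while the zigzag series needs one segment per pair of points and gains nothing, which illustrates why hard-to-approximate data is the unfavourable case for a model-based store.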
11. Random Reads
• 1000 random reads in approximated dataset
• Evenly spread
• 22% improvement in query execution time
• Less data retrieved → more cache hits
14. Conclusion
• Goals achieved:
– Display the Data
– Keep it simple
– Reduce the work performed at the Back End
• Good Basis for future extensions
• Future Work
– User/Group-based management and access
– Completion of the model-based views
– Design of additional visualizations