2. About:Me
Mohd Izhar Firdaus Ismail
- Current: Solution Architect @ ABYRES Enterprise Technologies Sdn Bhd
- Open Source Activist & (self-proclaimed) Hacker, Open Data Advocate, Fedora Ambassador, Data Architect, Data Engineer, Consultant, Python Programmer, Analyst, Trainer, and a bunch of other hats ;-)
- Contributing to Open Source projects for over 8 years
- Over 6 years building systems related to data, content, information and knowledge management
- http://linkedin.com/in/kagesenshi
3. Disclaimer:
Some people call me a data scientist,
but I don't consider myself one (yet)
(( it's a personal integrity thing: Machine Learning & Stats are not (yet) my strong points ))
But I do work a lot with data: designing applications, infrastructure, algorithms, processes and pipelines for big data workloads, from data acquisition to visualization
6. Open Data Apps Around The World
What you can do with quality Open Data
(and a glimpse of what nice stuff other people have ^.^)
7. Data.gov (United States)
- One of the earliest government Open Data initiatives
- Over 159,576 datasets from agencies across the US government (as of 14th Aug 2015)
- NGOs such as Code For America are building apps using data from it
- Companies are leveraging the data for their own startups and businesses
8. Data.gov : Alternative Fuels Station Locator
Benefit / Impact: Helps individuals locate nearby alternative fuel stations (electric, hydrogen, biodiesel, etc.)
Data from: US Department of Energy
9. Data.gov : Climate.com
Benefit / Impact: Helps farmers plan their farming activities based on weather conditions
Data from:
- National Weather Service
- US Geological Survey
- National Aeronautics and Space Administration
10. Data.gov : College Affordability and Transparency Center
Benefit / Impact: Enables students to make an informed decision about where to further their studies, based on their budget
Data from: Department of Education, National Center for Education Statistics
11. Data.gov.uk (United Kingdom)
- Ranked 1st in the international Open Data Barometer of the Open Data Institute (ODI)
- Over 22,946 datasets (as of 14th Aug 2015)
- 378 apps (as of 14th Aug 2015)
15. The bulk of your data-related work will be in cleaning data
- Excel to JSON/CSV
- PDF to JSON/CSV
- Unstructured to structured
- Joining multiple data sources into one, where the joining key is not obvious
- Normalizing duplicates, errors, typos, language, etc.
- Dealing with inconsistent schema of historical data
- Extracting more features of data points
- Enriching data with more useful information (e.g. longitude, latitude)
- Dealing with data that was poorly collected
- Dealing with aggregated data that is not quite useful
- Real-life data is a mess: SNAFU ;-)
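A couple of the chores above (normalizing typos, joining sources where the key is not obvious, enriching with long/lat) can be sketched in plain Python. This is only a minimal illustration using the standard library: the `normalize` and `fuzzy_join` helpers, the `cutoff` value and the sample rows are all hypothetical, and real data will usually need more careful matching rules.

```python
import difflib
import unicodedata


def normalize(name):
    """Strip accents and punctuation, lower-case, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return " ".join(text.split())


def fuzzy_join(left, right, key, cutoff=0.8):
    """Join two lists of dicts on a free-text key using approximate matching.

    Rows in `left` with no close match in `right` are kept unchanged.
    """
    index = {normalize(row[key]): row for row in right}
    joined = []
    for row in left:
        matches = difflib.get_close_matches(
            normalize(row[key]), list(index), n=1, cutoff=cutoff
        )
        match = index[matches[0]] if matches else {}
        # Values from the left row win, so its original spelling is preserved.
        joined.append({**match, **row})
    return joined


# Hypothetical sample rows: the same districts, spelled inconsistently.
stations = [
    {"district": "Kuala  Lumpur", "stations": 12},
    {"district": "Pulau Pinang.", "stations": 5},
]
coords = [
    {"district": "kuala lumpur", "lat": 3.139, "long": 101.687},
    {"district": "Pulau Pinnang", "lat": 5.414, "long": 100.329},
]

enriched = fuzzy_join(stations, coords, "district")
for row in enriched:
    print(row)
```

A lower `cutoff` catches more typos but also produces more false matches, so it is worth eyeballing the joined rows before trusting them.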
16. Analytics Tools & Platforms
Plenty of open source tools are available
- Simple data and analysis can be handled without a complex Big Data ecosystem. A ${YourFavouriteLanguage} executable is usually more than enough to transform, clean and explore data to get initial insights and understanding
- I speak mostly in snake language, so naturally I prefer Python stuff ;-)
– Python is a strong language in scientific computing due to its history in mathematics, its rich open source library ecosystem, and its simplicity for rapid experimentation
– pandas, NumPy, SciPy, pymapreduce, xlrd, pyexcel, scikit, luigi, vaderSentiment, etc.
- D3.js is highly recommended for developing data-driven visualizations for the web
– Plenty of other JavaScript libraries can help render beautiful diagrams
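To illustrate the point that a single script is often enough for a first look at a dataset, here is a minimal sketch that cleans a small CSV and prints a JSON summary. It uses only the Python standard library so it stays self-contained (pandas would be the more usual choice); the `RAW_CSV` content, the column names and the `load_rows` helper are made up for the example.

```python
import csv
import io
import json
import statistics
from collections import Counter

# Hypothetical raw export: inconsistent casing, stray spaces, a duplicate row.
RAW_CSV = """state,year,datasets
Selangor,2014,12
selangor ,2015,18
Johor,2015,7
Johor,2015,7
"""


def load_rows(text):
    """Parse the CSV, tidy the values, and drop exact duplicate rows."""
    rows, seen = [], set()
    for rec in csv.DictReader(io.StringIO(text)):
        row = {
            "state": rec["state"].strip().title(),
            "year": int(rec["year"]),
            "datasets": int(rec["datasets"]),
        }
        key = tuple(row.values())
        if key not in seen:
            seen.add(key)
            rows.append(row)
    return rows


rows = load_rows(RAW_CSV)

# Explore: total datasets per state and the mean per row.
per_state = Counter()
for row in rows:
    per_state[row["state"]] += row["datasets"]

summary = {
    "rows": len(rows),
    "datasets_per_state": dict(per_state),
    "mean_datasets": statistics.mean(r["datasets"] for r in rows),
}
print(json.dumps(summary, indent=2))
```

Once the data outgrows this kind of script, the same transform, clean and explore steps map fairly directly onto pandas or PySpark.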
17. My Personal Favourites
- "Small" data: IPython Notebook & Python libraries
- "Big data": Apache Zeppelin, PySpark & Python libs
- Hortonworks HDP Sandbox (Pig, Hive, Spark, and friends)
- Amazon EMR (large cluster to crunch your data)