Overview of tools available in python for performing data visualization (statistical, geographical, reporting, etc). Prepared for Minsk DataViz Day (October 4, 2017)
1. Data visualization tools in Python
Roman Merkulov
Data Scientist at InData Labs
r_merkulov@indatalabs.com
merkylovecom@mail.ru
2. Content
- why dataviz is important
- dataviz libraries in python
- facets tool
- interactive maps
- Apache Superset
3. data visualization
- EDA & understanding the data
- fix data
- show insights
- model validation
- analytics & reporting
4. Plots vs descriptive statistics
Anscombe's quartet
*https://en.wikipedia.org/wiki/Anscombe%27s_quartet
5. Plots vs descriptive statistics
Anscombe's quartet
*https://en.wikipedia.org/wiki/Anscombe%27s_quartet
Property             | Value            | Accuracy
Mean of x            | 9                | exact
Sample variance of x | 11               | exact
Mean of y            | 7.50             | to 2 decimal places
Sample variance of y | 4.125            | +/- 0.003
Correlation coef.    | 0.816            | to 3 decimal places
Linear regression    | y = 3.00 + 0.50x | to 2 decimal places
Determination coef.  | 0.67             | to 2 decimal places
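These summaries are easy to verify directly; a standard-library sketch using dataset I of the quartet (the other three datasets reproduce the same numbers):

```python
import math
import statistics

# Dataset I of Anscombe's quartet
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

mean_x = statistics.mean(x)      # 9
var_x = statistics.variance(x)   # 11 (sample variance)
mean_y = statistics.mean(y)      # ~7.50
var_y = statistics.variance(y)   # ~4.127

# Pearson correlation, computed by hand from the sample covariance
cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (len(x) - 1)
r = cov / math.sqrt(var_x * var_y)  # ~0.816

print(mean_x, var_x, round(mean_y, 2), round(var_y, 3), round(r, 3))
```

Plotting the four datasets, however, shows four completely different relationships behind these identical numbers.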
25. Apache Superset
Who uses:
Airbnb Amino Brilliant.org Clark.de Digit Game Studios Douban
Endress+Hauser FBK - ICT center Faasos GfK Data Lab InData Labs
Maieutical Labs Qunar Shopkick Tails.com Tobii Tooploox Udemy Yahoo!
Zalando
Project name history: Panoramix, then Caravel, now Superset
*https://github.com/apache/incubator-superset
Article on Superset benefits
and limitations
https://indatalabs.com/blog/data-strategy/open-source-data-visualization-tool-superset
Roaring Elephant podcast
Episode 41
https://roaringelephant.org/2017/04/25/episode-41-news-news-and-some-more-news/
26. Thanks for your attention!
some examples shown are available here
https://github.com/merkylove/data_visualisations_for_datathon_2017
Contacts:
r_merkulov@indatalabs.com
merkylovecom@mail.ru
https://www.linkedin.com/in/roman-merkulov-a61804a4/
Editor's Notes
The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two variables correlated and following the assumption of normality.
The second graph (top right) is not distributed normally; while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
In the third graph (bottom left), the distribution is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
Finally, the fourth graph (bottom right) shows an example when one outlier is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.[2][3][4][5][6]
The datasets are as follows. The x values are the same for the first three datasets.[1]
it's possible to generate bivariate data with a given mean, median, and correlation in any shape you like — even a dinosaur
The paper linked below describes a method of perturbing the points in a scatterplot, moving them towards a given shape while keeping the statistical summaries close to the fixed target value. The shapes include a star, and a cross, and the "DataSaurus"
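The core of that method can be sketched with the standard library alone: repeatedly nudge a random point and accept the move only if the summary statistics stay close to their starting values. (The paper additionally scores each move by distance to the target shape and anneals acceptance; that part is omitted here, and all names are illustrative.)

```python
import random
import statistics

def perturb_preserving_stats(xs, ys, steps=2000, nudge=0.1, tol=0.01, seed=42):
    """Nudge random points, accepting only moves that keep the means and
    sample standard deviations within `tol` of their starting values."""
    rng = random.Random(seed)
    xs, ys = list(xs), list(ys)
    m_x, m_y = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    for _ in range(steps):
        i = rng.randrange(len(xs))
        old = xs[i], ys[i]
        xs[i] += rng.uniform(-nudge, nudge)
        ys[i] += rng.uniform(-nudge, nudge)
        stats_ok = (abs(statistics.mean(xs) - m_x) < tol
                    and abs(statistics.mean(ys) - m_y) < tol
                    and abs(statistics.stdev(xs) - s_x) < tol
                    and abs(statistics.stdev(ys) - s_y) < tol)
        if not stats_ok:
            xs[i], ys[i] = old  # reject: summaries drifted too far
    return xs, ys
```

After enough steps the cloud of points can wander into an arbitrary shape while the reported statistics barely change.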
designed like MATLAB
many output formats
(
A lot of documentation on the website and in the mailing lists refers to the “backend” and many new users are confused by this term. matplotlib targets many different use cases and output formats. Some people use matplotlib interactively from the python shell and have plotting windows pop up when they type commands. Some people embed matplotlib into graphical user interfaces like wxpython or pygtk to build rich applications. Others use matplotlib in batch scripts to generate postscript images from some numerical simulations, and still others in web application servers to dynamically serve up graphs.
To support all of these use cases, matplotlib can target different outputs, and each of these capabilities is called a backend; the “frontend” is the user facing code, i.e., the plotting code, whereas the “backend” does all the hard work behind-the-scenes to make the figure. There are two types of backends: user interface backends (for use in pygtk, wxpython, tkinter, qt4, or macosx; also referred to as “interactive backends”) and hardcopy backends to make image files (PNG, SVG, PDF, PS; also referred to as “non-interactive backends”).
)
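For example, a batch script that needs no GUI can select the non-interactive Agg backend before pyplot is imported (a minimal sketch; the file name is illustrative):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # hardcopy backend: render straight to files, no window
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], marker="o")
ax.set_title("rendered without a display server")

out_path = os.path.join(tempfile.gettempdir(), "agg_demo.png")
fig.savefig(out_path)  # the Agg backend writes a PNG
plt.close(fig)
```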
can reproduce any plot
well-tested, 14 years as a standard tool
I want population vs area coloured by Region
imperative and too verbose API
poor styles sometimes
poor support of webview/interactions
often slow for large and complicated data
keep matplotlib as a backend and provide domain specific APIs
pandas - dataframe object with plotting methods
seaborn - focus on statistical visualization. Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics. (more than 5 years)
ggplot is a Python implementation of the grammar of graphics. It is not intended to be a feature-for-feature port of ggplot2 for R; though there is much greatness in ggplot2, the Python world could stand to benefit from it. So there will be feature overlap, but not necessarily mimicry (after all, R is a little weird).
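The pandas wrapper in action, as a minimal sketch (the column names and numbers are invented; the Agg backend keeps it script-friendly):

```python
import matplotlib
matplotlib.use("Agg")  # plot without a GUI
import pandas as pd

# hypothetical data for the "population vs area" case
df = pd.DataFrame({
    "area": [110, 340, 90, 210],
    "population": [9.5, 38.0, 4.2, 17.1],
})

# one method call instead of the imperative matplotlib set-up
ax = df.plot.scatter(x="area", y="population")
```

pandas delegates to matplotlib underneath, so the returned Axes object can still be customized with the full matplotlib API when needed.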
cartopy:
(
Some of the key features of cartopy are:
object oriented projection definitions
point, line, polygon and image transformations between projections
integration to expose advanced mapping in matplotlib with a simple and intuitive interface
powerful vector data handling by integrating shapefile reading with Shapely capabilities
)
http://proj4.org/
http://trac.osgeo.org/geos/
networkx:
NetworkX is a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.
Features
Data structures for graphs, digraphs, and multigraphs
Many standard graph algorithms
Network structure and analysis measures
Generators for classic graphs, random graphs, and synthetic networks
Nodes can be "anything" (e.g., text, images, XML records)
Edges can hold arbitrary data (e.g., weights, time-series)
Open source 3-clause BSD license
Well tested with over 90% code coverage
Additional benefits from Python include fast prototyping, easy to teach, and multi-platform
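A minimal sketch of the NetworkX API (node names and attributes are arbitrary):

```python
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d")])
G.nodes["a"]["label"] = "start"    # nodes can carry arbitrary data
G.edges["b", "c"]["weight"] = 2.5  # so can edges

path = nx.shortest_path(G, "a", "d")  # one of many standard algorithms
print(path)  # ['a', 'b', 'c', 'd']
```

For visualization, `nx.draw(G)` renders the graph through matplotlib, which ties it back into the rest of this stack.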
scikit-plot
Scikit-plot is the result of an unartistic data scientist's dreadful realization that visualization is one of the most crucial components in the data science process, not just a mere afterthought.
Gaining insights is simply a lot easier when you're looking at a colored heatmap of a confusion matrix complete with class labels rather than a single-line dump of numbers enclosed in brackets. Besides, if you ever need to present your results to someone (virtually any time anybody hires you to do data science), you show them visualizations, not a bunch of numbers in Excel.
That said, there are a number of visualizations that frequently pop up in machine learning. Scikit-plot is a humble attempt to provide aesthetically-challenged programmers (such as myself) the opportunity to generate quick and beautiful graphs and plots with as little boilerplate as possible.
build an API that serializes the plot (usually JSON) that can be displayed in browser.
Bokeh is a Python interactive visualization library that targets modern web browsers for presentation. Its goal is to provide elegant, concise construction of novel graphics in the style of D3.js, and to extend this capability with high-performance interactivity over very large or streaming datasets. Bokeh can help anyone who would like to quickly and easily create interactive plots, dashboards, and data applications.
Plotly's Python graphing library makes interactive, publication-quality graphs online. Examples of how to make line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes, polar charts, and bubble charts.
toyplot:
Plot types: bar plots, filled region plots, graph visualizations, image visualizations, line plots, matrix plots, numberline plots, scatter plots, tabular plots, text plots.
Styling: standard CSS, rich text with HTML markup.
Integrates with Jupyter without any need for plugins, magics, etc.
Interaction types: display interactive mouse coordinates, export figure data to CSV.
Interactive output formats: Embeddable, self-contained HTML.
Static output formats: SVG, PDF, PNG, MP4, WEBM.
Portability: single code base for Python 2.7 / Python 3.6.
Testing: greater-than-95% regression test coverage.
Main feature: easy animations
Cufflinks: This library binds the power of plotly with the flexibility of pandas for easy plotting.
ipyvolume: 3D plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.
Ipyvolume currently can
Do volume rendering.
Create scatter plots (up to ~1 million glyphs).
Create quiver plots (like scatter, but with an arrow pointing in a particular direction).
Render in the Jupyter notebook, or create a standalone html page (or snippet to embed in your page).
Render in stereo, for virtual reality with Google Cardboard.
Animate in d3 style, for instance if the x coordinates or color of a scatter plot changes.
Animations / sequences, all scatter/quiver plot properties can be a list of arrays, which can represent time snapshots.
Stylable (although still basic)
Integrates with
ipywidgets for adding gui controls (sliders, button etc), see an example at the documentation homepage
bokeh by linking the selection
bqplot by linking the selection
Ipyvolume will probably support, but does not yet:
Render labels in latex.
Do isosurface rendering.
Do selections using mouse or touch.
Show a custom popup on hovering over a glyph.
Python, R, MATLAB, JS
chart, dashboard, slides
Every chart that matplotlib or MATLAB graphics can do.
Interactive charts and maps out-of-the-box.
Get started working offline.
Optional hosted sharing platform through Plotly On-Premises or Plotly Cloud.
on top of d3.js
Streaming API (paid)
community, chat, email, phone support (depends on plan)
public/private charts, dashboards, slides (depends on plan)
png, jpeg, pdf, svg, eps, html export (depends on plan)
connect to 7-18 sources (depends on plan)
Python, R, Scala, Julia
Bokeh, a Python interactive visualization library, enables beautiful and meaningful visual presentation of data in modern web browsers. With Bokeh, you can quickly and easily create interactive plots, dashboards, and data applications.
Bokeh helps provide elegant, concise construction of novel graphics in the style of D3.js, while also delivering high-performance interactivity over very large or streaming datasets.
Datashader is a data rasterization pipeline for automating the process of creating meaningful representations of large amounts of data. Datashader breaks the creation of images of data into 3 main steps:
Projection
Each record is projected into zero or more bins of a nominal plotting grid shape, based on a specified glyph.
Aggregation
Reductions are computed for each bin, compressing the potentially large dataset into a much smaller aggregate array.
Transformation
These aggregates are then further processed, eventually creating an image.
Using this very general pipeline, many interesting data visualizations can be created in a performant and scalable way. Datashader contains tools for easily creating these pipelines in a composable manner, using only a few lines of code. Datashader can be used on its own, but it is also designed to work as a pre-processing stage in a plotting library, allowing that library to work with much larger datasets than it would otherwise.
Datashader is a graphics pipeline system for creating meaningful representations of large datasets quickly and flexibly. Datashader breaks the creation of images into a series of explicit steps that allow computations to be done on intermediate representations. This approach allows accurate and effective visualizations to be produced automatically, and also makes it simple for data scientists to focus on particular data and relationships of interest in a principled way. Using highly optimized rendering routines written in Python but compiled to machine code using Numba, datashader makes it practical to work with extremely large datasets even on standard hardware.
https://datashader.readthedocs.io/en/latest/
Altair is a declarative statistical visualization library for Python, based on Vega-Lite.
With Altair, you can spend more time understanding your data and its meaning. Altair’s API is simple, friendly and consistent and built on top of the powerful Vega-Lite visualization grammar. This elegant simplicity produces beautiful and effective visualizations with a minimal amount of code.
Note: Altair and the underlying Vega-Lite library are under active development; new plot types and streamlined plotting interfaces will be added in future releases. Please stay tuned for developments in the coming months! – October 2016
The key idea is that you are declaring links between data columns to encoding channels, such as the x-axis, y-axis, color, etc. and the rest of the plot details are handled automatically. Building on this declarative plotting idea, a surprising number of useful plots and visualizations can be created.
One of the unique design philosophies of Altair is that it leverages the Vega-Lite specification to create “beautiful and effective visualizations with minimal amount of code.” What does this mean? The Altair site explains it well:
Altair provides a Python API for building statistical visualizations in a declarative manner. By statistical visualization we mean:
The data source is a DataFrame that consists of columns of different data types (quantitative, ordinal, nominal and date/time).
The DataFrame is in a tidy format where the rows correspond to samples and the columns correspond to the observed variables.
The data is mapped to the visual properties (position, color, size, shape, faceting, etc.) using the group-by operation of Pandas and SQL.
The Altair API contains no actual visualization rendering code but instead emits JSON data structures following the Vega-Lite specification. For convenience, Altair can optionally use ipyvega to display client-side renderings seamlessly in the Jupyter notebook.
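For illustration, a hand-written Vega-Lite specification of the kind Altair emits (the field names and values are invented):

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {
    "values": [
      {"area": 110, "population": 9.5},
      {"area": 340, "population": 38.0},
      {"area": 90, "population": 4.2}
    ]
  },
  "mark": "point",
  "encoding": {
    "x": {"field": "area", "type": "quantitative"},
    "y": {"field": "population", "type": "quantitative"}
  }
}
```

Note how the spec only declares which columns map to which encoding channels; scales, axes, and rendering are derived automatically.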
Where Altair differentiates itself from some of the other tools is that it attempts to interpret the data passed to it and make some reasonable assumptions about how to display it. By making reasonable assumptions, the user can spend more time exploring the data than trying to figure out a complex API for displaying it.
To illustrate this point, here is one very small example of where Altair differs from matplotlib when charting values. In Altair, if I plot a value like 10,000,000, it will display it as 10M whereas default matplotlib plots it in scientific notation (1.0 x 1e7). Obviously it is possible to change the value but trying to figure that out takes away from interpreting the data. You will see more of this behavior in the examples below.
The Altair documentation is an excellent series of notebooks and I encourage folks interested in learning more to check it out. Before going any further, I wanted to highlight one other unique aspect of Altair related to the data format it expects. As described above, Altair expects all of the data to be in tidy format. The general idea is that you wrangle your data into the appropriate format, then use the Altair API to perform various grouping or other data summary techniques for your specific situation. For new users, this may take some time getting used to. However, I think in the long-run it is a good skill to have and the investment in the data wrangling (if needed) will pay off in the end by enforcing a consistent process for visualizing data. If you would like to learn more, I found this article to be a good primer for using pandas to get data into the tidy format.
Vega is a visualization grammar, a declarative language for creating, saving, and sharing interactive visualization designs. With Vega, you can describe the visual appearance and interactive behavior of a visualization in a JSON format, and generate web-based views using Canvas or SVG.
Version 3.0.5
Vega provides basic building blocks for a wide variety of visualization designs: data loading and transformation, scales, map projections, axes, legends, and graphical marks such as rectangles, lines, plotting symbols, etc. Interaction techniques can be specified using reactive signals that dynamically modify a visualization in response to input event streams.
A Vega specification defines an interactive visualization in a JSON format. Specifications are parsed by Vega’s JavaScript runtime to generate both static images or interactive web-based views. Vega provides a convenient representation for computational generation of visualizations, and can serve as a foundation for new APIs and visual analysis tools.
Plotting data in the python ecosystem is a good news/bad news story. The good news is that there are a lot of options. The bad news is that there are a lot of options. Trying to figure out which ones works for you will depend on what you’re trying to accomplish. To some degree, you need to play with the tools to figure out if they will work for you. I don’t see one clear winner or clear loser.
Here are a few of my closing thoughts:
Pandas is handy for simple plots but you need to be willing to learn matplotlib to customize.
Seaborn can support some more complex visualization approaches but still requires matplotlib knowledge to tweak. The color schemes are a nice bonus.
ggplot has a lot of promise but is still going through growing pains.
bokeh is a robust tool if you want to set up your own visualization server but may be overkill for the simple scenarios.
pygal stands alone by being able to generate interactive svg graphs and png files. It is not as flexible as the matplotlib based solutions.
Plotly generates the most interactive graphs. You can save them offline and create very rich web-based visualizations.
As it stands now, I’ll continue to watch progress on the ggplot landscape and use pygal and plotly where interactivity is needed.
The power of machine learning comes from its ability to learn patterns from large amounts of data. Understanding your data is critical to building a powerful machine learning system.
Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
Explore Facets Overview and Facets Dive on the UCI Census Income dataset, used for predicting whether an individual’s income exceeds $50K/yr based on their census data. The census data contains features such as age, education level and occupation for each individual.
Overview takes input feature data from any number of datasets, analyzes them feature by feature and visualizes the analysis.
Overview gives users a quick understanding of the distribution of values across the features of their dataset(s). Uncover several uncommon and common issues such as unexpected feature values, missing feature values for a large number of observations, training/serving skew and train/test/validation set skew.
Facets Overview summarizes statistics for each feature and compares the training and test datasets. It becomes easy to learn the distribution of values across the 6 numeric and 9 categorical features for both datasets.
Use the “Sort by” dropdown to sort features by “Distribution distance”. This sort order brings the features that are most different between the two datasets to the top of the tables. “Target” becomes the first feature in the table of categorical features. The chart for this feature shows that the training and test datasets actually use slightly different labels (“>50K” for the training data and “>50K.” for the test data - notice the trailing period). This helps us uncover an unexpected difference between the training data and the test data.
Dive is a tool for interactively exploring large numbers of data points at once.
Dive provides an interactive interface for exploring the relationship between data points across all of the different features of a dataset. Each individual item in the visualization represents a data point. Position items by "faceting" or bucketing them in multiple dimensions by their feature values. Success stories of Dive include the detection of classifier failure, identification of systematic errors, evaluating ground truth and potential new signals for ranking.
The Dive visualization shows each individual item in the training dataset. Clicking on an individual item reveals key/value pairs that represent the features of that record; values may be strings or numbers.
Using the menus on the left, you can change how the data is organized in order to gain insight into the dataset. Use the “Faceting” menu to do “Row-based faceting” by “Education-num”. Use the “Color” menu to color by “Target”. This will show how higher levels of education are related to whether or not an individual earns more than $50K/yr.
Overview gives a high-level view of one or more data sets. It produces a visual feature-by-feature statistical analysis, and can also be used to compare statistics across two or more data sets. The tool can process both numeric and string features, including multiple instances of a number or string per feature.
Overview can help uncover issues with datasets, including the following:
Unexpected feature values
Missing feature values for a large number of examples
Training/serving skew
Training/test/validation set skew
Key aspects of the visualization are outlier detection and distribution comparison across multiple datasets. Interesting values (such as a high proportion of missing data, or very different distributions of a feature across multiple datasets) are highlighted in red. Features can be sorted by values of interest such as the number of missing values or the skew between the different datasets.
Dive is a tool for interactively exploring up to tens of thousands of multidimensional data points, allowing users to seamlessly switch between a high-level overview and low-level details. Each example is represented as a single item in the visualization and the points can be positioned by faceting/bucketing in multiple dimensions by their feature values. Combining smooth animation and zooming with faceting and filtering, Dive makes it easy to spot patterns and outliers in complex data sets.
The Facets visualizations currently work only in Chrome - Issue 9.
Disclaimer: This is not an official Google product
Note: When visualizing a large amount of data, as is done in the Dive demo Jupyter notebook, you will need to start the notebook server with an increased IOPub data rate. This can be done with the command jupyter notebook --NotebookApp.iopub_data_rate_limit=10000000.
Fun Fact: In large datasets, such as the CIFAR-10 dataset[2], a small human labelling error can easily go unnoticed. We inspected the CIFAR-10 dataset with Dive and were able to catch a frog-cat – an image of a frog that had been incorrectly labelled as a cat!
Exploration of the CIFAR-10 dataset using Facets Dive. Here we facet the ground truth labels by row and the predicted labels by column. This produces a confusion matrix view, allowing us to drill into particular kinds of misclassifications. In this particular case, the ML model incorrectly labels some small percentage of true cats as frogs. The interesting thing we find by putting the real images in the confusion matrix is that one of these "true cats" that the model predicted was a frog is actually a frog from visual inspection. With Facets Dive, we can determine that this one misclassification wasn't a true misclassification of the model, but instead incorrectly labeled data in the dataset.
We’ve gotten great value out of Facets inside of Google and are excited to share the visualizations with the world. We hope they can help you discover new and interesting things about your data that lead you to create more powerful and accurate machine learning models. And since they are open source, you can customize the visualizations for your specific needs or contribute to the project to help us all better understand our data. If you have feedback about your experience with Facets, please let us know what you think.
folium builds on the data wrangling strengths of the Python ecosystem and the mapping strengths of the Leaflet.js library. Manipulate your data in Python, then visualize it on a Leaflet map via folium.
More than 1.5 million Instagram posts have been gathered to create this interactive infographic. All of the posts are geo-tagged so that mapping them out was possible.
The colors on the map show density and sentiments of Instagram posts across Hong Kong.
Apache Superset is a data exploration and visualization web application.
Superset provides:
An intuitive interface to explore and visualize datasets, and create interactive dashboards.
A wide array of beautiful visualizations to showcase your data.
Easy, code-free, user flows to drill down and slice and dice the data underlying exposed dashboards. The dashboards and charts act as a starting point for deeper analysis.
A state of the art SQL editor/IDE exposing a rich metadata browser, and an easy workflow to create visualizations out of any result set.
An extensible, high granularity security model allowing intricate rules on who can access which product features and datasets. Integration with major authentication backends (database, OpenID, LDAP, OAuth, REMOTE_USER, ...)
A lightweight semantic layer, allowing control over how data sources are exposed to the user by defining dimensions and metrics
Out of the box support for most SQL-speaking databases
Deep integration with Druid allows for Superset to stay blazing fast while slicing and dicing large, realtime datasets
Fast loading dashboards with configurable caching
On top of having the ability to query your relational databases, Superset ships with deep integration with Druid (a real-time distributed column-store). When querying Druid, Superset can query humongous amounts of data on top of real-time datasets. Note that Superset does not require Druid in any way to function; it's simply another database backend that it can query.
MySQL
Postgres
Vertica
Oracle
Microsoft SQL Server
SQLite
Greenplum
Firebird
MariaDB
Sybase
IBM DB2
Exasol
MonetDB
Snowflake
Redshift
More! Look for the availability of a SQLAlchemy dialect for your database to find out whether it will work with Superset.
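Superset identifies each database by its SQLAlchemy connection URI, so checking compatibility amounts to checking that a dialect exists and connects. A quick sketch using the SQLite dialect bundled with SQLAlchemy (URIs for the other databases follow the same dialect+driver://user:password@host/dbname pattern; the example URIs in the comment are illustrative):

```python
from sqlalchemy import create_engine, text

# e.g. "postgresql://user:secret@dbhost/analytics" or "mysql://user:secret@dbhost/analytics"
engine = create_engine("sqlite:///:memory:")

# a trivial round-trip proves the dialect can connect and run SQL
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()
print(value)  # 1
```

If `create_engine` accepts the URI and a trivial query succeeds, Superset can usually register the database with that same URI.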