The uncertain panda

uncertain_panda helps you calculate uncertainties of quantities computed on your pandas data frames, using the method of bootstrapping.


Why is the panda uncertain?

Have you ever calculated quantities on your pandas data frame/series and wanted to know their uncertainty? Have you ever wondered whether the difference in the average of two methods is significant?

Then you want to have an uncertain panda!

uncertain_panda helps you calculate uncertainties on arbitrary quantities derived from your pandas data frame, e.g. mean, median, quantile, min/max, or any other function on pandas data frames!

You can use any measured data (e.g. from A/B testing, recorded data from an experiment, or any other kind of tabular data), calculate any quantity using pandas, and uncertain_panda will give you the uncertainty on this quantity.

How to use it?

First, install the package

pip install uncertain_panda

Now, just import pandas from the uncertain_panda package and add unc before every calculation to get the value with its uncertainty:

from uncertain_panda import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6, 7])
series.unc.mean()

That’s it! The return value is an instance of the uncertainty Variable from the superb uncertainties package. As this package already knows how to calculate with uncertainties, you can use the results as if they were normal numbers in your calculations.

series.unc.mean() + 2 * series.unc.std()

Super easy!
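
Since the result is an uncertainties Variable, its nominal value and one-sigma uncertainty are directly accessible. As a minimal sketch building on the series defined above (nominal_value and std_dev are attributes from the uncertainties package, not of uncertain_panda itself):

result = series.unc.mean()

# The nominal value is the plain mean of the series,
# std_dev is the bootstrapped one-sigma uncertainty of the mean.
print(result.nominal_value)
print(result.std_dev)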

You can find some more examples in Examples.

Comparison in A/B testing

Suppose you have done some A/B testing with a brand new feature you want to introduce. You have measured the quality of your service before (A) and after (B) the feature introduction. The average quality is better, but is the change significant?

A first measure for this problem might be the uncertainty of the average, so let's calculate it:

data_frame.groupby("feature_introduced").quality.unc.mean()

which will not only give you the two average qualities but also their uncertainties.
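
Building on that, a rough sketch of a significance check could look like the following (assuming the groupby result is a pandas series with one uncertainties Variable per group, as described above; the exact criterion you apply is up to you):

means = data_frame.groupby("feature_introduced").quality.unc.mean()

# Difference of the two group means (B minus A); the uncertainties
# package propagates the uncertainty of the difference automatically.
difference = means.iloc[1] - means.iloc[0]

# How many sigmas is the difference away from zero?
print(abs(difference.nominal_value) / difference.std_dev)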

Features

The development has just started and there is a lot that can still be added. Here is a list of features that are already implemented:

  • Automatic calculation of uncertainties of every built-in pandas function for

    • data frames
    • series
    • grouped data frames

    using the prefix unc before the function name, e.g.

    df.unc.mean()
    

    In the background, it uses the method of bootstrapping (see below) to calculate the uncertainties.

  • Calculate confidence intervals (instead of symmetric one-sigma uncertainties) or get back the underlying bootstrap distribution with

    df.unc.mean().bs()  # for the bootstrap distribution
    df.unc.mean().ci(0.3, 0.8)  # for the confidence interval between 0.3 and 0.8
    
  • Optional usage of dask for large data samples. Enable it with

    df.unc.mean(pandas=False)
    

    to use dask instead of pandas.

  • Plotting functionality for uncertainties with

    df.unc.mean().plot_with_uncertainties(kind="bar")
    

    for a nice error-bar plot.

  • Fully configurable bootstrapping, using either pandas built-in methods or dask (optionally enabled). Just pass the options to the called method, e.g.

    df.unc.mean(number_of_draws=300)
    

    to use 300 draws in the bootstrapping.

How does it work?

Under the hood, uncertain_panda uses bootstrapping to calculate the uncertainties. You can find more information on bootstrapping in Bootstrapping.
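
To illustrate the idea with plain pandas (a hand-written sketch of bootstrapping the mean, not the internal implementation of uncertain_panda):

import numpy as np
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6, 7])

# Draw many resamples of the same size as the data (with replacement)
# and calculate the quantity of interest, here the mean, on each of them.
draws = [series.sample(frac=1, replace=True).mean() for _ in range(1000)]

# The spread of the resampled means estimates the uncertainty of the mean.
print(np.mean(draws), np.std(draws))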

Other packages

There are probably plenty of other packages out there for this job that I am not aware of. The best known is probably the bootstrapped package. Compared to that package, uncertain_panda tries to automate the quantity calculation and works for arbitrary functions. Also, it can use dask for the calculation. bootstrapped, on the other hand, is very nice for sparse arrays, which is not (yet) supported in uncertain_panda.