The uncertain panda¶
uncertain_panda helps you construct uncertainties for quantities calculated on your pandas data frames by applying the method of bootstrapping.
Why is the panda uncertain?¶
Have you ever calculated quantities on your pandas data frame or series and wanted to know their uncertainty? Did you ever wonder whether the difference between the averages of two methods is significant?
Then you want to have an uncertain panda!
uncertain_panda helps you calculate uncertainties on arbitrary quantities related to your pandas data frame: the max and every other arbitrary function on pandas data frames!
You can use any measured data (e.g. from A/B testing, recorded data from an experiment, or any other type of tabular data) and calculate any quantity on it with pandas.
uncertain_panda will give you the uncertainty on this quantity.
How to use it?¶
First, install the package:
pip install uncertain_panda
Now just import pandas from the uncertain_panda package and prefix every calculation with unc to get the value together with its uncertainty:
from uncertain_panda import pandas as pd

series = pd.Series([1, 2, 3, 4, 5, 6, 7])
series.unc.mean()
The return value is an instance of the Variable class from the superb uncertainties package.
As this package already knows how to calculate with uncertainties, you can use the results as if they were normal numbers in your calculations:
series.unc.mean() + 2 * series.unc.std()
You can find some more examples in Examples.
Comparison in A/B testing¶
Suppose you have done some A/B testing with a brand new feature you want to introduce. You have measured the quality of your service before (A) and after (B) the feature introduction. The average quality is better, but is the change significant?
A first measure for this problem might be the uncertainty of the average, so let's calculate it. This will not only give you the two average qualities but also their uncertainties.
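The idea behind this comparison can be sketched without the library: bootstrap each sample, take the spread of the resampled means as the uncertainty, and compare the difference of the averages against the combined uncertainty. This is a plain-Python sketch with made-up data, not the uncertain_panda API:

```python
import random
import statistics

def bootstrap_mean(data, n_draws=1000, seed=42):
    """Mean of the data plus its bootstrap uncertainty."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_draws)
    ]
    return statistics.mean(data), statistics.stdev(means)

quality_a = [3.1, 2.9, 3.4, 3.0, 3.2, 2.8, 3.3]  # before the feature
quality_b = [3.6, 3.4, 3.9, 3.5, 3.7, 3.3, 3.8]  # after the feature

mean_a, err_a = bootstrap_mean(quality_a)
mean_b, err_b = bootstrap_mean(quality_b)

# Difference of the averages and its combined uncertainty (in quadrature)
diff = mean_b - mean_a
err = (err_a ** 2 + err_b ** 2) ** 0.5
print(f"B - A = {diff:.2f} +- {err:.2f}")
```

If the difference is several times larger than its uncertainty, the change is likely significant. With uncertain_panda, the two uncertainties come directly from calls like series.unc.mean() on the two samples.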
The development has just started and there is a lot that can still be added. Here is a list of already implemented features:
Automatic calculation of uncertainties for every built-in pandas function on
- data frames
- grouped data frames
using the prefix unc before the function name, e.g. df.unc.mean().
In the background, it uses the method of bootstrapping (see below) to calculate the uncertainties.
Calculate confidence intervals (instead of symmetric one-sigma uncertainties) or get back the basic bootstrapping distribution with
df.unc.mean().bs()          # for the bootstrap distribution
df.unc.mean().ci(0.3, 0.8)  # for the confidence interval between 0.3 and 0.8
Optional usage of dask for large data samples. Enable it by importing dask instead of pandas from the uncertain_panda package.
Plotting functionality for uncertainties, giving a nice error-bar plot.
Fully configurable bootstrapping, using either pandas built-in methods or dask (optionally enabled). Just pass the options to the called method, e.g. to use 300 draws in the bootstrapping.
How does it work?¶
Under the hood, uncertain_panda uses bootstrapping to calculate the uncertainties.
You can find more information on bootstrapping in Bootstrapping.
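As a rough sketch of the idea (plain Python, not the library's internals): resample the data with replacement many times, evaluate the statistic on each resample, and use the spread of those values as the uncertainty. Percentiles of the same distribution give confidence intervals:

```python
import random
import statistics

def bootstrap(data, statistic, n_draws=2000, seed=0):
    """Return the bootstrap distribution of `statistic` over `data`."""
    rng = random.Random(seed)
    return [
        statistic(rng.choices(data, k=len(data)))
        for _ in range(n_draws)
    ]

data = [1, 2, 3, 4, 5, 6, 7]
dist = bootstrap(data, statistics.mean)

# One-sigma uncertainty: the standard deviation of the distribution
uncertainty = statistics.stdev(dist)

# Confidence interval: percentiles of the same distribution
dist.sort()
lower = dist[int(0.16 * len(dist))]
upper = dist[int(0.84 * len(dist))]
print(f"mean = {statistics.mean(data):.2f} +- {uncertainty:.2f}, "
      f"68% CI = [{lower:.2f}, {upper:.2f}]")
```

The same distribution is what the .bs() accessor returns, and the .ci() accessor corresponds to the percentile step.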
There are probably plenty of packages out there for this job that I am not aware of.
The best known is probably the bootstrapped package.
Compared to this package,
uncertain_panda tries to automate the quantity calculation
and works for arbitrary functions.
Also, it can use
dask for the calculation.
bootstrapped, on the other hand, is very nice for sparse arrays, which is not (yet) implemented in uncertain_panda.