Home > Net >  Apply test on binned data back to original data
Apply test on binned data back to original data

Time:12-26

I have a pandas DataFrame with several columns, one of which is time t. I know how to bin these data based on time, and apply a test to the mean values of certain variables in each time bin. I now wish to apply the results of the tests on the binned data back to the original data, ie if the time of an event falls in a bin that passes or fails, that event passes or fails. My example code is below, and I am looking for suggestions for the last step.

import numpy as np
import pandas as pd

# Generate data
t = np.random.uniform(0, 100, 1000)
a = np.random.uniform(-1, 1, len(t))
b = np.random.uniform(-5, -1, len(t))
c = np.random.uniform(10, 100, len(t))

# Create a data frame
df = pd.DataFrame(zip(t, a, b, c), columns=['t', 'a', 'b', 'c'])
df.sort_values('t', inplace=True)
print(df)

# Bin data over time with mean values
binEdges = np.linspace(0, 100, 11)
dfBinAv = df.groupby(pd.cut(df['t'], binEdges)).mean()

# Test mean values
dfBinAv['a test'] = [aa >=0 for aa in dfBinAv['a']]
dfBinAv['b test'] = [-4 <= bb for bb in dfBinAv['b']]
dfBinAv['Test'] = [aa and bb  for aa, bb in zip(dfBinAv['a test'], dfBinAv['b test'])]
print(dfBinAv)

# Add columns to original, unbinned DataFrame, containing whether or not variable t falls in a pass or fail bin for each test???
# ???
# For each line in df, 
# figure out what line of dfBinAv it maps to, 
# and bring back the values of the tests to new columns
# At the end, df should have new columns 'a test', 'b test' and 'Test'

CodePudding user response:

Just try with transform

g = df.groupby(pd.cut(df['t'], binEdges))
df['a test'] = g['a'].transform('mean')>=0
df['b test'] = g['b'].transform('mean')>=-4
df['Test'] = df['a test'] & df['b test'] 

CodePudding user response:

Assuming you need the intermediate dataframe dfBinAv, you can merge the result using a temporary "group" column:

(df.assign(group=pd.cut(df['t'], binEdges))
   .merge(dfBinAv.filter(regex='[tT]est'), left_on='group', right_index=True)
   .drop(columns='group')
)

output:

             t         a         b          c  a test  b test   Test
434   0.055110  0.809890 -3.682123  11.769749    True    True   True
333   0.067866  0.747242 -3.274282  80.942789    True    True   True
867   0.219043 -0.197045 -2.891429  77.233564    True    True   True
318   0.267738 -0.970197 -3.640299  94.792442    True    True   True
471   0.333712  0.170670 -4.085683  27.816877    True    True   True
..         ...       ...       ...        ...     ...     ...    ...
967  99.352415  0.573914 -2.966360  51.006924   False    True  False
501  99.357841  0.132028 -2.967747  28.956472   False    True  False
590  99.464089 -0.800425 -1.357792  94.359697   False    True  False
525  99.536830  0.736201 -1.096155  33.528584   False    True  False
957  99.958924 -0.092601 -3.163905  78.502521   False    True  False

[1000 rows x 7 columns]
  • Related