Compute stats from a dictionary and populate a new column in a DataFrame

Time:02-17

Below is a DataFrame that contains summary information for a few records, along with a dictionary that contains more detailed information for a subset of those records. In reality the DataFrame and the dictionary contain thousands of entries, but for simplicity I've limited the entries here.

import pandas as pd

df = pd.DataFrame({'Date': ['2022-02-14', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-14'],
                   'count': [10, 10, 10, 9, 9]},
                  index=['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'])

df
            Date  count
NNI   2022-02-14     10
NVEC  2022-02-14     10
IPA   2022-02-14     10
LYTS  2022-02-14      9
MYN   2022-02-14      9

dct = {'NNI': pd.DataFrame({'s': [-1, -1, -1],
                            'count': [13, 11, 10]},
                           index=['2007-07-13', '2019-09-18', '2016-08-01']),
       'NVEC': pd.DataFrame({'s': [-1, -1, -1],
                             'count': [12, 10, 9]},
                            index=['2012-10-09', '2018-10-01', '2022-02-01'])}

dct
{'NNI':        s  count
 2007-07-13   -1     13
 2019-09-18   -1     11
 2016-08-01   -1     10,
 'NVEC':       s  count
 2012-10-09   -1     12
 2018-10-01   -1     10
 2022-02-01   -1      9}

My question is:

How can I add a new column called ratio to the DataFrame that matches the keys in the dictionary against the index of the DataFrame and then computes the simple ratio A / B for each match?

where,
A = the number of rows in the dictionary entry whose count value is > the DataFrame count value for the respective key/index (e.g. for NNI this would be 2),

B = the number of rows in the dictionary entry whose count value is >= the DataFrame count value for the respective key/index (e.g. for NNI this would be 3)

Preferably I'm looking for a way to accomplish this that does not rely on for loops.
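As a sanity check on these definitions, the NNI numbers can be reproduced by hand. This is only a sketch: nni_counts and threshold are hard-coded copies of dct['NNI']['count'] and df.loc['NNI', 'count'], and the variable names are made up for illustration.

```python
import pandas as pd

# Hard-coded copies of the example data (names are illustrative only).
nni_counts = pd.Series([13, 11, 10])  # dct['NNI']['count']
threshold = 10                        # df.loc['NNI', 'count']

a = int((nni_counts > threshold).sum())   # 13 and 11 are strictly greater
b = int((nni_counts >= threshold).sum())  # 13, 11 and 10 all qualify
ratio = a / b

print(a, b, round(ratio, 4))  # 2 3 0.6667
```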

So the desired result is:

            Date  s  count   ratio
NNI   2022-02-14 -1     10  0.6667
NVEC  2022-02-14 -1     10  0.5000
IPA   2022-02-14 -1     10  0.0000
LYTS  2022-02-14 -1      9  0.0000
MYN   2022-02-14 -1      9  0.0000

Answer:

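You could use a dict comprehension. Iterate over dct, compare each entry's count column with the matching count in df, wrap the results in a Series, and assign it to df. Index labels that have no entry in dct come out as NaN, which fillna(0) then replaces with 0: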

df = df.assign(ratio=pd.Series({k: v['count'].gt(df.loc[k, 'count']).sum() / v['count'].ge(df.loc[k, 'count']).sum() 
                                for k,v in dct.items()})).fillna(0)

Output:

            Date  count     ratio
NNI   2022-02-14     10  0.666667
NVEC  2022-02-14     10  0.500000
IPA   2022-02-14     10  0.000000
LYTS  2022-02-14      9  0.000000
MYN   2022-02-14      9  0.000000
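If the one-liner is hard to read, the same idea can be split into a small helper. The sketch below is a variation, not a replacement for the answer above: count_ratio, ratios and result are hypothetical names, and two behaviours are assumptions on my part rather than requirements from the question. It returns 0.0 when B is 0, and it skips dictionary keys that are absent from df. It also fills only the ratio column instead of calling fillna(0) on the whole frame.

```python
import pandas as pd

df = pd.DataFrame(
    {'Date': ['2022-02-14'] * 5, 'count': [10, 10, 10, 9, 9]},
    index=['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'],
)
dct = {
    'NNI': pd.DataFrame({'s': [-1, -1, -1], 'count': [13, 11, 10]},
                        index=['2007-07-13', '2019-09-18', '2016-08-01']),
    'NVEC': pd.DataFrame({'s': [-1, -1, -1], 'count': [12, 10, 9]},
                         index=['2012-10-09', '2018-10-01', '2022-02-01']),
}


def count_ratio(history, threshold):
    """Hypothetical helper: A / B as defined in the question.

    Returns 0.0 when B is 0 (an assumption; the question does not say).
    """
    a = history['count'].gt(threshold).sum()  # strictly greater
    b = history['count'].ge(threshold).sum()  # greater or equal
    return a / b if b else 0.0


# Only keys that also appear in df's index are evaluated (assumption).
ratios = pd.Series({k: count_ratio(v, df.loc[k, 'count'])
                    for k, v in dct.items() if k in df.index})

# Align on df's index; labels without a dictionary entry get 0.0.
result = df.assign(ratio=ratios.reindex(df.index, fill_value=0.0))
print(result)
```

Restricting the fill to the ratio column means a genuine NaN elsewhere in df would not be silently overwritten.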

Answer:

Here's one approach that doesn't use a for-loop:

(i) Concatenate dct and join it with df.

(ii) Calculate ratio on the joined DataFrame and concatenate it back to df.

(iii) Drop duplicate index entries, keeping the rows that carry a ratio, and fill the remaining gaps with 0.

joined = pd.concat(dct).reset_index(level=1).join(df, lsuffix='_dct', rsuffix='')
joined['ratio'] = (joined['count_dct'].gt(joined['count']).groupby(level=0).sum() 
                   / joined['count_dct'].ge(joined['count']).groupby(level=0).sum())
out = pd.concat((joined.drop(columns=['level_1','s','count_dct']).drop_duplicates(), df))
out = out[~out.index.duplicated(keep='first')].fillna(0)

Output:

            Date  count     ratio
NNI   2022-02-14   10.0  0.666667
NVEC  2022-02-14   10.0  0.500000
IPA   2022-02-14   10.0  0.000000
LYTS  2022-02-14    9.0  0.000000
MYN   2022-02-14    9.0  0.000000
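A somewhat shorter loop-free variant is also possible. The following is a sketch under my own assumptions, not a tested drop-in for large data: it stacks the dictionary into one Series, looks up each row's threshold from df by key, and counts the comparisons per key. The names hist, keys, threshold, a, b and out are made up for illustration, and it leaves df's existing columns untouched.

```python
import pandas as pd

df = pd.DataFrame(
    {'Date': ['2022-02-14'] * 5, 'count': [10, 10, 10, 9, 9]},
    index=['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'],
)
dct = {
    'NNI': pd.DataFrame({'s': [-1, -1, -1], 'count': [13, 11, 10]},
                        index=['2007-07-13', '2019-09-18', '2016-08-01']),
    'NVEC': pd.DataFrame({'s': [-1, -1, -1], 'count': [12, 10, 9]},
                         index=['2012-10-09', '2018-10-01', '2022-02-01']),
}

# Stack all per-key frames; the outer index level holds the dictionary key.
hist = pd.concat(dct)['count']
keys = hist.index.get_level_values(0)

# Threshold for every stacked row, looked up from df by key.
threshold = df['count'].reindex(keys).to_numpy()

# Count the rows per key that satisfy each comparison.
a = pd.Series(hist.to_numpy() > threshold, index=keys).groupby(level=0).sum()
b = pd.Series(hist.to_numpy() >= threshold, index=keys).groupby(level=0).sum()

# Align with df; labels without a dictionary entry (or with B == 0) get 0.
ratio = (a / b).reindex(df.index).fillna(0)
out = df.assign(ratio=ratio)
print(out)
```

Because the ratio is computed as a separate Series and attached with assign, no suffixes, column drops, or duplicate removal are needed.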

That said, I think it's simpler and potentially more efficient to do it with a dict comprehension.
