Below is a DataFrame that contains summary information for a few records, along with a dictionary that contains more detailed information for a subset of those records. In reality the DataFrame and the dictionary contain thousands of entries, but for simplicity I've limited them here.
import pandas as pd
df = pd.DataFrame({'Date': ['2022-02-14', '2022-02-14', '2022-02-14', '2022-02-14', '2022-02-14'],
'count': [10, 10, 10, 9, 9],},
index = ['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'])
df
Date count
NNI 2022-02-14 10
NVEC 2022-02-14 10
IPA 2022-02-14 10
LYTS 2022-02-14 9
MYN 2022-02-14 9
dct = {'NNI' : pd.DataFrame({'s': [-1, -1, -1],
'count': [13, 11, 10]},
index =['2007-07-13', '2019-09-18', '2016-08-01']),
'NVEC' : pd.DataFrame({'s': [-1, -1, -1],
'count': [12, 10, 9]},
index =['2012-10-09', '2018-10-01', '2022-02-01'])
}
dct
{'NNI': s count
2007-07-13 -1 13
2019-09-18 -1 11
2016-08-01 -1 10,
'NVEC': s count
2012-10-09 -1 12
2018-10-01 -1 10
2022-02-01 -1 9}
My question is:
How can I populate a new column in the DataFrame called ratio that matches the keys in the dictionary with the index of the DataFrame and then computes the simple ratio A / B?
where,
A = the number of rows in which the dictionary's count column > the value of the DataFrame's count column for the respective key/index (e.g. for NNI this would be 2),
B = the number of rows in which the dictionary's count column >= the value of the DataFrame's count column for the respective key/index (e.g. for NNI this would be 3).
Preferably, I'm looking for a way to accomplish this that does not rely on for loops.
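As a quick check of the definitions of A and B, here is the NNI case worked by hand. This is only a sketch: the names nni, threshold, A, B and ratio are illustrative, and the threshold is hard-coded to the value of df.loc['NNI', 'count'] from the example above.

```python
import pandas as pd

# Detail rows for NNI, copied from the dictionary in the question.
nni = pd.DataFrame({'s': [-1, -1, -1], 'count': [13, 11, 10]},
                   index=['2007-07-13', '2019-09-18', '2016-08-01'])

# Value of df.loc['NNI', 'count'] in the summary DataFrame.
threshold = 10

A = int((nni['count'] > threshold).sum())    # strictly greater: 13 and 11
B = int((nni['count'] >= threshold).sum())   # greater or equal: 13, 11 and 10
ratio = A / B

print(A, B, round(ratio, 4))  # 2 3 0.6667
```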
So the desired result is:
Date s count ratio
NNI 2022-02-14 -1 10 0.6667
NVEC 2022-02-14 -1 10 0.5000
IPA 2022-02-14 -1 10 0.0000
LYTS 2022-02-14 -1 9 0.0000
MYN 2022-02-14 -1 9 0.0000
CodePudding user response:
You could use a dict comprehension as well. Traverse dct and compare values, wrap the result in a Series, and assign it to df:
df = df.assign(ratio=pd.Series({k: v['count'].gt(df.loc[k, 'count']).sum() / v['count'].ge(df.loc[k, 'count']).sum()
for k,v in dct.items()})).fillna(0)
Output:
Date count ratio
NNI 2022-02-14 10 0.666667
NVEC 2022-02-14 10 0.500000
IPA 2022-02-14 10 0.000000
LYTS 2022-02-14 9 0.000000
MYN 2022-02-14 9 0.000000
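If the real data contains keys in dct that are missing from df, or cases where no dictionary row reaches the threshold (so B is 0), a slightly more defensive variant may help. The following is a sketch under those assumptions, not a drop-in replacement: ratio_for is a hypothetical helper name, and returning 0.0 when B is 0 is an assumed convention chosen to match the fillna(0) above.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2022-02-14'] * 5, 'count': [10, 10, 10, 9, 9]},
                  index=['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'])
dct = {'NNI': pd.DataFrame({'s': [-1, -1, -1], 'count': [13, 11, 10]},
                           index=['2007-07-13', '2019-09-18', '2016-08-01']),
       'NVEC': pd.DataFrame({'s': [-1, -1, -1], 'count': [12, 10, 9]},
                            index=['2012-10-09', '2018-10-01', '2022-02-01'])}

def ratio_for(detail, threshold):
    """Hypothetical helper: A / B for one key, or 0.0 when B is 0 (assumed convention)."""
    above = detail['count'].gt(threshold).sum()        # A
    at_or_above = detail['count'].ge(threshold).sum()  # B
    return above / at_or_above if at_or_above else 0.0

# Skip dictionary keys that are absent from df instead of raising a KeyError.
ratios = pd.Series({k: ratio_for(v, df.loc[k, 'count'])
                    for k, v in dct.items() if k in df.index}, dtype=float)

# Fill only the new column, so the other columns of df are left untouched.
df['ratio'] = ratios.reindex(df.index, fill_value=0.0)
print(df)
```

Filling only the ratio column, rather than calling fillna(0) on the whole DataFrame, avoids overwriting genuine missing values in other columns.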
CodePudding user response:
Here's one approach that doesn't use a for-loop:
(i) Concatenate dct and join it with df.
(ii) Calculate ratio on the joined DataFrame and concatenate it back to df.
(iii) Drop duplicates.
joined = pd.concat(dct).reset_index(level=1).join(df, lsuffix='_dct', rsuffix='')
joined['ratio'] = (joined['count_dct'].gt(joined['count']).groupby(level=0).sum()
/ joined['count_dct'].ge(joined['count']).groupby(level=0).sum())
out = pd.concat((joined.drop(columns=['level_1','s','count_dct']).drop_duplicates(), df))
out = out[~out.index.duplicated(keep='first')].fillna(0)
Output:
Date count ratio
NNI 2022-02-14 10.0 0.666667
NVEC 2022-02-14 10.0 0.500000
IPA 2022-02-14 10.0 0.000000
LYTS 2022-02-14 9.0 0.000000
MYN 2022-02-14 9.0 0.000000
That said, I think it's simpler and potentially more efficient to do it with a dict comprehension.
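If a loop-free version is still preferred, the same idea can be written a little more compactly by stacking the dictionary once and looking up each row's threshold from df. This is a sketch, assuming the index of df is unique; the level names 'key' and 'date' and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2022-02-14'] * 5, 'count': [10, 10, 10, 9, 9]},
                  index=['NNI', 'NVEC', 'IPA', 'LYTS', 'MYN'])
dct = {'NNI': pd.DataFrame({'s': [-1, -1, -1], 'count': [13, 11, 10]},
                           index=['2007-07-13', '2019-09-18', '2016-08-01']),
       'NVEC': pd.DataFrame({'s': [-1, -1, -1], 'count': [12, 10, 9]},
                            index=['2012-10-09', '2018-10-01', '2022-02-01'])}

# Stack all detail frames into one Series indexed by (key, date).
stacked = pd.concat(dct, names=['key', 'date'])['count']
keys = stacked.index.get_level_values('key')

# Threshold for every stacked row, looked up from df by key.
threshold = df['count'].reindex(keys).to_numpy()

# Count rows above (A) and at-or-above (B) the threshold, per key.
above = pd.Series(stacked.to_numpy() > threshold, index=keys).groupby(level=0).sum()
at_or_above = pd.Series(stacked.to_numpy() >= threshold, index=keys).groupby(level=0).sum()

# Keys without detail rows (or with B == 0) end up as NaN and are set to 0.
ratio = (above / at_or_above).reindex(df.index).fillna(0.0)
out = df.assign(ratio=ratio)
print(out)
```

Because nothing is joined back onto df, the count column keeps its integer dtype and no duplicate rows need to be dropped.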