I have two pandas DataFrames like this:
df = pd.DataFrame([
    ['ABC0876', 20200103],
    ['DEG9871', 20210201],
    ['USQ5321', 20201101],
    ['RCF2345', 20211004]
], columns=['Key', 'PricingDate'])
master = pd.DataFrame([
    ['ABC0876', 'ZS01', 20190101, 20200130],
    ['ABC0876', 'ZS02', 20200101, 20200213],
    ['DEG9871', 'ZSD1', 20210201, 20220121],
    ['USQ5321', 'ZS01', 20201001, 20220908],
    ['RCF2345', 'ZS02', 20201004, 20211013]
], columns=['Key', 'Condition', 'Valid_From', 'Valid_To'])
The aim is to get the unique valid conditions for each key. I tried the following:
def Numconditions(Key, PricingDate):
    unqconds = master.loc[(master['Key'] == Key) & (master['Valid_To'] >= PricingDate), 'Condition'].unique()
    return unqconds

df['Unqconds'] = df.apply(lambda row: Numconditions(row['Key'], row['PricingDate']), axis=1)
The dataframes are huge and this runs for a long time. Can anybody suggest a quicker solution that needs less run time?
CodePudding user response:
You can merge your two dataframes once, filter the merged rows on the condition, and then reduce them to a list of unique values per key. This avoids the slow row-by-row `apply`:
out = pd.merge(df, master, on='Key', how='left') \
        .query('Valid_To >= PricingDate') \
        .groupby('Key', sort=False)['Condition'].unique() \
        .rename('Unqconds')
df = df.set_index('Key').join(out).reset_index()
Output:
>>> df
       Key  PricingDate      Unqconds
0  ABC0876     20200103  [ZS01, ZS02]
1  DEG9871     20210201        [ZSD1]
2  USQ5321     20201101        [ZS01]
3  RCF2345     20211004        [ZS02]
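For reference, here is the whole thing as one self-contained script, combining the question's sample data with the merge-based approach above (this is a sketch of the same technique, not a new method):

```python
import pandas as pd

# Rebuild the example frames from the question.
df = pd.DataFrame([
    ['ABC0876', 20200103],
    ['DEG9871', 20210201],
    ['USQ5321', 20201101],
    ['RCF2345', 20211004]
], columns=['Key', 'PricingDate'])

master = pd.DataFrame([
    ['ABC0876', 'ZS01', 20190101, 20200130],
    ['ABC0876', 'ZS02', 20200101, 20200213],
    ['DEG9871', 'ZSD1', 20210201, 20220121],
    ['USQ5321', 'ZS01', 20201001, 20220908],
    ['RCF2345', 'ZS02', 20201004, 20211013]
], columns=['Key', 'Condition', 'Valid_From', 'Valid_To'])

# One merge plus one vectorized filter replaces the per-row apply:
# merge attaches every candidate condition to its key, query keeps
# only the still-valid ones, and groupby collects the unique set.
out = (
    pd.merge(df, master, on='Key', how='left')
      .query('Valid_To >= PricingDate')
      .groupby('Key', sort=False)['Condition'].unique()
      .rename('Unqconds')
)

# Attach the result back to df via the Key index.
df = df.set_index('Key').join(out).reset_index()
print(df)
```

The `sort=False` keeps keys in their order of first appearance, and the final `join` preserves the original row order of `df`.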