Home > Blockchain >  Lookup group quartile value based on individual value in Python DataFrame
Lookup group quartile value based on individual value in Python DataFrame

Time:09-20

I have a DataFrame df1 containing daily time-series of IDs and Scores in different countries C. For the countries, I have an additional DataFrame df2 which contains for each country 4 quartiles Q with quantile scores Q_Scores.

df1:

Date       ID  C    Score
20220102   A   US   12.6
20220103   A   US   11.3
20220104   A   US   13.2
20220105   A   US   14.5
20220102   B   US   9.8
20220103   B   US   19.8
20220104   B   US   12.3
20220105   B   US   15.1
20220102   C   GB   13.5
20220103   C   GB   14.5
20220104   C   GB   11.5
20220105   C   GB   14.8
df2:

Date      C   Q  Q_Score
20220102  US  1  10
20220103  US  2  13
20220104  US  3  16
20220105  US  4  20
20220102  GB  1  12
20220103  GB  2  13
20220104  GB  3  14
20220105  GB  4  15

I try to lookup the quartile scores Q_Score and create df3 with an additional column called Q_Scores. A specific score should lookup the next bigger quartile score from df2 for a specific country. For example:

  • 20220104 / A / US: Score = 13.2 --> next bigger quartile score on that date in the US is 16 --> Q-Score: 16
df3:

Date       ID  C    Score  Q_Score
20220102   A   US   12.6   13
20220103   A   US   11.3   13
20220104   A   US   13.2   16
20220105   A   US   14.5   16
20220102   B   US   9.8    10
20220103   B   US   19.8   20
20220104   B   US   12.3   13
20220105   B   US   15.1   16
20220102   C   GB   13.5   14
20220103   C   GB   14.5   15
20220104   C   GB   11.5   12
20220105   C   GB   14.8   15

Because the Score and Q_Score don't match, I wasn't able to do it with a simple pd.merge().

CodePudding user response:

You can use pd.merge_asof, but you need some processing:

# two data must have the same data type
df2['Q_Score'] = df2['Q_Score'].astype('float64')

# keys must be sorted
pd.merge_asof(df1.sort_values('Score'),
              df2.drop(['Date','Q'], axis=1).sort_values('Q_Score'), 
              by=['C'],
              left_on='Score', 
              right_on='Q_Score',
              direction='forward'
             ).sort_values(['ID','Date'])

Output:

        Date ID   C  Score  Q_Score
4   20220102  A  US   12.6     13.0
1   20220103  A  US   11.3     13.0
5   20220104  A  US   13.2     16.0
7   20220105  A  US   14.5     16.0
0   20220102  B  US    9.8     10.0
11  20220103  B  US   19.8     20.0
3   20220104  B  US   12.3     13.0
10  20220105  B  US   15.1     16.0
6   20220102  C  GB   13.5     14.0
8   20220103  C  GB   14.5     15.0
2   20220104  C  GB   11.5     12.0
9   20220105  C  GB   14.8     15.0
  • Related