GroupBy transform median with date filter pandas-CodePudding

I have 2 dataframes:

df1:

artist_id	concert_date	region_id
12345	2019-10	22
33322	2018-11	44

df2:

artist_id	date	region_id	popularity
12345	2019-10	22	76
12345	2019-11	44	23

I need to add the median of the artist's popularity (which needs to be calculated only for the last 3 months before the concert and only for the same region) from the second table to the first.

That is, the first table should look like (figures are invented, the point is not in them now): df1:

artist_id	concert_date	region_id	popularity_median_last3month
12345	2019-10	22	55
33322	2018-11	44	44

Right now I'm using the following loop:

df1['popularity_median_last3month'] = pd.Series(dtype='int')

for i in range(len(df1)):
    df1['popularity_median_last3month'].values[i] = df2[(df2.artist_id==df1.artist_id.values[i])&(df2.region_id==df1.region_id.values[i])&(df2.date<=df1.concert_date.values[i])][-3:].popularity.median()

however, it takes too long with a large amount of data.

Please tell me how to avoid the loop

CodePudding user response：

Here's a way to do this without a python loop:

df3 = df1.merge(df2, on=['artist_id', 'region_id'])
df3 = df3[df3.date >= df3.concert_date - pd.DateOffset(months=3)]
df3 = df3.groupby(['artist_id', 'region_id', 'concert_date']).median().rename(
    columns={'popularity':'popularity_median_last3month'})
df1 = df1.join(df3, on=['artist_id', 'region_id', 'concert_date'])

Input:

df1
   artist_id concert_date  region_id
0      12345   2019-10-01         22
1      33322   2018-11-01         44
2      12345   2019-12-01         22

df2
   artist_id       date  region_id  popularity
0      12345 2019-10-01         22          76
1      12345 2019-11-01         44          23
2      12345 2019-11-01         22          50
3      12345 2019-08-01         22          68

Output:

   artist_id concert_date  region_id  popularity_median_last3month
0      12345   2019-10-01         22                          68.0
1      33322   2018-11-01         44                           NaN
2      12345   2019-12-01         22                          63.0

Explanation:

Use merge() to create a dataframe with one row per artist_id, region_id tuple
Filter this to contain only rows where the date (corresponding to the popularity data point) from df2 is within 3 months of concert_date
Use groupby() to get the median popularity for each artist_id, region_id tuple, and rename this column as popularity_median_last3month
Use join() to add the popularity_median_last3month column to df1.