Home > Enterprise >  GroupBy transform median with date filter pandas
GroupBy transform median with date filter pandas

Time:07-12

I have 2 dataframes:

df1:

artist_id concert_date region_id
12345 2019-10 22
33322 2018-11 44

df2:

artist_id date region_id popularity
12345 2019-10 22 76
12345 2019-11 44 23

I need to add the median of the artist's popularity (which needs to be calculated only for the last 3 months before the concert and only for the same region) from the second table to the first.

That is, the first table should look like (figures are invented, the point is not in them now): df1:

artist_id concert_date region_id popularity_median_last3month
12345 2019-10 22 55
33322 2018-11 44 44

Right now I'm using the following loop:

df1['popularity_median_last3month'] = pd.Series(dtype='int')

for i in range(len(df1)):
    df1['popularity_median_last3month'].values[i] = df2[(df2.artist_id==df1.artist_id.values[i])&(df2.region_id==df1.region_id.values[i])&(df2.date<=df1.concert_date.values[i])][-3:].popularity.median()

however, it takes too long with a large amount of data.

Please tell me how to avoid the loop

CodePudding user response:

Here's a way to do this without a python loop:

df3 = df1.merge(df2, on=['artist_id', 'region_id'])
df3 = df3[df3.date >= df3.concert_date - pd.DateOffset(months=3)]
df3 = df3.groupby(['artist_id', 'region_id', 'concert_date']).median().rename(
    columns={'popularity':'popularity_median_last3month'})
df1 = df1.join(df3, on=['artist_id', 'region_id', 'concert_date'])

Input:

df1
   artist_id concert_date  region_id
0      12345   2019-10-01         22
1      33322   2018-11-01         44
2      12345   2019-12-01         22

df2
   artist_id       date  region_id  popularity
0      12345 2019-10-01         22          76
1      12345 2019-11-01         44          23
2      12345 2019-11-01         22          50
3      12345 2019-08-01         22          68

Output:

   artist_id concert_date  region_id  popularity_median_last3month
0      12345   2019-10-01         22                          68.0
1      33322   2018-11-01         44                           NaN
2      12345   2019-12-01         22                          63.0

Explanation:

  • Use merge() to create a dataframe with one row per artist_id, region_id tuple
  • Filter this to contain only rows where the date (corresponding to the popularity data point) from df2 is within 3 months of concert_date
  • Use groupby() to get the median popularity for each artist_id, region_id tuple, and rename this column as popularity_median_last3month
  • Use join() to add the popularity_median_last3month column to df1.
  • Related