I have 2 dataframes:
df1:
artist_id | concert_date | region_id |
---|---|---|
12345 | 2019-10 | 22 |
33322 | 2018-11 | 44 |
df2:
artist_id | date | region_id | popularity |
---|---|---|---|
12345 | 2019-10 | 22 | 76 |
12345 | 2019-11 | 44 | 23 |
I need to add the median of the artist's popularity (which needs to be calculated only for the last 3 months before the concert and only for the same region) from the second table to the first.
That is, the first table should look like (figures are invented, the point is not in them now): df1:
artist_id | concert_date | region_id | popularity_median_last3month |
---|---|---|---|
12345 | 2019-10 | 22 | 55 |
33322 | 2018-11 | 44 | 44 |
Right now I'm using the following loop:
df1['popularity_median_last3month'] = pd.Series(dtype='int')
for i in range(len(df1)):
df1['popularity_median_last3month'].values[i] = df2[(df2.artist_id==df1.artist_id.values[i])&(df2.region_id==df1.region_id.values[i])&(df2.date<=df1.concert_date.values[i])][-3:].popularity.median()
however, it takes too long with a large amount of data.
Please tell me how to avoid the loop
CodePudding user response:
Here's a way to do this without a python loop:
df3 = df1.merge(df2, on=['artist_id', 'region_id'])
df3 = df3[df3.date >= df3.concert_date - pd.DateOffset(months=3)]
df3 = df3.groupby(['artist_id', 'region_id', 'concert_date']).median().rename(
columns={'popularity':'popularity_median_last3month'})
df1 = df1.join(df3, on=['artist_id', 'region_id', 'concert_date'])
Input:
df1
artist_id concert_date region_id
0 12345 2019-10-01 22
1 33322 2018-11-01 44
2 12345 2019-12-01 22
df2
artist_id date region_id popularity
0 12345 2019-10-01 22 76
1 12345 2019-11-01 44 23
2 12345 2019-11-01 22 50
3 12345 2019-08-01 22 68
Output:
artist_id concert_date region_id popularity_median_last3month
0 12345 2019-10-01 22 68.0
1 33322 2018-11-01 44 NaN
2 12345 2019-12-01 22 63.0
Explanation:
- Use
merge()
to create a dataframe with one row perartist_id, region_id
tuple - Filter this to contain only rows where the
date
(corresponding to thepopularity
data point) fromdf2
is within 3 months ofconcert_date
- Use
groupby()
to get the medianpopularity
for eachartist_id, region_id
tuple, and rename this column aspopularity_median_last3month
- Use
join()
to add thepopularity_median_last3month
column todf1
.