I have the following data frame:
What I need is to sum the values of pageviews of each "Title" and create two new columns:
- PT 3: which is the number of views in the first 3 days, that means if the "date" column of the views is between the "Published Date" and "T 3" date.
- PT 30: which is the number of views in the first 30 days excluding the first 3 days, that means if the "date" is between T 3 and T 30.
So my final table will be like this: Post ID - Published Date - Title - Permalink - Categories - Author Name - Total Page Views (Which is the sum of pageviews without any filters) - Country - PT 3 - PT 30
thanks...
Post ID Published Date Title \
0 824821 2022-05-10 Tom Brady's net worth in 2022
1 824821 2022-05-10 Tom Brady's net worth in 2022
2 824821 2022-05-10 Tom Brady's net worth in 2022
Permalink \
0 https://clutchpoints.com/tom-bradys-net-worth-...
1 https://clutchpoints.com/tom-bradys-net-worth-...
2 https://clutchpoints.com/tom-bradys-net-worth-...
Categories Author Name T 3 T 30 \
0 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
1 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
2 Editorials|Evergreen|NFL|NFL Editorials Greg Patuto 2022-05-13 2022-06-09
country pageviews date
0 Australia 24 2022-05-26
1 India 24 2022-05-24
2 India 12 2022-05-26
CodePudding user response:
IIUC, try:
- Check if the "date" of the view is within 3 (and 30) days from the published date.
- Aggregate the views where the above conditions are True.
groupby
andsum
df["Views3"] = df["date"].le(df["T 3"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
df["Views30"] = df["date"].le(df["T 30"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
CodePudding user response:
Ok, so I doubt this is the optimal way to do it, but this is how I've solved a similar problem.
Note: You will have to convert the date columns to datetime type to do comparisons. This might solve the error from the other commenter
df['Published Date'] = pd.to_datetime(df['Published Date']).apply(lambda x: x.date())
df['date'] = pd.to_datetime(df['date']).apply(lambda x: x.date())
First, I created a dictionary for the output dataframe format:
aggregate_df = {'Post Id':[],'Published Date':[],'Title':[],'Permalink':[],'Categories':[],'Author Name':[],'Total Page Views':[],'PT 3':[],'PT 30':[]}
I then looped through every unique title in the title column, and filtered the dataframe for each title. I then appended each value to the output dictionary (it is .max() for most of them, but you could also use [0] for example, it doesn't matter which value you choose because they are identical - outside of total page views which you want the sum).
You can then further filter the temp df to only show the dates in the range you want to count, and append those sums to the output dictionary.
for title in df['Title'].unique():
_df = df.loc[(df['Title'] == title)]
aggregate_df['Post Id'].append(_df['Post_Id'].max())
aggregate_df['Published Date'].append(_df['Published Date'].max())
aggregate_df['Title'].append(_df['Title'].max())
aggregate_df['Permalink'].append(_df['Permalink'].max())
aggregate_df['Categories'].append(_df['Categories'].max())
aggregate_df['Author Name'].append(_df['Author Name'].max())
aggregate_df['Total Page Views'].append(_df['Page Views'].sum())
start_period = _df['Published Date'].max()
end_period = _df['Published Date'].max() dt.timedelta(days=3)
_df = df.loc[(df['Title'] == title) & (df['date'] >= start_period)& (df['date'] <= end_period)]
aggregate_df['PT 3'].append(_df['Page Views'].sum())
start_period = _df['Published Date'].max() dt.timedelta(days=3)
end_period = _df['Published Date'].max() dt.timedelta(days=30)
_df = df.loc[(df['Title'] == title) & (df['date'] >= start_period) & (df['date'] <= end_period)]
aggregate_df['PT 30'].append(_df['Page Views'].sum())
aggregate_df = pd.DataFrame(aggregate_df)