Home > Software engineering >  Sum and groupby if date is between two dates in two other columns and create new groupby data frame
Sum and groupby if date is between two dates in two other columns and create new groupby data frame

Time:06-02

I have the following data frame:

enter image description here

What I need is to sum the values of pageviews of each "Title" and create two new columns:

  1. PT 3: which is the number of views in the first 3 days, that means if the "date" column of the views is between the "Published Date" and "T 3" date.
  2. PT 30: which is the number of views in the first 30 days excluding the first 3 days, that means if the "date" is between T 3 and T 30.

So my final table will be like this: Post ID - Published Date - Title - Permalink - Categories - Author Name - Total Page Views (Which is the sum of pageviews without any filters) - Country - PT 3 - PT 30

thanks...

  Post ID Published Date                          Title  \
0   824821     2022-05-10  Tom Brady's net worth in 2022   
1   824821     2022-05-10  Tom Brady's net worth in 2022   
2   824821     2022-05-10  Tom Brady's net worth in 2022   

                                           Permalink  \
0  https://clutchpoints.com/tom-bradys-net-worth-...   
1  https://clutchpoints.com/tom-bradys-net-worth-...   
2  https://clutchpoints.com/tom-bradys-net-worth-...   

                                Categories  Author Name        T 3       T 30  \
0  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto 2022-05-13 2022-06-09   
1  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto 2022-05-13 2022-06-09   
2  Editorials|Evergreen|NFL|NFL Editorials  Greg Patuto 2022-05-13 2022-06-09   

     country  pageviews        date  
0  Australia         24  2022-05-26  
1      India         24  2022-05-24  
2      India         12  2022-05-26  

enter image description here

CodePudding user response:

IIUC, try:

  1. Check if the "date" of the view is within 3 (and 30) days from the published date.
  2. Aggregate the views where the above conditions are True.
  3. groupby and sum
df["Views3"] = df["date"].le(df["T 3"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")
df["Views30"] = df["date"].le(df["T 30"]).mul(df["pageviews"]).groupby(df["Title"]).transform("sum")

CodePudding user response:

Ok, so I doubt this is the optimal way to do it, but this is how I've solved a similar problem.

Note: You will have to convert the date columns to datetime type to do comparisons. This might solve the error from the other commenter

df['Published Date'] = pd.to_datetime(df['Published Date']).apply(lambda x: x.date())
df['date'] = pd.to_datetime(df['date']).apply(lambda x: x.date())

First, I created a dictionary for the output dataframe format:

aggregate_df = {'Post Id':[],'Published Date':[],'Title':[],'Permalink':[],'Categories':[],'Author Name':[],'Total Page Views':[],'PT 3':[],'PT 30':[]}

I then looped through every unique title in the title column, and filtered the dataframe for each title. I then appended each value to the output dictionary (it is .max() for most of them, but you could also use [0] for example, it doesn't matter which value you choose because they are identical - outside of total page views which you want the sum).

You can then further filter the temp df to only show the dates in the range you want to count, and append those sums to the output dictionary.

for title in df['Title'].unique():
    _df = df.loc[(df['Title'] == title)]
    
    aggregate_df['Post Id'].append(_df['Post_Id'].max())
    aggregate_df['Published Date'].append(_df['Published Date'].max())
    aggregate_df['Title'].append(_df['Title'].max())
    aggregate_df['Permalink'].append(_df['Permalink'].max())
    aggregate_df['Categories'].append(_df['Categories'].max())
    aggregate_df['Author Name'].append(_df['Author Name'].max())
    aggregate_df['Total Page Views'].append(_df['Page Views'].sum())
    
    start_period = _df['Published Date'].max()
    end_period = _df['Published Date'].max()   dt.timedelta(days=3)
    
    _df = df.loc[(df['Title'] == title) & (df['date'] >= start_period)& (df['date'] <= end_period)]
    aggregate_df['PT 3'].append(_df['Page Views'].sum())
    
    start_period = _df['Published Date'].max()   dt.timedelta(days=3)
    end_period = _df['Published Date'].max()   dt.timedelta(days=30)
    
    _df = df.loc[(df['Title'] == title) & (df['date'] >= start_period) & (df['date'] <= end_period)]
    aggregate_df['PT 30'].append(_df['Page Views'].sum())

aggregate_df = pd.DataFrame(aggregate_df)
  • Related