Home > Software design >  Creating a time period of top values from a python dataframe
Creating a time period of top values from a python dataframe

Time:12-06

I have this dataframe:

Date Name Sum
2022-12-01 James 10
2022-12-01 Tom 4
2022-12-03 James 5
2022-12-04 Adam 8

where i want to group the top names by their sum during a 7 day period (every week from Monday to Sunday)

Expected output:

enter image description here

CodePudding user response:

To group the names in your dataframe by the sum during a 7 day period, you can first create a new column that contains the week number for each row, using the isocalendar method from the datetime module. The isocalendar method returns a tuple containing the year, week number, and day of the week for a given date. You can use the week number from this tuple as a grouping key for your dataframe.

import pandas as pd
from datetime import datetime

# create a dataframe with the sample data
df = pd.DataFrame({'Date': ['2022-12-01', '2022-12-01', '2022-12-03', '2022-12-04'],
                   'Name': ['James', 'Tom', 'James', 'Adam'],
                   'Sum': [10, 4, 5, 8]})

# convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# create a new column with the week number for each row
df['Week'] = df['Date'].apply(lambda x: x.isocalendar()[1])

# group the dataframe by the Week and Name columns
grouped = df.groupby(['Week', 'Name'])

# calculate the sum for each group and sort the resulting dataframe by the Sum column
result = grouped['Sum'].sum().sort_values(ascending=False)

the result variable will contain a dataframe with the names grouped by the sum during a 7 day period. You can access the top names by calling the head method on the dataframe, like this:

top_names = result.head()

This will return the top names, sorted by their sum during the 7 day period.

CodePudding user response:

You could try the following (df your dataframe, and column Date with datatimes):

cal = df["Date"].dt.isocalendar()
result = (
    df
    .drop(columns="Date")
    .assign(Week=cal.year.astype("str")   "-"   cal.week.astype("str").str.rjust(2, "0"))
    .groupby(["Week", "Name"], as_index=False).sum()
    .sort_values("Sum", ascending=False)
    .assign(Top=lambda df: df.groupby("Week").transform("cumcount")   1)
    .pivot(index="Top", columns="Week")
    .reorder_levels([1, 0], axis=1).sort_index(level=0, axis=1)
)

The first part is similiar to what @AltunE is proposing, but their answer is missing the second part.

Result for a sample dataframe

from random import choices
from string import ascii_uppercase as names

days = pd.date_range("2022-01-01", "2022-12-31", freq="d")
df = pd.DataFrame({
    "Date": sorted(choices(days, k=1_000)),
    "Name": choices(names, k=1_000),
    "Sum": choices(range(101), k=1_000)
})

looks like:

Week 2021-52       2022-01        2022-02  ... 2022-50 2022-51        2022-52       
        Name   Sum    Name    Sum    Name  ...     Sum    Name    Sum    Name    Sum
Top                                        ...                                      
1          L  61.0       U  166.0       T  ...   159.0       O  174.0       X  208.0
2          D  45.0       D   63.0       U  ...   157.0       E  124.0       E  146.0
3          S  43.0       K   61.0       M  ...   116.0       V  108.0       W   89.0
4          T  41.0       W   51.0       E  ...    94.0       C   97.0       Y   82.0
5          Z  35.0       B   45.0       Y  ...    80.0       J   83.0       D   53.0
6        NaN   NaN       V   43.0       H  ...    63.0       F   72.0       U   52.0
7        NaN   NaN       A   35.0       D  ...    51.0       Q   70.0       S   27.0
8        NaN   NaN       C   34.0       S  ...    48.0       S   67.0       A   26.0
9        NaN   NaN       H   19.0       W  ...    43.0       W   64.0       B    5.0
10       NaN   NaN       T    9.0       G  ...    40.0       R   55.0     NaN    NaN
11       NaN   NaN       R    0.0       F  ...    19.0       D   43.0     NaN    NaN
12       NaN   NaN     NaN    NaN       L  ...    18.0       Z   40.0     NaN    NaN
13       NaN   NaN     NaN    NaN     NaN  ...     6.0       N    1.0     NaN    NaN
14       NaN   NaN     NaN    NaN     NaN  ...     5.0     NaN    NaN     NaN    NaN
15       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN
16       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN
17       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN
18       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN
19       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN
20       NaN   NaN     NaN    NaN     NaN  ...     NaN     NaN    NaN     NaN    NaN

If you want the Week labels to be more like you've shown you could do something like this

start = df["Date"] - pd.to_timedelta(df["Date"].dt.weekday, unit="D")
end = start   pd.Timedelta(days=7)
week = "Period "   start.dt.strftime("%Y/%m/%d")   " to "   end.dt.strftime("%Y/%m/%d")

and then adjust the code above with

    ...
    .assign(Week=week)
    ...
  • Related