I have this dataframe:
Date | Name | Sum |
---|---|---|
2022-12-01 | James | 10 |
2022-12-01 | Tom | 4 |
2022-12-03 | James | 5 |
2022-12-04 | Adam | 8 |
where i want to group the top names by their sum during a 7 day period (every week from Monday to Sunday)
Expected output:
CodePudding user response:
To group the names in your dataframe by the sum during a 7 day period, you can first create a new column that contains the week number for each row, using the isocalendar
method from the datetime module. The isocalendar
method returns a tuple containing the year, week number, and day of the week for a given date. You can use the week number from this tuple as a grouping key for your dataframe.
import pandas as pd
from datetime import datetime
# create a dataframe with the sample data
df = pd.DataFrame({'Date': ['2022-12-01', '2022-12-01', '2022-12-03', '2022-12-04'],
'Name': ['James', 'Tom', 'James', 'Adam'],
'Sum': [10, 4, 5, 8]})
# convert the Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])
# create a new column with the week number for each row
df['Week'] = df['Date'].apply(lambda x: x.isocalendar()[1])
# group the dataframe by the Week and Name columns
grouped = df.groupby(['Week', 'Name'])
# calculate the sum for each group and sort the resulting dataframe by the Sum column
result = grouped['Sum'].sum().sort_values(ascending=False)
the result variable will contain a dataframe with the names grouped by the sum during a 7 day period. You can access the top names by calling the head method on the dataframe, like this:
top_names = result.head()
This will return the top names, sorted by their sum during the 7 day period.
CodePudding user response:
You could try the following (df
your dataframe, and column Date
with datatimes):
cal = df["Date"].dt.isocalendar()
result = (
df
.drop(columns="Date")
.assign(Week=cal.year.astype("str") "-" cal.week.astype("str").str.rjust(2, "0"))
.groupby(["Week", "Name"], as_index=False).sum()
.sort_values("Sum", ascending=False)
.assign(Top=lambda df: df.groupby("Week").transform("cumcount") 1)
.pivot(index="Top", columns="Week")
.reorder_levels([1, 0], axis=1).sort_index(level=0, axis=1)
)
The first part is similiar to what @AltunE is proposing, but their answer is missing the second part.
Result for a sample dataframe
from random import choices
from string import ascii_uppercase as names
days = pd.date_range("2022-01-01", "2022-12-31", freq="d")
df = pd.DataFrame({
"Date": sorted(choices(days, k=1_000)),
"Name": choices(names, k=1_000),
"Sum": choices(range(101), k=1_000)
})
looks like:
Week 2021-52 2022-01 2022-02 ... 2022-50 2022-51 2022-52
Name Sum Name Sum Name ... Sum Name Sum Name Sum
Top ...
1 L 61.0 U 166.0 T ... 159.0 O 174.0 X 208.0
2 D 45.0 D 63.0 U ... 157.0 E 124.0 E 146.0
3 S 43.0 K 61.0 M ... 116.0 V 108.0 W 89.0
4 T 41.0 W 51.0 E ... 94.0 C 97.0 Y 82.0
5 Z 35.0 B 45.0 Y ... 80.0 J 83.0 D 53.0
6 NaN NaN V 43.0 H ... 63.0 F 72.0 U 52.0
7 NaN NaN A 35.0 D ... 51.0 Q 70.0 S 27.0
8 NaN NaN C 34.0 S ... 48.0 S 67.0 A 26.0
9 NaN NaN H 19.0 W ... 43.0 W 64.0 B 5.0
10 NaN NaN T 9.0 G ... 40.0 R 55.0 NaN NaN
11 NaN NaN R 0.0 F ... 19.0 D 43.0 NaN NaN
12 NaN NaN NaN NaN L ... 18.0 Z 40.0 NaN NaN
13 NaN NaN NaN NaN NaN ... 6.0 N 1.0 NaN NaN
14 NaN NaN NaN NaN NaN ... 5.0 NaN NaN NaN NaN
15 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
16 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
17 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
18 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
19 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
20 NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN
If you want the Week
labels to be more like you've shown you could do something like this
start = df["Date"] - pd.to_timedelta(df["Date"].dt.weekday, unit="D")
end = start pd.Timedelta(days=7)
week = "Period " start.dt.strftime("%Y/%m/%d") " to " end.dt.strftime("%Y/%m/%d")
and then adjust the code above with
...
.assign(Week=week)
...