I'd like to group a pandas DataFrame on two different columns along two different axes, but I'm struggling to figure it out. Sample code:
import numpy as np
import pandas as pd
x = pd.date_range("2022-01-01", "2022-06-01", freq="D")
y = np.arange(0, x.shape[0])
z = np.random.choice(["Jack", "Jul", "John"], size=x.shape[0])
df = pd.DataFrame({"Date": x, "numbers": y, "names": z})
So far I have the following solution. I cannot use .resample
because then I lose all the names:
min_ = x.min()
max_ = x.max()
dt_range = pd.date_range(min_, max_, freq="W")
list_ = []
for date in dt_range:
    # .dt.week and Timestamp.week are deprecated in recent pandas;
    # use isocalendar().week instead
    temp_df = df[df["Date"].dt.isocalendar().week == date.isocalendar().week]
    # select the numeric column so sum() doesn't choke on the Date column
    temp_df = temp_df.groupby("names")[["numbers"]].sum()
    list_.append(temp_df)
pd.concat(list_, axis=1)
Sample output:
numbers numbers numbers numbers numbers numbers ... numbers numbers numbers numbers numbers numbers
names ...
Jack 0.0 7 36.0 39 53 99 ... 113 237 247 260 416 NaN
John 1.0 16 48.0 54 78 68 ... 436 233 250 262 139 726.0
Jul NaN 12 NaN 40 51 64 ... 221 349 371 395 411 289.0
CodePudding user response:
You can use df.pivot
to get this. (I have added a groupby first, following comments saying pivot alone raises an error on duplicate entries.) Using the below:
df_out = (df.groupby(['names', 'Date'], as_index=False).sum()
            .pivot(index='names', columns='Date', values='numbers'))
However, this will output with Date as the column names, rather than 'numbers' as in your question:
Date 2022-01-01 2022-01-02 2022-01-03 ... 2022-05-30 2022-05-31 2022-06-01
names ...
Jack NaN NaN NaN ... NaN NaN NaN
John 0.0 1.0 2.0 ... 149.0 NaN NaN
Jul NaN NaN NaN ... NaN 150.0 151.0
(Note: not an exact match to the output in the question, due to the random data in the df.)
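As an aside, a sketch of an alternative (assuming your real data looks like the sample): df.pivot_table aggregates duplicate (names, Date) pairs itself via aggfunc, so the separate groupby step isn't needed:

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample frame (seeded so the result is reproducible)
np.random.seed(0)
x = pd.date_range("2022-01-01", "2022-06-01", freq="D")
df = pd.DataFrame({
    "Date": x,
    "numbers": np.arange(x.shape[0]),
    "names": np.random.choice(["Jack", "Jul", "John"], size=x.shape[0]),
})

# pivot_table sums duplicates on its own (aggfunc="sum"),
# folding the groupby + pivot into a single call
df_out = df.pivot_table(index="names", columns="Date",
                        values="numbers", aggfunc="sum")
```

The column-renaming fix below applies to this output just the same, since the columns are still the Date values.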
To correct this, you can just set all the columns to be 'numbers' using the below:
df_out.columns = ['numbers']*len(df_out.columns)
numbers numbers numbers numbers ... numbers numbers numbers numbers
names ...
Jack NaN NaN NaN 3.0 ... NaN NaN NaN NaN
John 0.0 1.0 2.0 NaN ... 148.0 149.0 NaN NaN
Jul NaN NaN NaN NaN ... NaN NaN 150.0 151.0
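If you want the weekly aggregation from the original loop rather than per-day columns, a sketch using pd.Grouper (assuming the sample data above): it bins Date into weeks inside a single groupby, so the names are kept without any manual looping:

```python
import numpy as np
import pandas as pd

# Rebuild the question's sample frame (seeded for reproducibility)
np.random.seed(0)
x = pd.date_range("2022-01-01", "2022-06-01", freq="D")
df = pd.DataFrame({
    "Date": x,
    "numbers": np.arange(x.shape[0]),
    "names": np.random.choice(["Jack", "Jul", "John"], size=x.shape[0]),
})

# Group on names and weekly Date bins at once, then unstack the weeks
# into columns: one row per name, one column per week
weekly = (df.groupby(["names", pd.Grouper(key="Date", freq="W")])["numbers"]
            .sum()
            .unstack("Date"))
```

This replaces the whole week-by-week loop and concat from the question with one chained expression; the NaNs mark weeks where a name never appeared.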