Home > Net >  Groupby two columns on on two axis
Groupby two columns on on two axis

Time:06-17

I'd like to groupby pandas dataframe on two different columns on two different axes, however, struggling to figure it out Sample code:

import numpy as np
import pandas as pd 

x = pd.date_range("2022-01-01", "2022-06-01", freq="D")
y = np.arange(0, x.shape[0])
z = np.random.choice(["Jack", "Jul", "John"], size=x.shape[0])

df = pd.DataFrame({"Date": x, "numbers": y, "names": z})

so far I have the following solution, I cannot use .resample because then I loose all the names:

min_ = x.min()
max_ = x.max()
dt_range = pd.date_range(min_, max_, freq="W")
list_ = []
for date in dt_range:
  temp_df = df[df["Date"].dt.week == date.week]
  temp_df = temp_df.groupby("names").sum()
  list_.append(temp_df)

pd.concat(list_, axis=1)

Sample output:

       numbers  numbers  numbers  numbers  numbers  numbers  ...  numbers  numbers  numbers  numbers  numbers  numbers
names                                                        ...
Jack       0.0        7     36.0       39       53       99  ...      113      237      247      260      416      NaN
John       1.0       16     48.0       54       78       68  ...      436      233      250      262      139    726.0
Jul        NaN       12      NaN       40       51       64  ...      221      349      371      395      411    289.0

CodePudding user response:

You can use df.pivot to get this (I have added in a group by following from comments saying pivot causes an error), using the below:

df_out = (df.groupby(['names', 'Date'], as_index=False).sum()
          .pivot(index='names', columns='Date', values='numbers'))

However this will output with Date as the column names, rather than 'numbers' as in your question:

Date   2022-01-01  2022-01-02  2022-01-03  ...  2022-05-30  2022-05-31  2022-06-01
names                                      ...                                    
Jack          NaN         NaN         NaN  ...         NaN         NaN         NaN
John          0.0         1.0         2.0  ...       149.0         NaN         NaN
Jul           NaN         NaN         NaN  ...         NaN       150.0       151.0

(Note: not an exact match the the output in the question due to the random data in the df in the question).

To correct this, you can just set all the columns to be 'numbers' using the below:

df_out.columns = ['numbers']*len(df_out.columns)

       numbers  numbers  numbers  numbers  ...  numbers  numbers  numbers  numbers
names                                      ...                                    
Jack       NaN      NaN      NaN      3.0  ...      NaN      NaN      NaN      NaN
John       0.0      1.0      2.0      NaN  ...    148.0    149.0      NaN      NaN
Jul        NaN      NaN      NaN      NaN  ...      NaN      NaN    150.0    151.0
  • Related