How to group data by multiple columns in pandas?-CodePudding

I have a CSV file of a tea exporting company that I have imported to Jupyter and the sample output of the file is like below.

Date        Type    Weight   Quantity       Price
2016-01-01  black   bags     1734136.51     1131.30
2016-01-01  black   bulk     10722389.66    510.86
2016-01-01  green   4g_1kg   6817078.01     588.72
2016-01-01  instant 1kg_3kg  86444.50       565.91
2016-01-01  black   3kg_5kg  1003986.73     552.39

I'm expecting to get a output grouped by each month at the first column and then type the next column and then weight categories in the next and total of each quantity in the next column. Just like this.

YearMonth    Type       Weight     Quantity       Price
201601       black      bags       17341          1131.30
                        1kg_3kg    107223         510.86
                        3kg_5kg    107223         510.86
             green      bulk       107223         510.86
                        1kg_3kg    107223         510.86
                        3kg_5kg    107223         510.86
             instant    4g_1kg     6817078.01     588.72
                        3kg_5kg    107223         510.86
                        3kg_5kg    107223         510.86
201602       black      1kg_3kg    86444.50       565.91

This is the code I have tried so far but I don't know how to group by the type and weight classes together.

# to get the date and convert it to YEARMONTH format
data['Date'] = pd.to_datetime(data['Date']).dt.date
data['YearMonth'] = data['Date'].map(lambda date: 100*date.year   date.month)

# calculate the total each month
data = data.groupby(['YearMonth'])['Quantity'].sum().reset_index()
data

It shoes me something like this

    YearMonth   Quantity
0   201601      23478578.08
1   201602      22988482.20
2   201603      25909619.24
3   201604      21687809.15
4   201605      20776964.81

Is there any way I can achieve this by grouping them all together?

CodePudding user response：

You can do it by the following:

1) Create your `DataFrame`, I used the dummy data you provided but I changed 2 months in order to see the result:

# Create a dictionary of data for the DataFrame
data = {'Date': ['2016-01-01', '2016-01-01', '2016-02-01', '2016-02-02', '2016-01-01'],
        'Type': ['black', 'black', 'green', 'instant', 'black'],
        'Weight': ['bags', 'bulk', '4g_1kg', '1kg_3kg', '3kg_5kg'],
        'Quantity': [1734136.51, 10722389.66, 6817078.01, 86444.50, 1003986.73],
        'Price': [1131.30, 510.86, 588.72, 565.91, 552.39]}
# Create the DataFrame
df = pd.DataFrame(data)

2) Convert your `DataFrame.index` to `datetime` using your Date column:

df.index = pd.to_datetime(df["Date"],format='%Y-%m-%d')

3) Group and aggregate your dataframe accordingly:

df = df.groupby(by=[df.index.year, df.index.month, 'Type', 'Weight']).agg({'Quantity':'sum', 'Price':'first'})

4) Renname your `index.names` for proper names:

df.index.names = ['Year', 'Month', 'Type', 'Weight']

Result will be as follows:

 ------ ------ ------- -------- ------------ ------- 
|  Year| Month|   Type|  Weight|    Quantity|  Price|
 ------ ------ ------- -------- -------------------- 
|  2016|     1|  black| 3kg_5kg|  1003986.73| 552.39|
|      |      |       |    bags|  1734136.51| 1131.3|
|      |      |       |    bulk| 10722389.66| 510.86|
|      |     2|  green|  4g_1kg|  6817078.01| 588.72|
|      |      |instant| 1kg_3kg|    86444.50| 565.91|
 ------ ------ ------ --------- ------------ -------

I see you used `reset_index` so you can use it here as well to get the following result:

df.reset_index()

 - ------ ------ ------- -------- ------------ ------- 
| |  Year| Month|   Type|  Weight|    Quantity|  Price|
 - ------ ------ ------- -------- -------------------- 
|0|  2016|     1|  black| 3kg_5kg|  1003986.73| 552.39|
|1|  2016|     1|  black|    bags|  1734136.51| 1131.3|
|2|  2016|     1|  black|    bulk| 10722389.66| 510.86|
|3|  2016|     2|  green|  4g_1kg|  6817078.01| 588.72|
|4|  2016|     2|instant| 1kg_3kg|    86444.50| 565.91|
 - ------ ------ ------ --------- ------------ -------

Note that if you don't want to keep the old index, you can use df.reset_index(drop=True)

1) Create your DataFrame, I used the dummy data you provided but I changed 2 months in order to see the result:

2) Convert your DataFrame.index to datetime using your Date column: