Home > Net >  Groupby then sum doesn't work when running on large df
Groupby then sum doesn't work when running on large df

Time:12-01

I'm trying to tidy up a long and messy csv file a bit, but my method doesn't seem to work until I tried splitting the raw data into several files. Just wondering if anyone can see what goes wrong here?

The original file looks like this, except there are 600 rows:

Code   Item       Size    Location   Available
DD2    Cap Blue   S       NY         3
DD2    Cap Blue   S       NY         6
DD2    Cap Blue   S       CA         18
DD2    Cap Blue   S       PA         20
DD3    Cap Blue   L       CA         5
DA5    Tee Red    S       NY         1
DA7    Tee White  S       PA         203
DA7    Tee White  S       PA         204

I would like to turn it into:

Code   Item       Size    Location   Available
DD2    Cap Blue   S       NY         9
                          CA         18
                          PA         20
DD3    Cap Blue   L       CA         5
DA5    Tee Red    S       NY         1
DA7    Tee White  S       PA         407

, so that I can then use the pivot_table function to make it tidy.

The method I'm using is

df2 = df.groupby(['Code', 'Item', 'Size', 'Location'])['Available'].sum()
print(df2)

However, pandas merges the values in 'Available' as the numbers are plain text, i.e. the result looks like

Code   Item       Size    Location   Available
DD2    Cap Blue   S       NY         36
                          CA         18
                          PA         20
DD3    Cap Blue   L       CA         5
DA5    Tee Red    S       NY         1
DA7    Tee White  S       PA         203204

What I can't get my head around is, if I split the data, say I only take 20 rows out and run the command, it would work perfectly.

I'm very new to python and pandas, any help is appreciated. Thanks in advance.

CodePudding user response:

Change the data-type of the Available column, e.g. by:

df2["Available"] = df2["Available"].values.astype(float)
  • Related