To make my code faster, I want to switch from Excel input to CSV input data. First, I create two df's that are exactly the same.
demand_data = pd.ExcelFile("Input Data\Historical Demand.xlsx")
FY20 = pd.read_excel(demand_data, 'Data FY20')
FY20b = pd.read_csv("Input Data\Historical Demand FY20.csv")
The resulting df's are: Based on Excel Based on CSV
Next, I want to group my df by some columns using pandas groupby and sum over some colums. I use the following code:
FY20 = FY20.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()
FY20b = FY20b.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()
This is the result: Result based on Excel DF Result based on CSV DF
This does not make any sense to me, since the two dataframes are exactly the same, but the result is not. How do I get the same groupby result from the CSV based dataframe?
CodePudding user response:
This probably because your numbers in the CSV are strings (e.g. 1,516.0
), you can notice that from the commas. You'll need to remove those and then convert the resulting strings to integer columns:
FY20b["OrderQuantity"] = FY20b["OrderQuantity"].apply(lambda x: x.replace(',', ''))
FY20b["OrderQuantity"] = pd.to_numeric(FY20b["OrderQuantity"])
FY20b["DeliveredQuantity"] = FY20b["DeliveredQuantity"].apply(lambda x: x.replace(',', ''))
FY20b["DeliveredQuantity"] = pd.to_numeric(FY20b["DeliveredQuantity"])
Then you can do:
FY20b = FY20b.groupby(['SKU', 'Material', 'Plant'])[["OrderQuantity","DeliveredQuantity"]].sum().reset_index()