Very new to python and pandas...but the issue is that my final output file isn't excluding any duplicates on the 'Customer Number'. Any suggestions on why this would be happening would be appreciated!
import numpy as np #numpy is the module wich can replace erros from huge datasets
from openpyxl import load_workbook
from openpyxl.styles import Font
df_1 = pd.read_excel('PRT Tracings 2020.xlsx', sheet_name='Export') #this is reading the excel document shifts and looks at sheet
df_2 = pd.read_excel('PRT Tracings 2021.xlsx', sheet_name='Export') #this reads the same excel document but looks at a different sheet
df_3 = pd.read_excel('PRT Tracings YTD 2022.xlsx', sheet_name='Export') #this reads a different excel file, and only has one sheet so no need to have it read a sheet
df_all = pd.concat([df_1, df_2, df_3], sort=False) #this combines the sheets from 1,2,3 and the sort function as false so our columns stay in the same order
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this excel file combines all three sheets into one spreadsheet
df_all = df_all.replace(np.nan, 'N/A', regex=True) #replaces errors with N/A
remove = ['ORDERNUMBER', 'ORDER_TYPE', 'ORDERDATE', 'Major Code Description', 'Product_Number_And_Desc', 'Qty', 'Order_$', 'Order_List_$'] #this will remove all unwanted columns
df_all.drop(columns=remove, inplace=True)
df_all.drop_duplicates(subset=['Customer Number'], keep=False) #this will remove all duplicates from the tracing number syntax with pandas module
to_excel = df_all.to_excel('Combined_PRT_Tracings.xlsx', index=None) #this excel file combines all three sheets into one spreadsheet
wb = load_workbook('Combined_PRT_Tracings.xlsx') #we are using this to have openpyxl read the data, from the spreadsheet already created
ws = wb.active #this workbook is active
wb.save('Combined_PRT_Tracings.xlsx') ```
CodePudding user response:
You should assign the return value of df_all.drop_duplicates
to a variable or set inplace=True
to have the DataFrame contents overwritten. This is to prevent undesired changes to the original data.
Try:
df_all = df_all.drop_duplicates(subset='Customer Number', keep=False)
Or the equivalent:
df_all.drop_duplicates(subset='Customer Number', keep=False, inplace=True)
That will remove all duplicate rows from the DataFrame. If you want to keep the first or last row which contains a duplicate, change keep
to first
or last
.