I'm trying to build a program to generate "differences reports" for users in my business group.
So I have constructed data frames out of prior version and current version of files we receive. The df's look like
Prior File:
Item Code | Price |
---|---|
AAA | 12.9 |
BBB | 8 |
DDD | 3.2 |
New file:
Item Code | Price |
---|---|
AAA | 14.5 |
BBB | 8 |
CCC | 5 |
DDD | 3.2 |
I want to compare these 2 DF and output a new DF that looks like this:
Old Item Code | New Item Code | Item Code Diff | Old Price | New Price | Price Diff |
---|---|---|---|---|---|
AAA | AAA | FALSE | 12.9 | 14.5 | TRUE |
BBB | BBB | FALSE | 8 | 8 | FALSE |
CCC | TRUE | 5 | TRUE | ||
DDD | DDD | FALSE | 3.2 | 3.2 | FALSE |
So I am showing item code CCC is new, Item Code AAA had a price change and item codes BBB DDD are unchanged.
Currently I have DFs that match prior_file and new_file and am comparing them like this
import pandas as pd
import numpy as np
df_old_file = pd.read_excel(old_excel_file)
df_new_file = pd.read_excel(new_excel_file)
diff_cols = df_new_file.columns.values
for i in diff cols:
df_new_file[i, 'diff'] = np.where(df_new_file[i] == df_old_file[i], 'True', 'False')
print(df_new_file.head())
I don't think this is working correctly, as any new code like CCC will cause every subsequent record to show as a diff.
Thank you!
Edit: I am now using a df.join and getting closer
df_results = df_old_file.join(df_new_file.set_index(item code), how = 'outer', on='item code', lsuffix='_old', rsuffix='_new')
print(df_results.head())
gives
Item Code | Price_old | Price_New |
---|---|---|
AAA | 12.9 | 14.5 |
Now I just need to create the difference 'true', 'false' columns
CodePudding user response:
I would suggest using this kind of architecture to store the data and compare it
class singleData():
def __init__(self, name, value):
self.name = name
self.value = value
class datas():
def __init__(self):
self.listOfData = []
def add(self, newData): # take a data object
self.listOfData.Append(newData)
def nameExist(self, name):
for data in listOfData:
if data.name == name:
return True
return False
def getValue(self, name):
for data in listOfData:
if data.name == name:
return data.value
return;
using it like this:
myDatas = datas()
# here I manually add elements but do it using panda
myDatas.add(singleData("AAA", 12.9))
#when comparing with new elements, just do a nameExist check and then get value using getValue
AAAExist = myDatas.nameExist("AAA")
if AAAExist:
AAAOldValue = myDatas.getValue("AAA")
Hope it helps
CodePudding user response:
So I wanted to share the answer I came up with today:
import pandas as pd
import numpy as np
#read the files into DFs
df_old_file = pd.read_excel(old_excel_file)
df_new_file = pd.read_excel(new_excel_file)
#Merge DFs
df_results = pd.merge(df_old_file, df_new_file,
how="outer", left_on="ITEM CODE", right_on="ITEM CODE", suffixes=('_old','_new')
#list of cols
diff_cols = df_new_file.columns.values
#remove key from comparison
diff_cols.remove('ITEM CODE')
#find if diff and create diff indicator column
for i in diff_cols:
df_results[i,'diff'] = np.where(df_results[z '_old'] ==
np.where(df_results[z '_new'], '', 'diff')
print(df_results)
This gives
ITEM CODE | PRICE_OLD | PRICE_NEW | ('PRICE', 'diff') |
---|---|---|---|
aa | 12.9 | 14.5 | diff |
bb | 8 | 8 |
It's also extensible to a worksheet with any N number of columns.