Create a "differences" data frame-CodePudding

I'm trying to build a program to generate "differences reports" for users in my business group.

So I have constructed data frames out of prior version and current version of files we receive. The df's look like

Prior File:

Item Code	Price
AAA	12.9
BBB	8
DDD	3.2

New file:

Item Code	Price
AAA	14.5
BBB	8
CCC	5
DDD	3.2

I want to compare these 2 DF and output a new DF that looks like this:

Old Item Code	New Item Code	Item Code Diff	Old Price	New Price	Price Diff
AAA	AAA	FALSE	12.9	14.5	TRUE
BBB	BBB	FALSE	8	8	FALSE
	CCC	TRUE		5	TRUE
DDD	DDD	FALSE	3.2	3.2	FALSE

So I am showing item code CCC is new, Item Code AAA had a price change and item codes BBB DDD are unchanged.

Currently I have DFs that match prior_file and new_file and am comparing them like this

import pandas as pd
import numpy as np

df_old_file = pd.read_excel(old_excel_file)
df_new_file = pd.read_excel(new_excel_file)

diff_cols = df_new_file.columns.values

for i in diff cols:
    df_new_file[i, 'diff'] = np.where(df_new_file[i] == df_old_file[i], 'True', 'False')

print(df_new_file.head())

I don't think this is working correctly, as any new code like CCC will cause every subsequent record to show as a diff.

Thank you!

Edit: I am now using a df.join and getting closer

df_results = df_old_file.join(df_new_file.set_index(item code), how = 'outer', on='item code', lsuffix='_old', rsuffix='_new')

print(df_results.head())

gives

Item Code	Price_old	Price_New
AAA	12.9	14.5

Now I just need to create the difference 'true', 'false' columns

CodePudding user response：

I would suggest using this kind of architecture to store the data and compare it

class singleData():
    def __init__(self, name, value):
        self.name = name
        self.value = value

class datas():
    def __init__(self):
        self.listOfData = []

    def add(self, newData): # take a data object
        self.listOfData.Append(newData)

    def nameExist(self, name):
        for data in listOfData:
            if data.name == name:
                return True
        return False

    def getValue(self, name):
        for data in listOfData:
            if data.name == name:
                return data.value
        return;

using it like this:

myDatas = datas()

# here I manually add elements but do it using panda
myDatas.add(singleData("AAA", 12.9))

#when comparing with new elements, just do a nameExist check and then get value using getValue
AAAExist = myDatas.nameExist("AAA")
if AAAExist:
    AAAOldValue = myDatas.getValue("AAA")

Hope it helps

CodePudding user response：

So I wanted to share the answer I came up with today:

import pandas as pd
import numpy as np
#read the files into DFs
df_old_file = pd.read_excel(old_excel_file)
df_new_file = pd.read_excel(new_excel_file)
#Merge DFs
df_results = pd.merge(df_old_file, df_new_file, 
how="outer", left_on="ITEM CODE", right_on="ITEM CODE", suffixes=('_old','_new')

#list of cols
diff_cols = df_new_file.columns.values
#remove key from comparison
diff_cols.remove('ITEM CODE')

#find if diff and create diff indicator column
for i in diff_cols:
    df_results[i,'diff'] = np.where(df_results[z '_old'] ==
    np.where(df_results[z '_new'], '', 'diff')

print(df_results)

This gives

ITEM CODE	PRICE_OLD	PRICE_NEW	('PRICE', 'diff')
aa	12.9	14.5	diff
bb	8	8

It's also extensible to a worksheet with any N number of columns.