I created fictional Data Frames that comprehends my doubt.
I have two excel spreadsheets that I want to compare, containing the data as follows:
DataFrame to compare
Product ID | Color | Brand |
---|---|---|
A | Red | Nike |
A | Red | Nike |
A | Red | Addidas |
A | Purple | Nike |
A | Red | Nike |
B | Yellow | CK |
B | Yellow | CK |
B | Yellow | CK |
B | Yellow | CK |
B | Yellow | CK |
C | Green | Puma |
C | Green | Puma |
C | Green | Puma |
C | Green | Nike |
C | Green | Puma |
Reference DataFrame
Product ID | Color | Brand |
---|---|---|
A | Red | Nike |
B | Yellow | CK |
C | Green | Puma |
Expected Output
Product ID | Color | Brand | Validation |
---|---|---|---|
A | Red | Nike | Ok |
A | Red | Nike | Ok |
A | Red | Addidas | Not Ok |
A | Purple | Nike | Not Ok |
A | Red | Nike | Ok |
B | Yellow | CK | Ok |
B | Yellow | CK | Ok |
B | Yellow | CK | Ok |
B | Yellow | CK | Ok |
B | Yellow | CK | Ok |
C | Green | Puma | Ok |
C | Green | Puma | Ok |
C | Green | Puma | Ok |
C | Green | Nike | Not Ok |
C | Green | Puma | Ok |
Packages I am using for previous analysis: pandas, openpyxl
1.
My question here is: How to compare files with different number of rows? I have five samples of A, that should be compared with one unique row of the reference file. I couldn't write an iteration because the counter will be different for each file. I was trying this for comparing the color column first:
for i in range(2, ref.max_row 1):
cell_obj_1 = ref.cell(row=i, column=1)
for n range(2, Compare_file.max_row 1):
cell_obj_2 = Compare_file.cell(row=j, column=1)
while(cell_obj_1.value == cell_obj_2.value):
if(Compare_file.cell(row=J 1, column=2).value==(ref.cell(row=I 1, column=2)).value):
print('Ok')
else:
print('Not Ok')
ref
= Reference DataFrame
Compare_file
= DataFrame to Compare
I just printed for now but my interest is to create a new column in Compare file.
CodePudding user response:
As I understand your problem, you have a reference dataframe which contains correct product information and you want to compare this reference data against a second dataframe. In the second dataframe you want to insert a new column which indicates the result of this evaluation.
So Given the reference dataframe df_ref as shown:
Product Color Brand
0 A Red Nike
1 B Yellow CK
2 C Green Puma
You want to compare the reference data against the dataframe df given below:
Product Color Brand
0 A Red Nike
1 A Red Nike
2 A Red Addidas
3 A Purple Nike
4 A Red Nike
5 B Yellow CK
6 B Yellow CK
7 B Yellow CK
8 B Yellow CK
9 B Yellow CK
10 C Green Puma
11 C Green Puma
12 C Green Puma
13 C Green Nike
14 C Green Puma
create a comparfunction:
def compareFrames(dx: pd.DataFrame, dref: pd.DataFrame) ->list:
rslt = []
print(dx.shape[0])
for i in range(dx.shape[0]):
dx_data = dx.iloc[i].values
dref_data = dref[dref['Product'] == dx_data[0]].values[0]
if dref_data[1] ==dx_data[1] and dref_data[2] == dx_data[2]:
rslt.append('OK')
else:
rslt.append('Not OK')
return rslt
Which when used as follows:
df['Check'] = compareFrames(df, df_ref)
Yields:
Product Color Brand Check
0 A Red Nike OK
1 A Red Nike OK
2 A Red Addidas Not OK
3 A Purple Nike Not OK
4 A Red Nike OK
5 B Yellow CK OK
6 B Yellow CK OK
7 B Yellow CK OK
8 B Yellow CK OK
9 B Yellow CK OK
10 C Green Puma OK
11 C Green Puma OK
12 C Green Puma OK
13 C Green Nike Not OK
14 C Green Puma OK