Comparing a file with a reference file with Python-CodePudding

I created fictional Data Frames that comprehends my doubt.

I have two excel spreadsheets that I want to compare, containing the data as follows:

DataFrame to compare

Product ID	Color	Brand
A	Red	Nike
A	Red	Nike
A	Red	Addidas
A	Purple	Nike
A	Red	Nike
B	Yellow	CK
B	Yellow	CK
B	Yellow	CK
B	Yellow	CK
B	Yellow	CK
C	Green	Puma
C	Green	Puma
C	Green	Puma
C	Green	Nike
C	Green	Puma

Reference DataFrame

Product ID	Color	Brand
A	Red	Nike
B	Yellow	CK
C	Green	Puma

Expected Output

Product ID	Color	Brand	Validation
A	Red	Nike	Ok
A	Red	Nike	Ok
A	Red	Addidas	Not Ok
A	Purple	Nike	Not Ok
A	Red	Nike	Ok
B	Yellow	CK	Ok
B	Yellow	CK	Ok
B	Yellow	CK	Ok
B	Yellow	CK	Ok
B	Yellow	CK	Ok
C	Green	Puma	Ok
C	Green	Puma	Ok
C	Green	Puma	Ok
C	Green	Nike	Not Ok
C	Green	Puma	Ok

Packages I am using for previous analysis: pandas, openpyxl

My question here is: How to compare files with different number of rows? I have five samples of A, that should be compared with one unique row of the reference file. I couldn't write an iteration because the counter will be different for each file. I was trying this for comparing the color column first:

for i in range(2, ref.max_row 1):  
   cell_obj_1 = ref.cell(row=i, column=1)
   for n range(2, Compare_file.max_row 1):
     cell_obj_2 = Compare_file.cell(row=j, column=1)
        while(cell_obj_1.value == cell_obj_2.value):
            if(Compare_file.cell(row=J 1, column=2).value==(ref.cell(row=I 1, column=2)).value):
                print('Ok')
            else:
                print('Not Ok')

ref = Reference DataFrame

Compare_file = DataFrame to Compare

I just printed for now but my interest is to create a new column in Compare file.

CodePudding user response：

As I understand your problem, you have a reference dataframe which contains correct product information and you want to compare this reference data against a second dataframe. In the second dataframe you want to insert a new column which indicates the result of this evaluation.
So Given the reference dataframe df_ref as shown:

Product Color   Brand
0   A   Red Nike
1   B   Yellow  CK
2   C   Green   Puma

You want to compare the reference data against the dataframe df given below:

Product Color   Brand
0   A   Red Nike
1   A   Red Nike
2   A   Red Addidas
3   A   Purple  Nike
4   A   Red Nike
5   B   Yellow  CK
6   B   Yellow  CK
7   B   Yellow  CK
8   B   Yellow  CK
9   B   Yellow  CK
10  C   Green   Puma
11  C   Green   Puma
12  C   Green   Puma
13  C   Green   Nike
14  C   Green   Puma

create a comparfunction:

def compareFrames(dx: pd.DataFrame, dref: pd.DataFrame) ->list:
    rslt = []
    print(dx.shape[0])
    for i in range(dx.shape[0]):
        dx_data = dx.iloc[i].values
        dref_data = dref[dref['Product'] == dx_data[0]].values[0]
        if dref_data[1] ==dx_data[1] and dref_data[2] == dx_data[2]:
            rslt.append('OK')
        else:
            rslt.append('Not OK')
    return rslt

Which when used as follows:

df['Check'] = compareFrames(df, df_ref)

Yields:

    Product Color   Brand   Check
0   A   Red Nike    OK
1   A   Red Nike    OK
2   A   Red Addidas Not OK
3   A   Purple  Nike    Not OK
4   A   Red Nike    OK
5   B   Yellow  CK  OK
6   B   Yellow  CK  OK
7   B   Yellow  CK  OK
8   B   Yellow  CK  OK
9   B   Yellow  CK  OK
10  C   Green   Puma    OK
11  C   Green   Puma    OK
12  C   Green   Puma    OK
13  C   Green   Nike    Not OK
14  C   Green   Puma    OK