Home > Enterprise >  I've been trying to compare 2 columns with similar entries from 2 different excel files
I've been trying to compare 2 columns with similar entries from 2 different excel files

Time:09-11

I'm new to pandas and I'm trying to check entries from 2 different excel sheets.

Df1-

          Date   Value Date       Narration  Cod  Debit     Credit  Balance
0  02-Apr-2021  02-Apr-2021  BY XYZ company  TRF      0    1112.00 -3244213
1  02-Apr-2022  02-Apr-2022  BY XYZ company  CLR   2424       0.00 -3244212
2  02-Apr-2023  02-Apr-2023  BY XYZ company  TRF      0    9894.00 -3244211
3  02-Apr-2024  02-Apr-2024  BY XYZ company  TRF  32234  130000.03 -3244210
4  02-Apr-2025  02-Apr-2025  BY XYZ company  TRF      0   45435.00 -3244209
5  02-Apr-2026  02-Apr-2026  BY XYZ company  CLR    983       0.00 -3244208

Df2 -

           Date Unnamed: 1       Company Name Vch Type  Debit     Credit
0 2021-04-01   TEAM XYZ  QWERTY123 (21-23)  Receipt      0   45435.00
1 2021-04-02   TEAM XYZ  QWERTY123 (21-24)  Payment  32234       0.00
2 2021-04-03   TEAM XYZ  QWERTY123 (21-25)  Receipt      0    9894.00
3 2021-04-04   TEAM XYZ  QWERTY123 (21-26)  Payment   2424  130000.03
4 2021-04-05   TEAM XYZ  QWERTY123 (21-27)  Receipt      0    1112.00

Here's what I'm trying to do. Both files have similar entries and I need to get the value that's not in either file. For ex: the value 983 (df1) has no match in df2. I want 983 in the output. Also,there can be n entries of the same value , for ex : there can be n entries of the value 9894.00 in debit column of df1 and it has to find the same value in df2. If there are 10 entries of 9894.00 in file 1 there has to be 10 entries of 9894.00 in df2 as well.

Here's what I've tried so far :

import pandas as pd
df1 = pd.read_excel(file1)
df2 = pd.read_excel(file2)
report = pd.merge(df1,df2)
print(report)

But this doesn't really get the expected output.

CodePudding user response:

If your requirement was just to find out unmatching entries, then the solution was very easy (thanks to Pandas Merging 101):

df1[["debit"]].merge(df2[["debit"]], on=["debit"], how="outer", indicator=True) \
     .query('_merge != "both"') \
     .drop("_merge", 1)

But you've the requirement to match repeated entries and report unmatched repeated entries, too. So, you need to adopt some work around. There are many ways to achieve this. Taking reference from my answer to a previous post, I'm proposing the following way:

PS - I've shown the processing for "debit" column. You can repeat it for "credit" column. As I understand from the question, processing of these two columns is not related.

Assume dataset:

df1 = pd.DataFrame(data=[(34, 56), (34, 21), (34, ), (453, 22)], columns=["debit", "credit"])
df2 = pd.DataFrame(data=[(34, 56), (453, 21), (453, 23)], columns=["debit", "credit"])

Expected output: Two unmached 34 and one 453 (under "debit" column).

Group by debit to get count of repeated values:

df1_grp = df1.groupby("debit").agg(cnt=("debit", len))
>>        cnt
>> debit     
>> 34       3
>> 453      1

df2_grp = df2.groupby("debit").agg(cnt=("debit", len))
>>        cnt
>> debit     
>> 34       1
>> 453      2

Create serial numbers for each repeated value:

df1_grp["serial_no"] = df1_grp.apply(lambda row: [i for i in range(1, row["cnt"] 1)], axis=1)
>>        cnt  serial_no
>> debit                
>> 34       3  [1, 2, 3]
>> 453      1        [1]

df2_grp["serial_no"] = df2_grp.apply(lambda row: [i for i in range(1, row["cnt"] 1)], axis=1)
>>        cnt serial_no
>> debit               
>> 34       1       [1]
>> 453      2    [1, 2]

Move serial numbers to individual rows:

df1_expl = df1_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>>    debit serial_no
>> 0     34         1
>> 1     34         2
>> 2     34         3
>> 3    453         1

df2_expl = df2_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>>    debit serial_no
>> 0     34         1
>> 1    453         1
>> 2    453         2

Result: Join the two dataframes by debit value and serial number. The result is the unmatched entries, including repeated entries:

df_result = df1_expl.merge(df2_expl, on=["debit", "serial_no"], how="outer", indicator=True) \
     .query('_merge != "both"') \
     .drop("_merge", 1)

>>    debit serial_no
>> 1     34         2
>> 2     34         3
>> 4    453         2
  • Related