I'm new to pandas and I'm trying to check entries from 2 different excel sheets.
Df1-
Date Value Date Narration Cod Debit Credit Balance
0 02-Apr-2021 02-Apr-2021 BY XYZ company TRF 0 1112.00 -3244213
1 02-Apr-2022 02-Apr-2022 BY XYZ company CLR 2424 0.00 -3244212
2 02-Apr-2023 02-Apr-2023 BY XYZ company TRF 0 9894.00 -3244211
3 02-Apr-2024 02-Apr-2024 BY XYZ company TRF 32234 130000.03 -3244210
4 02-Apr-2025 02-Apr-2025 BY XYZ company TRF 0 45435.00 -3244209
5 02-Apr-2026 02-Apr-2026 BY XYZ company CLR 983 0.00 -3244208
Df2 -
Date Unnamed: 1 Company Name Vch Type Debit Credit
0 2021-04-01 TEAM XYZ QWERTY123 (21-23) Receipt 0 45435.00
1 2021-04-02 TEAM XYZ QWERTY123 (21-24) Payment 32234 0.00
2 2021-04-03 TEAM XYZ QWERTY123 (21-25) Receipt 0 9894.00
3 2021-04-04 TEAM XYZ QWERTY123 (21-26) Payment 2424 130000.03
4 2021-04-05 TEAM XYZ QWERTY123 (21-27) Receipt 0 1112.00
Here's what I'm trying to do. Both files have similar entries and I need to get the value that's not in either file. For ex: the value 983 (df1) has no match in df2. I want 983 in the output. Also,there can be n entries of the same value , for ex : there can be n entries of the value 9894.00 in debit column of df1 and it has to find the same value in df2. If there are 10 entries of 9894.00 in file 1 there has to be 10 entries of 9894.00 in df2 as well.
Here's what I've tried so far :
import pandas as pd
df1 = pd.read_excel(file1)
df2 = pd.read_excel(file2)
report = pd.merge(df1,df2)
print(report)
But this doesn't really get the expected output.
CodePudding user response:
If your requirement was just to find out unmatching entries, then the solution was very easy (thanks to Pandas Merging 101):
df1[["debit"]].merge(df2[["debit"]], on=["debit"], how="outer", indicator=True) \
.query('_merge != "both"') \
.drop("_merge", 1)
But you've the requirement to match repeated entries and report unmatched repeated entries, too. So, you need to adopt some work around. There are many ways to achieve this. Taking reference from my answer to a previous post, I'm proposing the following way:
PS - I've shown the processing for "debit" column. You can repeat it for "credit" column. As I understand from the question, processing of these two columns is not related.
Assume dataset:
df1 = pd.DataFrame(data=[(34, 56), (34, 21), (34, ), (453, 22)], columns=["debit", "credit"])
df2 = pd.DataFrame(data=[(34, 56), (453, 21), (453, 23)], columns=["debit", "credit"])
Expected output: Two unmached 34
and one 453
(under "debit" column).
Group by debit to get count of repeated values:
df1_grp = df1.groupby("debit").agg(cnt=("debit", len))
>> cnt
>> debit
>> 34 3
>> 453 1
df2_grp = df2.groupby("debit").agg(cnt=("debit", len))
>> cnt
>> debit
>> 34 1
>> 453 2
Create serial numbers for each repeated value:
df1_grp["serial_no"] = df1_grp.apply(lambda row: [i for i in range(1, row["cnt"] 1)], axis=1)
>> cnt serial_no
>> debit
>> 34 3 [1, 2, 3]
>> 453 1 [1]
df2_grp["serial_no"] = df2_grp.apply(lambda row: [i for i in range(1, row["cnt"] 1)], axis=1)
>> cnt serial_no
>> debit
>> 34 1 [1]
>> 453 2 [1, 2]
Move serial numbers to individual rows:
df1_expl = df1_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>> debit serial_no
>> 0 34 1
>> 1 34 2
>> 2 34 3
>> 3 453 1
df2_expl = df2_grp[["serial_no"]].explode("serial_no").dropna().reset_index()
>> debit serial_no
>> 0 34 1
>> 1 453 1
>> 2 453 2
Result: Join the two dataframes by debit value and serial number. The result is the unmatched entries, including repeated entries:
df_result = df1_expl.merge(df2_expl, on=["debit", "serial_no"], how="outer", indicator=True) \
.query('_merge != "both"') \
.drop("_merge", 1)
>> debit serial_no
>> 1 34 2
>> 2 34 3
>> 4 453 2