I have 2 Data Frames which needs to be compared iteratively and mismatch rows has to be stored in a csv. Since it has historical dates, need to perform comparison based on year. How can this be achieve in Pandas
product_1 price_1 Date of purchase
0 computer 1200 2022-01-02
1 monitor 800 2022-01-03
2 printer 200 2022-01-04
3 desk 350 2022-01-05
product_2 price_2 Date of purchase
0 computer 900 2022-01-02
1 monitor 800 2022-01-03
2 printer 300 2022-01-04
3 desk 350 2022-01-05
CodePudding user response:
I would use a split/merge/where
df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])
From there you can merge the two columns using a join or merge
After that you can use an np.where()
merge_df['Check'] = np.where(merge_df['comp_column'] != merge_df['another_comp_column'])
From there you can just look for where the comp columns didn't match
merge_df.loc[merge_df['Check'] == False]
CodePudding user response:
First, let's solve the problem for any group of dates/years. First, you could merge your data using the date and product names:
df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`
After that, you could simply use .loc
to find the rows where prices do not match:
df.loc[df["price_1"] != df["price_2"]])
product_1 price_1 Date of purchase product_2 price_2
0 computer 1200 2022-01-02 computer 900
2 printer 200 2022-01-04 printer 300
Now, you could process each year by iterating a list of years, querying only the data from that year on df1
and df2
and then using the above procedure to find the price mismatches:
# List available years
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
# find data for this `year`
df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
# Apply the procedure described at the beginning
df = df1_year .merge(df2_year , on=["Date of purchase", "product"])
# Find rows where prices do no match
mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)
Output:
product price_1 Date of purchase price_2
0 computer 1200 2022-01-02 900
2 printer 200 2022-01-04 300