Home > database >  Need to compare two Dataframes iteratively based on year using Pandas
Need to compare two Dataframes iteratively based on year using Pandas

Time:03-14

I have 2 Data Frames which needs to be compared iteratively and mismatch rows has to be stored in a csv. Since it has historical dates, need to perform comparison based on year. How can this be achieve in Pandas

 product_1  price_1   Date of purchase
0  computer     1200   2022-01-02
1   monitor      800   2022-01-03
2   printer      200   2022-01-04
3      desk      350   2022-01-05
  product_2  price_2   Date of purchase
0  computer      900   2022-01-02
1   monitor      800   2022-01-03
2   printer      300   2022-01-04
3      desk      350   2022-01-05

CodePudding user response:

I would use a split/merge/where

df1['Date of purchase'] = df1['Date of purchase'].apply(lambda x : x.split('-')[0])
df2['Date of purchase'] = df2['Date of purchase'].apply(lambda x : x.split('-')[0])

From there you can merge the two columns using a join or merge

After that you can use an np.where()

merge_df['Check'] = np.where(merge_df['comp_column']  != merge_df['another_comp_column'])

From there you can just look for where the comp columns didn't match

merge_df.loc[merge_df['Check'] == False]

CodePudding user response:

First, let's solve the problem for any group of dates/years. First, you could merge your data using the date and product names:

df = df1.merge(df2, left_on=["Date of purchase", "product_1"], right_on=["Date of purchase", "product_2"])
# Bonus points if you rename "product_2" and only use `on` instead of `left_on` and `right_on`

After that, you could simply use .loc to find the rows where prices do not match:

df.loc[df["price_1"] != df["price_2"]])
  product_1  price_1 Date of purchase product_2  price_2
0  computer     1200       2022-01-02  computer      900
2   printer      200       2022-01-04   printer      300

Now, you could process each year by iterating a list of years, querying only the data from that year on df1 and df2 and then using the above procedure to find the price mismatches:

# List available years
years = pd.concat([df1["Date of purchase"].dt.year, df2["Date of purchase"].dt.year], axis=0).unique()
# Rename columns for those bonus points
df1 = df1.rename(columns={"product_1": "product"})
df2 = df2.rename(columns={"product_2": "product"})
# Accumulate your rows in a new dataframe (starting from a list)
output_rows = list()
for year in years:
    # find data for this `year`
    df1_year = df1.loc[df1["Date of purchase"].dt.year == year]
    df2_year = df2.loc[df2["Date of purchase"].dt.year == year]
    # Apply the procedure described at the beginning
    df = df1_year .merge(df2_year , on=["Date of purchase", "product"])
    # Find rows where prices do no match
    mismatch_rows = df.loc[df["price_1"] != df["price_2"]]
    output_rows.append(mismatch_rows)
# Now, transform your rows into a single dataframe
output_df = pd.concat(output_rows)

Output:

    product  price_1 Date of purchase  price_2
0  computer     1200       2022-01-02      900
2   printer      200       2022-01-04      300
  • Related