Comparing two pandas dataframe cells, and if equal ==, copy other content over

I am importing excel files with products and product specific data. They look like this:

dfA

EAN Code    Product Name    Color    Price
12345       AAA             xxx      9
45678       BBB             zzz      10

and dfB

EAN Code    Product Name    New Price
12-345      AAA             10
45-678      BBB             11

I am importing these as always:

dfA = pd.DataFrame (dfA, columns=["Season", "EAN Code", "Product Name", "..."] , dtype=str)

And I am merging them, since there are multiple different excel files:

dfA = pd.concat([dfA1, dfA2, dfA3, dfA4, dfA5, dfA6, dfA7, dfA8, dfA9, dfA10, dfA11], ignore_index=False)

Then I am deleting the hyphen in the EAN column, because in excel file b there is an unnecessary hyphen in the EAN number.

for col in dfB.columns:
   dfB["EAN"] = dfB["EAN"].str.replace('-', '')

So far, so good.

Now I try to search through the EAN code column of dfA and search for the same product in dfB. When there is a match, I want to copy over the new price. This worked great in my old code, although it took about 15 minutes for the script to search through a half million rows... This worked quite well in the past, but now I am trying to achieve the same, but I get an error message. This is my simplified code:

for i in dfA.index:
   for j in dfB.index:
      if dfA.loc[i, "EAN"] == dfB.loc[j, "EAN"]:
         print ("EAN", number,"times found!")
         number=number 1
         dfA.loc[i, "Price"] = dfB.loc[j, "New Price"]

The print statement is only a feedback for me, so that I see wether the script is still doing something.

Traceback (most recent call last):
  File "c:...", line 72, in <module>
    if dfA.loc[i, "EAN"] == dfB.loc[j, "EAN"]:
  File "C:...", line 1527, in __nonzero__
    raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Why am I getting this now?

CodePudding user response：

Use a merge, after setting up a common EAN Code:

out = dfA.merge(
          dfB.assign(**{'EAN Code': dfB['EAN Code'].str.replace('-', '')
                                    .astype(int) # only if dfA has an int
                     }),
          on='EAN Code')

output:

   EAN Code Product Name_x Color  Price Product Name_y  New Price
0     12345            AAA   xxx      9            AAA         10
1     45678            BBB   zzz     10            BBB         11

CodePudding user response：

Your code is trying to test a condition somewhere, it's obviously in the if-part: if dfA.loc[i, "EAN"] == dfB.loc[j, "EAN"].

dfA.loc[i, "EAN"] happens to be a Series sometimes and/or dfB.loc[j, "EAN"] too.

So the result of ... == ... is a Series, but if expects a boolean. You need to reduce this Series to a boolean using the appropriate method (they propose empty, bool(), any(), etc.)

I assume that you expect dfA.loc[i, "EAN"] to give you a scalar and not a Series but it's not always the case (idem for dfB.loc[j, "EAN"]). So check your data. How are your i and js?

Remark

I recommend you to use df.at[i, col] rather than df.loc[i, col], when you know that you should get a scalar. It's faster and will send errors if you can't get a scalar, so it will be easier to debug.