Home > Software engineering >  Iterating over dataframe rows with for loop - problem with .loc
Iterating over dataframe rows with for loop - problem with .loc

Time:06-23

I have the two following dataframes (df1 and df2).

df1:

  code           name          region
0  AFG    Afghanistan     Middle East
1  NLD    Netherlands  Western Europe
2  AUT        Austria  Western Europe
3  IRQ           Iraq     Middle East
4  USA  United States   North America
5  CAD         Canada   North America

df2:

   code  year  gdp per capita
0   AFG  2010          547.35
1   NLD  2010        44851.27
2   AUT  2010         3577.10
3   IRQ  2010         4052.06
4   USA  2010        52760.00
5   CAD  2010        41155.32
6   AFG  2015          578.47
7   NLD  2015        45175.23
8   AUT  2015         3952.80
9   IRQ  2015         4688.32
10  USA  2015        56863.37
11  CAD  2015        43635.10

Instead of merging the two dataframes, I would like to add the respective region from df1 as a new column to df2, using either iterrows() or a for loop.

When I call

for i in range(len(df2)):
  region = df1.loc[(df1["code"] == df2.loc[i, "code"]), "region"]
  df2.loc[i, "region"] = region

or

for index, row in df2.iterrows():
  region = df1.loc[df1["code"] == row["code"], "region"]
  df2.loc[index, "region"] = region

I get the error message "ValueError: Incompatible indexer with Series". I think it has to do with the last .loc call df2.loc[i, "region"] = region and df2.loc[index, "region"] = region, because I am using a number and a string at the same time. But I have also done that for calculating region in the for loop and there is no error message.

Your help would be appreciated.

CodePudding user response:

Because region is a Series (not a scalar, for example), pandas thinks you want to assign it by index to df2.loc[i, "region"] and this causes the error.

In fact, you are looking to get at the singleton item in the series and assign that. You can achieve this by doing the following after initializing region: region = region.to_list()[0].

Here is a modified version of your code which should work:

for i in range(len(df2)):
    region = (df1.loc[(df1["code"] == df2.loc[i, "code"]), "region"]).to_list()[0]
    df2.loc[i, "region"] = region

Output:

   code  year  gdp per capita          region
0   AFG  2010            99.0     Middle East
1   NLD  2010            99.0  Western Europe
2   AUT  2010            99.0  Western Europe
3   IRQ  2010            99.0     Middle East
4   USA  2010            99.0   North America
5   CAD  2010            99.0   North America
6   AFG  2015            99.0     Middle East
7   NLD  2015            99.0  Western Europe
8   AUT  2015            99.0  Western Europe
9   IRQ  2015            99.0     Middle East
10  USA  2015            99.0   North America
11  CAD  2015            99.0   North America

Because you have said you do not want to "merge" the dataframes, I assume you are trying to avoid both merge() and join(), otherwise I would have suggested the following:

df2 = df2.join(df1.drop(columns='name').set_index('code'), on='code')
  • Related