How to encode pandas data frame column with three values fast?

Time:04-27

I have a pandas data frame that contains a column called Country. I have more than a million rows in my data frame.

Country
USA
Canada
Japan
India
Brazil
......

I want to create a new column called Country_Encode, which will replace USA with 1, Canada with 2, and all others with 0 like the following.

Country     Country_Encode
USA             1
Canada          2
Japan           0
India           0
Brazil          0
..................

I have tried the following.

# Row-by-row loop: correct, but slow on a million rows
for idx, row in df.iterrows():
    if df.loc[idx, 'Country'] == 'USA':
        df.loc[idx, 'Country_Encode'] = 1
    elif df.loc[idx, 'Country'] == 'Canada':
        df.loc[idx, 'Country_Encode'] = 2
    else:
        df.loc[idx, 'Country_Encode'] = 0

The above solution works but it is very slow. Do you know how I can do it in a fast way? I really appreciate any help you can provide.

CodePudding user response:

Assuming no row contains two country names, you could assign values in a vectorized way using a boolean condition:

df['Country_encode'] = df['Country'].eq('USA') + df['Country'].eq('Canada')*2

Output:

  Country  Country_encode
0     USA               1
1  Canada               2
2   Japan               0
3   India               0
4  Brazil               0

More generally, assigning through loc with a boolean mask is also vectorized and very fast:

df['Country_encode'] = 0
df.loc[df['Country'].eq('USA'), 'Country_encode'] = 1
df.loc[df['Country'].eq('Canada'), 'Country_encode'] = 2
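
If more countries need their own codes later, the same mask-based idea can be written with numpy.select. This is only a sketch and not part of the answer above: the sample frame is made up to mirror the question, and the names conditions and choices are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame mirroring the question's Country column
df = pd.DataFrame({"Country": ["USA", "Canada", "Japan", "India", "Brazil"]})

# One boolean mask per country that gets its own code;
# masks are checked in order and the first match wins
conditions = [df["Country"].eq("USA"), df["Country"].eq("Canada")]
choices = [1, 2]

# Rows matching none of the masks receive the default value 0
df["Country_encode"] = np.select(conditions, choices, default=0)
print(df)
```

Adding another country then only means appending one mask and one code to the two lists.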

CodePudding user response:

There are many ways to do this. The most basic one is the following:

def coding(country):
    # apply passes one value of the Country column at a time
    if country == "USA":
        return 1
    elif country == "Canada":
        return 2
    else:
        return 0

df["Country_code"] = df["Country"].apply(coding)
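
Because apply calls the Python function once per value, it can still be slow on a million rows. A dictionary lookup with Series.map expresses the same mapping without a custom function. The following is a minimal sketch, assuming a made-up sample frame and a hypothetical codes dictionary that are not part of the answer above.

```python
import pandas as pd

# Hypothetical sample frame mirroring the question's Country column
df = pd.DataFrame({"Country": ["USA", "Canada", "Japan", "India", "Brazil"]})

# Only the countries that need a non-zero code are listed
codes = {"USA": 1, "Canada": 2}

# map() yields NaN for countries missing from the dict;
# fillna(0) turns those into 0 and astype(int) restores integer codes
df["Country_code"] = df["Country"].map(codes).fillna(0).astype(int)
print(df)
```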