I have a pandas data frame that contains a column called Country. I have more than a million rows in my data frame.
Country
USA
Canada
Japan
India
Brazil
......
I want to create a new column called Country_Encode, which will replace USA with 1, Canada with 2, and all others with 0, like the following.
Country Country_Encode
USA 1
Canada 2
Japan 0
India 0
Brazil 0
..................
I have tried the following.
for idx, row in df.iterrows():
    if (df.loc[idx, 'Country'] == 'USA'):
        df.loc[idx, 'Country_Encode'] = 1
    elif (df.loc[idx, 'Country'] == 'Canada'):
        df.loc[idx, 'Country_Encode'] = 2
    elif ((df.loc[idx, 'Country'] != 'USA') and (df.loc[idx, 'Country'] != 'Canada')):
        df.loc[idx, 'Country_Encode'] = 0
The above solution works but it is very slow. Do you know how I can do it in a fast way? I really appreciate any help you can provide.
CodePudding user response:
Assuming no row contains two country names, you could assign values in a vectorized way using a boolean condition:
df['Country_encode'] = df['Country'].eq('USA') + df['Country'].eq('Canada')*2
Output:
Country Country_encode
0 USA 1
1 Canada 2
2 Japan 0
3 India 0
4 Brazil 0
But in general, loc is very fast:
df['Country_encode'] = 0
df.loc[df['Country'].eq('USA'), 'Country_encode'] = 1
df.loc[df['Country'].eq('Canada'), 'Country_encode'] = 2
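The same idea can also be written as a single expression with numpy.select, which takes a list of boolean conditions and a matching list of values. This is only a sketch: the small sample frame is built from the five countries shown in the question, and the names conditions and choices are chosen here for illustration.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question; the real frame has over a million rows.
df = pd.DataFrame({"Country": ["USA", "Canada", "Japan", "India", "Brazil"]})

# Conditions are checked in order; rows matching none of them get the default.
conditions = [df["Country"].eq("USA"), df["Country"].eq("Canada")]
choices = [1, 2]
df["Country_encode"] = np.select(conditions, choices, default=0)

print(df)
```

Adding another country only means appending one condition and one value to the two lists.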
CodePudding user response:
There are many ways to do this. The most basic one is the following:
def coding(row):
    if row == "USA":
        return 1
    elif row == "Canada":
        return 2
    else:
        return 0

df["Country_code"] = df["Country"].apply(coding)
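Note that apply calls the Python function once per value, so with a million rows a dictionary lookup through Series.map is typically the quicker way to express the same rule. A minimal sketch, assuming the five sample countries from the question; the name mapping is hypothetical and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({"Country": ["USA", "Canada", "Japan", "India", "Brazil"]})

# Hypothetical helper dict: only the countries that need a non-zero code.
mapping = {"USA": 1, "Canada": 2}

# Countries missing from the dict come back as NaN, so fill them with 0
# and cast back to integers.
df["Country_code"] = df["Country"].map(mapping).fillna(0).astype(int)

print(df)
```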