I was working on a dataframe like below:
df:
Site  Visits  Temp  Type
KFC      511    74  Food
KFC      565    77  Food
KFC      498    72  Food
K&G      300    75  Gas
K&G      255    71  Gas
I wanted to turn the 'Type' column into a 0/1 variable so I could use df.corr() to check the correlation.
I tried two ways. The first was to make a dictionary and use it to create a new column:
type_map = {'Food': 1, 'Gas': 0}  # renamed from dict to avoid shadowing the built-in
df['BinaryType'] = df['Type'].map(type_map)
I was then able to use df.corr() to check the correlation between 'Visits' and 'BinaryType'. Since the 'Type' column contains strings, df.corr() does not show a correlation between 'Visits' and 'Type'.
The second way was to use .loc:
df.loc[df['Type'] == 'Food', 'Type'] = 1
df.loc[df['Type'] != 1, 'Type'] = 0
Then I checked df in the console. It looked like the output below, so it seemed the change had been made in place. I also checked the first value with df['Type'][0] and it printed 1 (which I assumed was an integer):
Site  Visits  Temp  Type
KFC      511    74     1
KFC      565    77     1
KFC      498    72     1
K&G      300    75     0
K&G      255    71     0
Here, however, df.corr() still would not show a correlation between 'Visits' and 'Type'! It was as if the column hadn't been changed at all.
You can use the code below to reproduce this:
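import pandas as pd

df = pd.DataFrame({
    'Site': {0: 'KFC', 1: 'KFC', 2: 'KFC', 3: 'K&G', 4: 'K&G'},
    'Visits': {0: 511, 1: 565, 2: 498, 3: 300, 4: 255},
    'Temp': {0: 74, 1: 77, 2: 72, 3: 75, 4: 71},
    'Type': {0: 'Food', 1: 'Food', 2: 'Food', 3: 'Gas', 4: 'Gas'}})
# Method 1: map the strings through a dictionary into a new column
type_map = {'Food': 1, 'Gas': 0}
df['BinaryType'] = df['Type'].map(type_map)
df.corr()  # on pandas 2.0+, use df.corr(numeric_only=True)
del df['BinaryType']
# Method 2: overwrite the values in place with .loc
df.loc[df['Type'] == 'Food', 'Type'] = 1
df.loc[df['Type'] != 1, 'Type'] = 0
df.corr()  # 'Type' is still missing from the result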
Any idea how pandas .loc works in the background?
CodePudding user response:
Since your first method works, you can simply assign the mapped values back to the original column:
type_map = {'Food': 1, 'Gas': 0}  # renamed from dict to avoid shadowing the built-in
df['Type'] = df['Type'].map(type_map)
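For reference, here is a minimal self-contained sketch of that approach. The trimmed-down frame and the name `type_codes` are illustrative assumptions, not part of the original post. It shows that the mapped column comes back with a numeric dtype, which is why df.corr() picks it up.

```python
import pandas as pd

# Trimmed-down version of the question's data (only the columns needed here)
df = pd.DataFrame({
    'Visits': [511, 565, 498, 300, 255],
    'Type': ['Food', 'Food', 'Food', 'Gas', 'Gas'],
})

# `type_codes` is a hypothetical name for the lookup table
type_codes = {'Food': 1, 'Gas': 0}
df['Type'] = df['Type'].map(type_codes)

print(df['Type'].dtype)  # a numeric dtype, since every value was mapped
print(df.corr())         # 'Type' now appears in the correlation matrix
```

One thing to keep in mind: any value that is missing from the dictionary is mapped to NaN, so it is worth checking for unexpected categories first.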
CodePudding user response:
Your second method doesn't actually change the dtype of the series, even though the values are now all ints. You can see that by running df.dtypes, which shows that the Type column is still of object dtype. df.corr() only considers numeric columns, so it skips Type.
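A small sketch of this check is below. It assumes the 'Type' column starts out as object dtype (set explicitly here, because that is how the pandas version in the question stores strings), and the trimmed-down frame is for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'Visits': [511, 565, 498, 300, 255],
    # Assumption: force object dtype to mirror the setup in the question
    'Type': pd.Series(['Food', 'Food', 'Food', 'Gas', 'Gas'], dtype=object),
})

# The two .loc assignments from the question
df.loc[df['Type'] == 'Food', 'Type'] = 1
df.loc[df['Type'] != 1, 'Type'] = 0

print(df['Type'].tolist())  # the values are now 1s and 0s
print(df.dtypes)            # but 'Type' is still reported as object

# Only columns with a numeric dtype are candidates for correlation
numeric_cols = df.select_dtypes(include='number').columns
print(list(numeric_cols))   # 'Type' is not among them
```

In other words, .loc replaces the values one selection at a time and leaves the column's container type alone, so the column has to be converted separately.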
You need to explicitly cast the column to int with df['Type'] = df['Type'].astype(int), or build the column directly with df['Type'] = np.where(df['Type'] == 'Food', 1, 0).
Running df.corr() after that gives:
In [22]: df.corr()
Out[22]:
          Visits      Temp      Type
Visits  1.000000  0.498462  0.976714
Temp    0.498462  1.000000  0.305888
Type    0.976714  0.305888  1.000000
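Both fixes can be put side by side in a short sketch. The names `via_where` and `via_cast` are hypothetical, the 'Site' column is left out so that every remaining column is numeric, and the explicit object dtype is an assumption that mirrors the question's setup.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Visits': [511, 565, 498, 300, 255],
    'Temp': [74, 77, 72, 75, 71],
    # Assumption: object dtype, as in the pandas version used in the question
    'Type': pd.Series(['Food', 'Food', 'Food', 'Gas', 'Gas'], dtype=object),
})

# Option A: build the 0/1 column in one step with np.where
via_where = df.assign(Type=np.where(df['Type'] == 'Food', 1, 0))

# Option B: keep the .loc assignments, then cast the column explicitly
via_cast = df.copy()
via_cast.loc[via_cast['Type'] == 'Food', 'Type'] = 1
via_cast.loc[via_cast['Type'] != 1, 'Type'] = 0
via_cast['Type'] = via_cast['Type'].astype(int)

print(via_where.corr())
print(via_cast.corr())
```

Either option should give the same correlation matrix. np.where avoids the intermediate object column altogether, while astype(int) is the smaller change if the .loc code is already in place.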