I have a data frame where I want to replace the values '<=50K' and '>50K' in the 'Salary' column with '0' and '1' respectively. I have tried the replace function but it does not change anything. I have tried a lot of things but nothing seems to work. I am trying to do some logistic regression on the cells but the formulas do not work because of the datatype. The real data set has over 20,000 rows.
Age  Workclass  fnlwgt  education  education-num  Salary
39   state-gov  455     Bachelors  13             <=50K
25   private    22      Masters    89             >50K
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']
This is the error I get when I try to use smf.logit(); see the code below. I don't understand why I get an error, because Age and education-num are both int64.
mod = smf.logit(formula = 'education-num ~ Age', data= dftrn)
resmod = modelAdm.fit()
ValueError: endog has evaluated to an array with multiple columns that has shape (26049, 16). This occurs when the variable converted to endog is non-numeric (e.g., bool or str).
CodePudding user response:
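You can try this. For checking purposes I copy the values into a new column; to change the original column instead, replace new_salary with Salary. Use .loc with the column name so that only that column is overwritten, not the whole row:
df['new_salary'] = df['Salary']
df.loc[df['new_salary'] == '<=50K', 'new_salary'] = 0
df.loc[df['new_salary'] == '>50K', 'new_salary'] = 1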
CodePudding user response:
Regarding the first question, you should use a single pair of square brackets on the left side of the assignment, and call replace once for each value:
df['Salary']= df['Salary'].replace(['<=50K'],'0')
df['Salary']= df['Salary'].replace(['>50K'],'1')
df['Salary']
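If replace still appears to change nothing, one possible cause is that the stored strings do not match exactly, for example because of leading or trailing spaces. That is only an assumption about your data; you can check it with df['Salary'].unique(). The sketch below uses a small made-up frame, strips whitespace, and maps both labels to integers in one step, so the column ends up numeric rather than holding the strings '0' and '1':

```python
import pandas as pd

# Made-up sample; the stray spaces are an assumption about the real data.
df = pd.DataFrame({"Salary": ["<=50K", ">50K", " <=50K", ">50K "]})

# Strip surrounding whitespace, then map each label to an integer.
# Any label missing from the dict would become NaN, which makes typos easy to spot.
df["Salary"] = df["Salary"].str.strip().map({"<=50K": 0, ">50K": 1})

print(df["Salary"].tolist())
print(df["Salary"].dtype)
```

Using map with a dict instead of two replace calls keeps the whole mapping in one place, and the result is an integer column that a model formula can use directly.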
As for the second part of the question, you are naming the model mod but you are calling the fit function on modelAdm.
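Apart from the variable name, the formula itself is a likely source of the ValueError. In a formula string the hyphen is an operator, so education-num is probably read as "education minus num". That would leave the string column education as the response, which is then expanded into one dummy column per category and would explain the 16 columns in the message. Wrapping the name in Q("...") tells the formula parser to treat it as a single column. Also note that logit expects a 0/1 response. The sketch below assumes Salary (already converted to 0/1) is the outcome you want to model, and uses synthetic data in place of your real frame:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the real data; all values are made up for illustration.
rng = np.random.default_rng(0)
n = 200
dftrn = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "education-num": rng.integers(1, 17, size=n),
})
# Binary 0/1 outcome loosely related to Age so the model has something to fit.
prob = 1 / (1 + np.exp(-(dftrn["Age"] - 40) / 10))
dftrn["Salary"] = (rng.random(n) < prob).astype(int)

# Q("...") quotes a column name that is not a valid Python identifier.
mod = smf.logit(formula='Salary ~ Age + Q("education-num")', data=dftrn)
resmod = mod.fit(disp=False)  # fit the same object that was created above
print(resmod.params)
```

Renaming the column, for example to education_num, would avoid the need for Q() altogether.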
Anyway, these are two different questions and should be asked in two separate posts.