I am working on a machine learning project where I am creating data for a user. The data consists of the user's age, years of experience, city, type of business, and any previous loan. The rules for the data are as follows:
If a user has a good age, high experience, a good business, and no previous loan, the loan will be provided.
If a user has a good age, low experience, a good business, and no previous loan, the loan will not be provided.
If a user has a good age, high experience, a good business, and a previous loan, the loan will not be provided.
Following rules like these, I have created a CSV file containing all of this data. Below is the link to the CSV file:
https://drive.google.com/file/d/1zhKr8YR951Yp-_mC23hROy7AgJoRpF0m/view?usp=sharing
This file has data for age, experience, city (denoted by values from 2 to 9), type of business (denoted by values 7 and 8), previous loan (denoted by 0), and the final output as YES (1) or NO (0).
I am using the code below to train a model and predict whether a user will be granted a loan:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model
data = pd.read_csv("test.csv")
data.head()
X = data[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']]
Y = data["Output"]
train = data[:(int((len(data) * 0.8)))]
test = data[(int((len(data) * 0.8))):]
regr = linear_model.LinearRegression()
train_x = np.array(train[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
train_y = np.array(train["Output"])
regr.fit(train_x, train_y)
test_x = np.array(test[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
test_y = np.array(test["Output"])
coeff_data = pd.DataFrame(regr.coef_, X.columns, columns=["Coefficients"])
print(coeff_data)
# Now let's do prediction of data:
test_x2 = np.array([[41, 13, 9, 7, 0]]) # <- Here I am using some random values to test
Y_pred = regr.predict(test_x2)
Running the above code, I get a value of Y_pred such as 0.01543, 0.884, or sometimes 1.034. I am not able to understand what this output means. Initially I thought that 0.01543 means low confidence, so the loan will not be provided, and 0.884 means high confidence, so the loan will be provided. Is that correct? Can anyone please help me understand it?
Can anyone please provide a link to basic machine learning examples to get me started on this type of scenario? Thanks.
CodePudding user response:
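No, you are doing it wrong! The output has to be either 1 or 0, so this is a classification problem, not a regression problem. Linear Regression predicts an unbounded real number, which is why you see values like 1.034 that are neither a class label nor a valid probability. Use a classification algorithm such as Logistic Regression instead of Linear Regression: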
clf = linear_model.LogisticRegression()
train_x = np.array(train[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
train_y = np.array(train["Output"])
clf.fit(train_x, train_y)
test_x = np.array(test[['AGE', 'Experience', 'City', 'Business', 'Previous Loan']])
test_y = np.array(test["Output"])
test_x2 = np.array([[41, 13, 9, 7, 0]])
Y_pred = clf.predict(test_x2)
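If you want a confidence value rather than only the hard 0/1 label, a fitted LogisticRegression also exposes predict_proba. Here is a minimal, self-contained sketch. It uses synthetic data in place of test.csv, only three of the columns, and a made-up approval rule, all of which are assumptions for illustration and not the real dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for test.csv (assumption: the real file is not read here).
rng = np.random.default_rng(0)
n = 500
age = rng.integers(21, 60, size=n)
experience = rng.integers(0, 30, size=n)
previous_loan = rng.integers(0, 2, size=n)

# Hypothetical rule: approve when experience is high and there is no previous loan.
approved = ((experience >= 10) & (previous_loan == 0)).astype(int)

X = np.column_stack([age, experience, previous_loan])
clf = LogisticRegression(max_iter=1000)
clf.fit(X, approved)

applicant = np.array([[41, 13, 0]])    # age, experience, previous loan
label = clf.predict(applicant)         # hard class label: 0 or 1
proba = clf.predict_proba(applicant)   # [[P(class 0), P(class 1)]]
print(label, proba)
```

The second column of proba is the model's estimated probability of class 1 (loan granted). It always lies between 0 and 1, unlike the raw output of Linear Regression.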
And delete that coeff_data line, because it has no use. If you want to check the coefficients, then directly use this code:
clf.coef_
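For a binary problem, clf.coef_ is a 2-D array of shape (1, n_features), so it may be easier to read when paired with the column names. A small sketch, again on synthetic data with a hypothetical approval rule (the value ranges follow the description in the question, and the 0/1 coding for Previous Loan is an assumption):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_names = ['AGE', 'Experience', 'City', 'Business', 'Previous Loan']

# Synthetic data standing in for test.csv (assumption, not the real file).
rng = np.random.default_rng(1)
n = 400
X = np.column_stack([
    rng.integers(21, 60, size=n),  # AGE
    rng.integers(0, 30, size=n),   # Experience
    rng.integers(2, 10, size=n),   # City code, 2-9
    rng.integers(7, 9, size=n),    # Business code, 7-8
    rng.integers(0, 2, size=n),    # Previous Loan, assumed 0 or 1
])
# Hypothetical rule: approve with high experience and no previous loan.
y = ((X[:, 1] >= 10) & (X[:, 4] == 0)).astype(int)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# coef_ has shape (1, 5) here, so take row 0 to get one weight per feature.
coeff_data = pd.Series(clf.coef_[0], index=feature_names, name="Coefficients")
print(coeff_data)
```

Because the features are on different scales, the raw magnitudes are not directly comparable. The signs still show which direction each feature pushes the prediction.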
Check this link, it has a great explanation of loan approval with Machine Learning