Should I categorize both the test and train data?

An automotive service chain is launching its new grand service station this weekend. They offer to service a wide variety of cars. The current capacity of the station is to check 315 cars thoroughly per day. As an inaugural offer, they claim to freely check all cars that arrive on their launch day, and report whether they need servicing or not!

Unexpectedly, they get 450 cars. The servicemen will not work longer than the working hours, but the data analysts have to!

Can you save the day for the new service station?

How can a data scientist save the day for them? He has been given a data set, ServiceTrain.csv, that contains some easily measured attributes of each car and a conclusion on whether a service is needed or not. For the cars they cannot check in detail, they measure those attributes and store them in ServiceTest.csv.

I am trying to find the accuracy range (in %) of the predictions made over the test data. Should I encode the categorical variable in both the test and the train data?

After applying logistic regression, I think the resulting confusion matrix gives true positives = 29, true negatives = 94, and false positives = 5.

For data preparation, I tag the categorical values Yes as 1 and No as 0. The first few lines of my code:

import pandas as pd

train_data = pd.read_csv("ServiceTrain.csv")
test_data = pd.read_csv("ServiceTest.csv")
train_data.head()

CodePudding user response：

Yes, encode both. First encode the categorical variable in train_data:

train_data['Service'] = train_data['Service'].map({'Yes':1,'No':0})
train_data['Service']

Then separate the input and output features of train_data and check the size of each.
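As a rough sketch of that separation step, assuming Service is the target column and every other column is an input feature. The columns mileage and oil_quality and all values below are made up for illustration, since the real contents of ServiceTrain.csv are not shown:

```python
import pandas as pd

# Hypothetical stand-in for the already encoded train_data;
# the real column names in ServiceTrain.csv may differ.
train_data = pd.DataFrame({
    "mileage": [12000, 45000, 30000, 80000],
    "oil_quality": [0.9, 0.4, 0.7, 0.2],
    "Service": [0, 1, 0, 1],
})

# Input features: every column except the target.
X_train = train_data.drop(columns=["Service"])
# Output feature: the encoded target column.
y_train = train_data["Service"]

# Check the size of the input and output parts.
print(X_train.shape, y_train.shape)
```

The same two lines (drop the target, select the target) would apply to any other frame that has a Service column.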

Apply the same encoding to test_data:

test_data['Service'] = test_data['Service'].map({'Yes':1,'No':0})
test_data['Service']
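One way to keep the two encodings identical is to define the mapping once and reuse it. The sketch below assumes both files have a Service column containing only Yes and No; the small frames are made-up stand-ins for the CSV files, and encode_service is a hypothetical helper name, not something from the original post:

```python
import pandas as pd

# Made-up stand-ins for ServiceTrain.csv and ServiceTest.csv.
train_data = pd.DataFrame({"Service": ["Yes", "No", "No", "Yes"]})
test_data = pd.DataFrame({"Service": ["No", "Yes", "No"]})

# A single mapping shared by both data sets.
SERVICE_MAP = {"Yes": 1, "No": 0}

def encode_service(df):
    """Return a copy of df with the Service column mapped to 1/0."""
    out = df.copy()
    out["Service"] = out["Service"].map(SERVICE_MAP)
    # Series.map leaves NaN for values missing from the mapping,
    # so fail loudly instead of training on silent gaps.
    if out["Service"].isna().any():
        raise ValueError("unexpected label in 'Service' column")
    return out

train_encoded = encode_service(train_data)
test_encoded = encode_service(test_data)
print(train_encoded["Service"].tolist(), test_encoded["Service"].tolist())
```

The NaN check is there because a label such as "NO" or "yes" would not match the mapping and would otherwise pass through unnoticed.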

Then separate the input and output features of test_data and follow up with the confusion matrix.
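To illustrate how the confusion matrix leads to an accuracy figure, here is a minimal pure-Python sketch. The label lists are invented for the example and are not the asker's results; confusion_counts and accuracy are hypothetical helper names. In practice, scikit-learn's confusion_matrix and accuracy_score in sklearn.metrics cover the same ground.

```python
from collections import Counter

# Illustrative labels only: 1 = service needed, 0 = not needed.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1]

def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for binary labels encoded as 0/1."""
    pairs = Counter(zip(actual, predicted))
    return {
        "tp": pairs[(1, 1)],  # actual 1, predicted 1
        "tn": pairs[(0, 0)],  # actual 0, predicted 0
        "fp": pairs[(0, 1)],  # actual 0, predicted 1
        "fn": pairs[(1, 0)],  # actual 1, predicted 0
    }

def accuracy(counts):
    """Accuracy = correct predictions / all predictions."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total

counts = confusion_counts(y_true, y_pred)
print(counts)
print(f"accuracy = {accuracy(counts):.1%}")
```

All four cells are needed for the accuracy figure, because the denominator is the total number of predictions.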