I have measurements of several characters (e.g., tail length) from hundreds of lizards. I have classified these lizards into 5 species based on a variety of methods and, as an additional measure of diagnosability, I would like to run a Discriminant Function Analysis (DFA). I am following this tutorial (https://www.r-bloggers.com/2021/05/linear-discriminant-analysis-in-r/), but I have some problems and questions.
Why should I set 60% of the dataset for training and 40% for testing (as in the tutorial)? How do I decide which proportions are best suited for my dataset? I see that the results change considerably when I change these values, but I didn't understand what they actually influence in the analysis.
I need to report the accuracy of the model for each species, e.g. the analysis correctly assigned X% of the samples identified a priori as "species A" to that species, with X% incorrectly assigned to "species B". I was unable to obtain such accuracy rates.
If anyone can help me with this (you can give examples using the iris dataset, as in the tutorial above), I would be very grateful.
CodePudding user response:
- When you create a model, it's good practice to split your dataset into a training set and a testing set. You train the model on one part of the data, then evaluate it on the untouched test data to see how well it performs on observations it has never seen.
Iris is a relatively small dataset, so you won't have many data points in your training set. As a result, partitioning a different percentage of a small dataset for training can noticeably change the fitted model and its accuracy estimates; a larger dataset would typically produce more stable results. A minimal sketch of the tutorial's split is shown below.
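For reference, here is a minimal sketch of the tutorial's 60/40 split (the lda() call is from the MASS package, as in the tutorial; the prob argument is the only thing you change to try different proportions):
library(MASS)
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]                 # ~60% of rows, used to fit the model
testing  <- iris[ind == 2, ]                 # ~40% of rows, held out for evaluation
linear <- lda(Species ~ ., data = training)  # fit LDA on the training set only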
- The author of the article covers reporting the accuracy of the model for each species under "Confusion matrix and accuracy - training/testing data".
p2 <- predict(linear, testing)$class                     # predicted species for the test set
tab1 <- table(Predicted = p2, Actual = testing$Species)  # confusion matrix
tab1
            Actual
Predicted    setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         22         0
  virginica       0          1        24
sum(diag(tab1)) / sum(tab1)                              # overall accuracy
You calculate the accuracy for each species by going down its Actual column: the count on the diagonal divided by the column total. Here that gives 100% for setosa (17 out of 17), 95.65% for versicolor (22 out of 23), and 100% for virginica (24 out of 24). A quick way to compute this in R is shown below.
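The same per-species rates fall out of the confusion matrix directly (diagonal counts divided by column totals):
diag(tab1) / colSums(tab1)  # correct predictions per species / total actual per species
#     setosa versicolor  virginica
#  1.0000000  0.9565217  1.0000000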
Confusion matrix metrics: https://arxiv.org/pdf/2008.05756.pdf
Example matrix explained using iris: https://www.analyticsvidhya.com/blog/2021/06/confusion-matrix-for-multi-class-classification/
Example matrix explained using fruits: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826
If you're familiar with the tidyverse ecosystem, then you might want to check out yardstick.
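For example, something along these lines should work (a sketch assuming the testing and p2 objects from the code above; check the yardstick docs for details):
library(yardstick)
# yardstick expects truth and estimate as factor columns in a data frame
results <- data.frame(truth = testing$Species, estimate = p2)
conf_mat(results, truth, estimate)  # confusion matrix
accuracy(results, truth, estimate)  # overall accuracy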
NOTE: I ran set.seed(123) and copied the training/testing code from the tutorial. I'm not sure why I get 20 setosa in the testing data while the author gets 17:
            Actual
Predicted    setosa versicolor virginica
  setosa         20          0         0
  versicolor      0         19         1
  virginica       0          1        20
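One possible cause of that mismatch: R 3.6.0 changed the default algorithm behind sample(), so the same seed can produce a different split than in older R versions. If that is what's happening, requesting the old sampling behavior before re-running the split should reproduce the author's numbers:
# reproduce pre-3.6.0 sampling behavior (emits a warning, which is expected)
set.seed(123, sample.kind = "Rounding")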
CodePudding user response:
Many thanks! Regarding my first question, authors who use DFA for the same purpose as mine often seem to train the model on the entire dataset and then test it on the entire dataset as well. What should I change in the code to do this instead of splitting the data into training and testing sets? I tried the code below, but it didn't work.
set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(1, 1))  # equal weights, but each row is still assigned to only one group
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
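Is the right change instead to skip the split entirely and fit and predict on the full dataset? Something like this minimal sketch (assuming the lda() call from the MASS package, as in the tutorial):
library(MASS)
linear <- lda(Species ~ ., data = iris)             # fit on all rows
p <- predict(linear, iris)$class                    # predict on the same rows
tab <- table(Predicted = p, Actual = iris$Species)  # confusion matrix for the full data
tab
diag(tab) / colSums(tab)                            # per-species accuracy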