I'm missing the second line in a ggplot, there should be test and train lines present?-CodePudding

I'm trying to use ggplot2 using R to graph a train and test curve for the iterative error rates of a neural network. There should be two lines but I'm only seeing just the test line, does anyone know what happened? It looks like when I used head(error_df) every type is labelled as test for some reason.

Edit: even with just error_df without any subsets it's still not showing the line for the training set's error, this also includes various ranges such as error_df[2500:5000, 7500:10000,]

Here's the ggplot graph:

Here's the code and this is a link to a public google spreadsheet of the data:

library(Rcpp)
library(RSNNS)
library(ggplot2)
library(plotROC)
library(tidyr)

setwd("**set working directory**")
data <- read.csv("WDBC.csv", header=T)
data <- data[,1:4]
data <- scale(data)  # normalizes the data

numHneurons3 = 3
DecTargets = decodeClassLabels(data[,4])
train.test3 <- splitForTrainingAndTest(data, DecTargets,ratio = 0.50) # split

model3_02 <- mlp(train.test3$inputsTrain, train.test3$targetsTrain,  # build model3
                 size = numHneurons3, learnFuncParams = c(0.02),maxit = 10000, 
                 inputsTest = train.test3$inputsTest, 
                 targetsTest = train.test3$targetsTest)

#--------------------------------------
#     GGPlots of the Iterative Error:
#--------------------------------------

str(model3_02)

test_error <- model3_02$IterativeTestError
train_error <- model3_02$IterativeFitError

error_df <- data.frame(iter = c(seq_along(test_error),
                                seq_along(train_error)),
                       Error = c(test_error, train_error), 
                       type = c(rep("test", length(test_error)),
                                rep("train", length(train_error))
                       ))

ggplot(error_df[5000:10000,], aes(iter, Error, color = type, each = length(test_error)))   geom_line()

Here's also a snippet of the data, model, and data frame:

> head(data, 10)
       PatientID     radius    texture   perimeter
 [1,] -0.2361973  1.0960995 -2.0715123  1.26881726
 [2,] -0.2361956  1.8282120 -0.3533215  1.68447255
 [3,]  0.4313615  1.5784992  0.4557859  1.56512598
 [4,]  0.4317407 -0.7682333  0.2535091 -0.59216612
 [5,]  0.4318215  1.7487579 -1.1508038  1.77501133
 [6,] -0.2361855 -0.4759559 -0.8346009 -0.38680772
 [7,] -0.2361809  1.1698783  0.1605082  1.13712450
 [8,]  0.4326197 -0.1184126  0.3581350 -0.07280278
 [9,] -0.2361759 -0.3198854  0.5883121 -0.18391855
[10,]  0.4329621 -0.4731182  1.1044669 -0.32919213

> str(model3_02)
List of 17
 $ nInputs              : int 4
 $ maxit                : num 10000
 $ IterativeFitError    : num [1:10000] 18838 4468 2365 1639 1278 ...
 $ IterativeTestError   : num [1:10000] 7031 3006 1916 1431 1161 ...
 $ fitted.values        : num [1:284, 1:522] 0.00386 0.00386 0.00387 0.00387 0.00386 ...
 $ fittedTestValues     : num [1:285, 1:522] 0.00387 0.00387 0.00387 0.00387 0.00387 ...
 $ nOutputs             : int 522
 - attr(*, "class")= chr [1:2] "mlp" "rsnns"

> head(error_df)  
  iter     Error type
1    1 7031.3101 test
2    2 3006.4253 test
3    3 1915.8997 test
4    4 1430.6152 test
5    5 1160.6987 test
6    6  990.2686 test

CodePudding user response：

You created a data frame (error_df) with three columns by concatenating two variable together into one column, thus the variables were filled one after the other. However, you're limiting your plot from rows 5000 to 10000 of the data.

ggplot(error_df[c(5000:10000, 15000:20000),], aes(iter, Error, color = type, each = length(test_error))) geom_line()

should show both curves.