How to split an image dataset into train and test sets with an R script

I have a folder that contains 8000 images. I would like to know how to split these images into a train set and a test set in R, for example 70% of the 8000 images for training and 30% for testing. I tried train <- sample(nrow(trainData), 0.7*nrow(trainData), replace = FALSE), but that does not seem appropriate for pictures. I know that in Python we can do:

`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.15)`

but I don't know how to do this in R.

I extract features from the images with the following extract_feature function:

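library(EBImage)  # readImage, resize, channel
library(stringr)  # str_extract
library(pbapply)  # pblapply

extract_feature <- function(dir_path, width, height, labelsExist = TRUE) {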
  img_size <- width * height
  
  ## List images in path
  images_names <- list.files(dir_path)
  
  if(labelsExist){
    ## Select only cats or dogs images
    catdog <- str_extract(images_names, "^(cat|dog)")
    # Set cat == 0 and dog == 1
    key <- c("cat" = 0, "dog" = 1)
    y <- key[catdog]
  }
  
  print(paste("Start processing", length(images_names), "images"))
  ## Resize each image, convert it to greyscale, and flatten it into a vector
  feature_list <- pblapply(images_names, function(imgname) {
    ## Read image
    img <- readImage(file.path(dir_path, imgname))
    ## Resize image
    img_resized <- resize(img, w = width, h = height)
    ## Set to grayscale (normalized to max)
    grayimg <- channel(img_resized, "gray")
    ## Get the image as a matrix
    img_matrix <- grayimg@.Data
    ## Coerce to a vector (row-wise)
    img_vector <- as.vector(t(img_matrix))
    return(img_vector)
  })
  ## Bind the list of vectors into a matrix
  feature_matrix <- do.call(rbind, feature_list)
  feature_matrix <- as.data.frame(feature_matrix)
  ## Set names
  names(feature_matrix) <- paste0("pixel", c(1:img_size))
  
  if(labelsExist){
    return(list(X = feature_matrix, y = y))
  }else{
    return(feature_matrix)
  }
}

# Takes approx. 15 min; width and height must be defined beforehand
trainData <- extract_feature("/run/media/dogvscat/dogs-vs-cats/", width, height)

I would like to split this trainData into a train set (70%) and a validation set (30%).

CodePudding user response：

The features and labels are stored in a list, so you have to subset them separately. First, create a random vector of row indices drawn from 1 to the number of images (8000) using sample. Then use that same vector to select both the pixel vectors and the labels, so that they stay in the same order. You can do this with the [] operator on the feature_matrix data frame (X) and on the y vector, both of which are inside the trainData list.

# simulated example
# 8000 images in rows with 100 pixels per image
# we have to select images row wise
feature_matrix <- matrix(rnorm(800000), ncol = 100)
feature_matrix <- as.data.frame(feature_matrix)

# simulated labels
y <- sample(0:1, size = 8000, replace = TRUE)
trainData <- list(X = feature_matrix, y = y)

# random vector of row indices for image selection
# optional - you can add set.seed for reproducibility
set.seed(77)
rnd <- sample(1:8000, size = 0.7 * 8000)

# select images and labels separately, keeping the list names X and y
train <- list(X = trainData$X[rnd, ], y = trainData$y[rnd])
test <- list(X = trainData$X[-rnd, ], y = trainData$y[-rnd])
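
The same idea can be wrapped in a small reusable function. The sketch below is only an illustration: split_train_test is a hypothetical helper name, not part of any package. It assumes the input is a list with a data frame X and a label vector y, like the one returned by extract_feature with labelsExist = TRUE. It uses round() so that the training size is a whole number even when the fraction does not divide the number of images evenly.

```r
# Hypothetical helper (not from the original post): split a list of the form
# list(X = <data frame>, y = <vector>) into train and test parts.
split_train_test <- function(data, train_frac = 0.7, seed = NULL) {
  # Features and labels must describe the same images
  stopifnot(nrow(data$X) == length(data$y))
  # The fraction must leave at least one image on each side
  stopifnot(train_frac > 0, train_frac < 1)
  if (!is.null(seed)) set.seed(seed)

  n <- nrow(data$X)
  n_train <- round(train_frac * n)
  stopifnot(n_train >= 1, n_train < n)

  # Row indices of the training images, drawn without replacement
  idx <- sample(seq_len(n), size = n_train)

  list(
    train = list(X = data$X[idx, , drop = FALSE], y = data$y[idx]),
    test  = list(X = data$X[-idx, , drop = FALSE], y = data$y[-idx])
  )
}

# Simulated data, smaller than in the answer above to keep the example fast:
# 1000 images in rows with 10 pixels per image
sim <- list(
  X = as.data.frame(matrix(rnorm(1000 * 10), ncol = 10)),
  y = sample(0:1, size = 1000, replace = TRUE)
)

parts <- split_train_test(sim, train_frac = 0.7, seed = 77)
nrow(parts$train$X)  # number of training images
nrow(parts$test$X)   # number of test images
```

With the real data, the call would be split_train_test(trainData, 0.7, seed = 77), and the pieces are then available as parts$train$X, parts$train$y, parts$test$X and parts$test$y.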