I have a folder that contains 8000 images. I would like to know how to split these images into a train and a test set in R, for example 70% of the 8000 for training and 30% for testing. I tried: train <- sample(nrow(trainData), 0.7*nrow(trainData), replace = FALSE)
but this does not seem appropriate for pictures. I know that in Python we can do:
`from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,Y,test_size=0.15)`
but I don't know the equivalent in R.
I extract the images with this extract_feature function:
extract_feature <- function(dir_path, width, height, labelsExist = T) {
img_size <- width * height
## List images in path
images_names <- list.files(dir_path)
if(labelsExist){
## Select only cats or dogs images
catdog <- str_extract(images_names, "^(cat|dog)")
# Set cat == 0 and dog == 1
key <- c("cat" = 0, "dog" = 1)
y <- key[catdog]
}
print(paste("Start processing", length(images_names), "images"))
## Resize each image, convert it to grayscale, and flatten it into a vector
feature_list <- pblapply(images_names, function(imgname) {
## Read image
img <- readImage(file.path(dir_path, imgname))
## Resize image
img_resized <- resize(img, w = width, h = height)
## Set to grayscale (normalized to max)
grayimg <- channel(img_resized, "gray")
## Get the image as a matrix
img_matrix <- grayimg@.Data
## Coerce to a vector (row-wise)
img_vector <- as.vector(t(img_matrix))
return(img_vector)
})
## bind the list of vector into matrix
feature_matrix <- do.call(rbind, feature_list)
feature_matrix <- as.data.frame(feature_matrix)
## Set names
names(feature_matrix) <- paste0("pixel", c(1:img_size))
if(labelsExist){
return(list(X = feature_matrix, y = y))
}else{
return(feature_matrix)
}
}
# Takes approx. 15min
trainData <- extract_feature("/run/media/dogvscat/dogs-vs-cats/", width, height)
I would like to split this trainData into a training set (70%) and a validation set (30%).
CodePudding user response:
The features and labels are stored in a list, so you have to subset them separately. First, create a random vector of indices from 1 to the number of images (8000) using `sample`. Then use that same vector to select the pixel vectors and the labels, so that both stay in the same order. You can do this with the `[]` operator on the `X` data frame (built from `feature_matrix`) and on the `y` vector, both of which are inside the `trainData` list.
# simulated example
# 8000 images in rows with 100 pixels per image
# we have to select images row wise
feature_matrix <- matrix(rnorm(800000), ncol = 100)
feature_matrix <- as.data.frame(feature_matrix)
# simulated labels
y = sample(0:1, size = 8000, replace = T)
trainData <- list(X = feature_matrix, y = y)
# random vector for image selection
# optional - you can add set.seed for reproducibility
set.seed(77)
rnd = sample(1:8000, size = 0.7*8000)
# select images and labels separately from the list
train = list(X = trainData$X[rnd,], y = trainData$y[rnd])
test = list(X = trainData$X[-rnd,], y = trainData$y[-rnd])
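Note that the attempt in the question fails because `trainData` is a plain list, so `nrow(trainData)` returns NULL; the row count has to come from `trainData$X`. The same idea can be wrapped in a small helper that avoids hard-coding 8000. This is only a sketch: the name `split_train_test` and its `train_frac` and `seed` arguments are made up for illustration, and it assumes the input has the `list(X = ..., y = ...)` shape that `extract_feature` returns when `labelsExist` is TRUE.

```r
# Hypothetical helper (not from any package): split list(X = data.frame, y = vector)
# into a training part and a test part using one shared set of row indices.
split_train_test <- function(data, train_frac = 0.7, seed = NULL) {
  stopifnot(train_frac > 0, train_frac < 1)
  stopifnot(nrow(data$X) == length(data$y))
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  n <- nrow(data$X)
  # Row indices of the images that go into the training set
  idx <- sample(seq_len(n), size = round(train_frac * n))
  list(
    train = list(X = data$X[idx, , drop = FALSE], y = data$y[idx]),
    test  = list(X = data$X[-idx, , drop = FALSE], y = data$y[-idx])
  )
}

# Simulated stand-in for the output of extract_feature():
# 100 images with 16 pixels each, plus 0/1 labels
sim <- list(
  X = as.data.frame(matrix(rnorm(100 * 16), ncol = 16)),
  y = sample(0:1, size = 100, replace = TRUE)
)

parts <- split_train_test(sim, train_frac = 0.7, seed = 77)
dim(parts$train$X)    # rows and columns of the training features
length(parts$test$y)  # number of test labels
```

With the real data, the call would presumably look like `parts <- split_train_test(trainData, 0.7)`, after which `parts$train$X` and `parts$train$y` hold the matching features and labels.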