When I work with my own data, I like to creat a dataset in this way :

birthdate	height	sick	color
date	numeric	logical	list
Date of birth	Weight (cm)	Is the patient sick ?	Hair color
20/03/1991	163	1	blue
10/11/1993	185	0	brown

Creating a new dataset

I like to keep the original dataset, and create a new one without the 2 first rows.

data_work <- data.table::copy(data.table::as.data.table(data))
data_work <- ds_work[i = -c(1:2)]

birthdate	height	sick	color
20/03/1991	163	1	blue
10/11/1993	185	0	brown

Labels

Then, I can use the 2nd row of the original dataset to label every columns.

label_colname <- as.character(data[2,])
data_work <- sjlabelled::set_label(data_work, label = label_colname)

Col types

I would like to basically do the same and use the first row of the original dataset to control all of my columns types. Is it possible ?

Note : I work with data.table and possible col types include : "skip", "guess", "logical", "numeric", "date", "text" or "list"

CodePudding user response：

data <- read.csv("data.csv", sep=",", header= TRUE)
data <- as.data.table(data)

when data is the data-table from your question including the first row (with the data types) after the header with column names, then the following will give you all columns with e.g. "date" in the first row as date_columns and then change the data type of these columns accordingly.

date_columns <- colnames(data)[grepl("date",data[1,])]
data[ ,(date_columns) := lapply(.SD, as.Date), .SDcols = date_columns]

just repeat these two steps for every type you want to detect (replace "date" in the code for the other data types). Afterward you can delete the first row where you specified the types from your data-table.

CodePudding user response：

Here's a general method for whatever "types" you may need.

The first premise is that we have unambiguous functions for each of the types you may have, namely

funcs <- list(
  guess   = function(z) type.convert(z, as.is = TRUE),
  skip    = identity,
  logical = function(z) !is.na(z) & tolower(z) %in% c("1", "true"),
  numeric = as.numeric,
  date    = function(z) as.Date(z, format = "%d/%m/%Y"),
  text    = as.character,
  list    = as.list)

Basic cleanup of the data to extract the types,

types <- unlist(data_work[1,])
labels <- unlist(data_work[2,])
data_work <- data_work[-(1:2),]
data_work
#     birthdate height   sick  color
#        <char> <char> <char> <char>
# 1: 20/03/1991    163      1   blue
# 2: 10/11/1993    185      0  brown

From here, we'll Map each column over its respective function:

cols <- names(data_work)[ types %in% names(funcs) ]
funs <- funcs[ types[ types %in% names(funcs) ] ]
data_work[, (cols) := Map(function(f, x) f(x), funs, .SD), 
           .SDcols = cols]
#     birthdate height   sick  color
#        <Date>  <num> <lgcl> <list>
# 1: 1991-03-20    163   TRUE   blue
# 2: 1993-11-10    185  FALSE  brown

Data

data_work <- structure(list(birthdate = c("date", "Date of birth", "20/03/1991", "10/11/1993"), height = c("numeric", "Weight (cm)", "163", "185"), sick = c("logical", "Is the patient sick ?", "1", "0"), color = c("list", "Hair color", "blue", "brown")), class = "data.frame", row.names = c(NA, -4L))
setDT(data_work)