When I work with my own data, I like to creat a dataset in this way :
birthdate | height | sick | color |
---|---|---|---|
date | numeric | logical | list |
Date of birth | Weight (cm) | Is the patient sick ? | Hair color |
20/03/1991 | 163 | 1 | blue |
10/11/1993 | 185 | 0 | brown |
Creating a new dataset
I like to keep the original dataset, and create a new one without the 2 first rows.
data_work <- data.table::copy(data.table::as.data.table(data))
data_work <- ds_work[i = -c(1:2)]
birthdate | height | sick | color |
---|---|---|---|
20/03/1991 | 163 | 1 | blue |
10/11/1993 | 185 | 0 | brown |
Labels
Then, I can use the 2nd row of the original dataset to label every columns.
label_colname <- as.character(data[2,])
data_work <- sjlabelled::set_label(data_work, label = label_colname)
Col types
I would like to basically do the same and use the first row of the original dataset to control all of my columns types. Is it possible ?
Note : I work with data.table
and possible col types include : "skip", "guess", "logical", "numeric", "date", "text" or "list"
CodePudding user response:
data <- read.csv("data.csv", sep=",", header= TRUE)
data <- as.data.table(data)
when data
is the data-table from your question including the first row (with the data types) after the header with column names, then the following will give you all columns with e.g. "date"
in the first row as date_columns
and then change the data type of these columns accordingly.
date_columns <- colnames(data)[grepl("date",data[1,])]
data[ ,(date_columns) := lapply(.SD, as.Date), .SDcols = date_columns]
just repeat these two steps for every type you want to detect (replace "date" in the code for the other data types). Afterward you can delete the first row where you specified the types from your data-table.
CodePudding user response:
Here's a general method for whatever "types" you may need.
The first premise is that we have unambiguous functions for each of the types you may have, namely
funcs <- list(
guess = function(z) type.convert(z, as.is = TRUE),
skip = identity,
logical = function(z) !is.na(z) & tolower(z) %in% c("1", "true"),
numeric = as.numeric,
date = function(z) as.Date(z, format = "%d/%m/%Y"),
text = as.character,
list = as.list)
Basic cleanup of the data to extract the types,
types <- unlist(data_work[1,])
labels <- unlist(data_work[2,])
data_work <- data_work[-(1:2),]
data_work
# birthdate height sick color
# <char> <char> <char> <char>
# 1: 20/03/1991 163 1 blue
# 2: 10/11/1993 185 0 brown
From here, we'll Map
each column over its respective function:
cols <- names(data_work)[ types %in% names(funcs) ]
funs <- funcs[ types[ types %in% names(funcs) ] ]
data_work[, (cols) := Map(function(f, x) f(x), funs, .SD),
.SDcols = cols]
# birthdate height sick color
# <Date> <num> <lgcl> <list>
# 1: 1991-03-20 163 TRUE blue
# 2: 1993-11-10 185 FALSE brown
Data
data_work <- structure(list(birthdate = c("date", "Date of birth", "20/03/1991", "10/11/1993"), height = c("numeric", "Weight (cm)", "163", "185"), sick = c("logical", "Is the patient sick ?", "1", "0"), color = c("list", "Hair color", "blue", "brown")), class = "data.frame", row.names = c(NA, -4L))
setDT(data_work)