Home > Software engineering >  How to set the type for each column from a row?
How to set the type for each column from a row?

Time:04-06

When I work with my own data, I like to creat a dataset in this way :

birthdate height sick color
date numeric logical list
Date of birth Weight (cm) Is the patient sick ? Hair color
20/03/1991 163 1 blue
10/11/1993 185 0 brown

Creating a new dataset

I like to keep the original dataset, and create a new one without the 2 first rows.

data_work <- data.table::copy(data.table::as.data.table(data))
data_work <- ds_work[i = -c(1:2)]
birthdate height sick color
20/03/1991 163 1 blue
10/11/1993 185 0 brown

Labels

Then, I can use the 2nd row of the original dataset to label every columns.

label_colname <- as.character(data[2,])
data_work <- sjlabelled::set_label(data_work, label = label_colname)

Col types

I would like to basically do the same and use the first row of the original dataset to control all of my columns types. Is it possible ?

Note : I work with data.table and possible col types include : "skip", "guess", "logical", "numeric", "date", "text" or "list"

CodePudding user response:

data <- read.csv("data.csv", sep=",", header= TRUE)
data <- as.data.table(data)

when data is the data-table from your question including the first row (with the data types) after the header with column names, then the following will give you all columns with e.g. "date" in the first row as date_columns and then change the data type of these columns accordingly.

date_columns <- colnames(data)[grepl("date",data[1,])]
data[ ,(date_columns) := lapply(.SD, as.Date), .SDcols = date_columns]

just repeat these two steps for every type you want to detect (replace "date" in the code for the other data types). Afterward you can delete the first row where you specified the types from your data-table.

CodePudding user response:

Here's a general method for whatever "types" you may need.

The first premise is that we have unambiguous functions for each of the types you may have, namely

funcs <- list(
  guess   = function(z) type.convert(z, as.is = TRUE),
  skip    = identity,
  logical = function(z) !is.na(z) & tolower(z) %in% c("1", "true"),
  numeric = as.numeric,
  date    = function(z) as.Date(z, format = "%d/%m/%Y"),
  text    = as.character,
  list    = as.list)

Basic cleanup of the data to extract the types,

types <- unlist(data_work[1,])
labels <- unlist(data_work[2,])
data_work <- data_work[-(1:2),]
data_work
#     birthdate height   sick  color
#        <char> <char> <char> <char>
# 1: 20/03/1991    163      1   blue
# 2: 10/11/1993    185      0  brown

From here, we'll Map each column over its respective function:

cols <- names(data_work)[ types %in% names(funcs) ]
funs <- funcs[ types[ types %in% names(funcs) ] ]
data_work[, (cols) := Map(function(f, x) f(x), funs, .SD), 
           .SDcols = cols]
#     birthdate height   sick  color
#        <Date>  <num> <lgcl> <list>
# 1: 1991-03-20    163   TRUE   blue
# 2: 1993-11-10    185  FALSE  brown

Data

data_work <- structure(list(birthdate = c("date", "Date of birth", "20/03/1991", "10/11/1993"), height = c("numeric", "Weight (cm)", "163", "185"), sick = c("logical", "Is the patient sick ?", "1", "0"), color = c("list", "Hair color", "blue", "brown")), class = "data.frame", row.names = c(NA, -4L))
setDT(data_work)
  • Related