Writing and reading class of columns to csv

For a dataframe, I'd like to save the data class of each column (e.g. character, double, factor) to a csv, and then be able to read both the data and the classes back into R.

For example, my data might look like this:

df
#> # A tibble: 3 × 3
#>    item  cost blue 
#>   <int> <int> <fct>
#> 1     1     4 1    
#> 2     2    10 1    
#> 3     3     3 0

(code for data input here:)

library(tidyverse)
df <- tibble::tribble(
  ~item, ~cost, ~blue,
     1L,    4L,    1L,
     2L,   10L,    1L,
     3L,    3L,    0L
  )

df <- df %>% 
  mutate(blue = as.factor(blue))
df

I'm able to save the classes of the data, and the data, this way:

library(tidyverse)
classes <- map_df(df, class)

write_csv(classes, "classes.csv")
write_csv(df, "data.csv")

and I can read it back this way:

classes <- read.csv("classes.csv") %>% 
  slice(1) %>% 
  unlist()
classes
df2 <- read_csv("data.csv", col_types = classes)
df2

Is there a quicker way to do all of this?

Particularly with the way I'm saving classes and then reading it back in, then slicing and unlisting?

CodePudding user response：

You could use writeLines and its counterpart readLines for the classes. Like this:

classes <- sapply(df, class)
writeLines(classes, "classes.txt")
# to read them back
readLines("classes.txt")
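
To complete the round trip, the classes read back this way can be handed to base R's read.csv through its colClasses argument. This is only a minimal sketch: it assumes every column has exactly one class (ordered factors or date-times have several and would need other handling), and the temporary file names are placeholders rather than anything from the question.

```r
# Example data shaped like the data frame in the question
df <- data.frame(
  item = 1:3,
  cost = c(4L, 10L, 3L),
  blue = factor(c(1L, 1L, 0L))
)

# Placeholder file locations (assumption: any writable paths would do)
data_file  <- tempfile(fileext = ".csv")
class_file <- tempfile(fileext = ".txt")

# Save the data and, separately, one class name per line
write.csv(df, data_file, row.names = FALSE)
writeLines(sapply(df, class), class_file)

# Read the classes back and apply them by column position
classes <- readLines(class_file)
df2 <- read.csv(data_file, colClasses = classes)
str(df2)
```

Note that a csv file does not store factor levels, so the levels of a factor column are rebuilt from the values that appear in the file.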

However, also consider formats such as parquet (available in R through the arrow package), which preserve the data types and are supported by many languages.
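
As a rough illustration of the parquet route, here is a small sketch that assumes the arrow package is installed. The example data and the temporary file path are placeholders chosen for the illustration.

```r
library(arrow)

# Example data shaped like the data frame in the question
df <- data.frame(
  item = 1:3,
  cost = c(4L, 10L, 3L),
  blue = factor(c(1L, 1L, 0L))
)

# Placeholder file location
path <- tempfile(fileext = ".parquet")

# The column types are stored inside the file, so no separate classes file is needed
write_parquet(df, path)
df2 <- read_parquet(path)

# Inspect the classes after the round trip
sapply(df2, class)
```

The trade-off is that a parquet file is binary, so it cannot be opened in a text editor or spreadsheet the way a csv can.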

CodePudding user response：

Try the csvy package; also see the http://csvy.org/ site. It generates a single file rather than two, which simplifies working with it. There are csvy readers available in some other languages as well (see the link just cited), and the format is standardized and backwards compatible with csv, which is probably better than rolling your own format.

library(csvy)
write_csvy(df, "df.csvy")

This produces this file:

#---
#profile: tabular-data-package
#name: df
#fields:
#- name: item
#  type: integer
#- name: cost
#  type: integer
#- name: blue
#  type: integer
#--- 
item,cost,blue
1,4,1
2,10,1
3,3,0

which can be read back in using:

read_csvy("df.csvy")

or, skipping the metadata header as comment lines, using read.csv("df.csvy", comment.char = "#") or any of the many R packages that have functions to read csv files.

We can extract the metadata as a list using:

library(yaml)
md <- get_yaml_header("df.csvy")
md_list <- yaml.load(paste(md, collapse = "\n"))

str(md_list)
## List of 3
##  $ profile: chr "tabular-data-package"
##  $ name   : chr "df"
##  $ fields :List of 3
##   ..$ :List of 2
##   .. ..$ name: chr "item"
##   .. ..$ type: chr "integer"
##   ..$ :List of 2
##   .. ..$ name: chr "cost"
##   .. ..$ type: chr "integer"
##   ..$ :List of 2
##   .. ..$ name: chr "blue"
##   .. ..$ type: chr "integer"
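
If a named vector of column types is more convenient than the nested list, the fields entry can be flattened. The sketch below assumes the yaml package is installed and rebuilds the same structure from a header string so that it runs on its own; with a real file, md_list from the code above would be used instead.

```r
library(yaml)

# Header text equivalent to the metadata shown above (without the leading "#")
header <- paste(c(
  "profile: tabular-data-package",
  "name: df",
  "fields:",
  "- name: item",
  "  type: integer",
  "- name: cost",
  "  type: integer",
  "- name: blue",
  "  type: integer"
), collapse = "\n")

md_list <- yaml.load(header)

# Pull the name and type out of each field and combine them into a named vector
field_names <- vapply(md_list$fields, function(f) f$name, character(1))
field_types <- vapply(md_list$fields, function(f) f$type, character(1))
types <- setNames(field_types, field_names)
types
```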