R arrow: set column type/schema to char for all columns

Time:03-02

{arrow}'s auto-detection of column types is causing me some trouble when opening a large CSV file. In particular, it drops leading zeroes from some identifiers and does some other unfortunate things. As the dataset is quite wide (a few hundred columns) and I don't want to set all schema values manually, I would like to set them programmatically.

A good start would be to convert all columns to character when opening the dataset with arrow::open_dataset, or to correct the existing dataset_connection$schema object for particular columns.

However, I was not able to find out how to do so.

Answer:

When you use arrow::open_dataset(), you can manually define a schema that determines the column names and types. The example below first shows the default behaviour of auto-detecting column names and types, then uses a schema to override this and specify your own column names and types. It builds the schema programmatically, as requested, but you can also define one by hand.

library(arrow)

write_dataset(mtcars, "mtcars")

# opens the dataset with column detection
dataset <- open_dataset("mtcars")
dataset
#> FileSystemDataset with 1 Parquet file
#> mpg: double
#> cyl: double
#> disp: double
#> hp: double
#> drat: double
#> wt: double
#> qsec: double
#> vs: double
#> am: double
#> gear: double
#> carb: double
#> 
#> See $metadata for additional Schema metadata

# define new schema automatically
chosen_schema <- schema(
  purrr::map(names(dataset), ~Field$create(name = .x, type = string()))
)

# now opens the dataset with the chosen schema
open_dataset("mtcars", schema = chosen_schema) 
#> FileSystemDataset with 1 Parquet file
#> mpg: string
#> cyl: string
#> disp: string
#> hp: string
#> drat: string
#> wt: string
#> qsec: string
#> vs: string
#> am: string
#> gear: string
#> carb: string
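
The example above uses a Parquet dataset, while the question is about CSV. Below is a minimal sketch of the same idea for a CSV source, not a definitive recipe. The sample file, the column names (id, zip, amount) and the use of dplyr::collect() are assumptions made for illustration. It also assumes a recent arrow release in which open_dataset() accepts skip = 1 for CSV input. Older releases may expect skip_rows = 1 instead, so check the documentation of your installed version.

```r
library(arrow)

# Hypothetical sample data: identifiers with leading zeroes, stored as CSV.
csv_dir <- tempfile("ids_csv_")
dir.create(csv_dir)
writeLines(
  c("id,zip,amount", "007,01234,1.50", "042,00501,20"),
  file.path(csv_dir, "part-0.csv")
)

# Open once with auto-detection, only to obtain the column names
# without typing a few hundred of them by hand.
detected <- open_dataset(csv_dir, format = "csv")

# Build one string field per detected column name.
all_string <- schema(
  lapply(names(detected), function(nm) Field$create(name = nm, type = string()))
)

# When a schema is supplied for a CSV file, the header row is no longer used
# for column names, so it has to be skipped explicitly (assumption: skip = 1
# is supported by the installed arrow version).
ds <- open_dataset(csv_dir, format = "csv", schema = all_string, skip = 1)

# Pull the data into R; every column should now be character.
result <- dplyr::collect(ds)
result$id
```

To override only particular columns, the same lapply() call can choose a type per column name instead of always returning string().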