Home > Software design >  R tibble with comma separated fields - read/write_csv() incorrectly parses data as double
R tibble with comma separated fields - read/write_csv() incorrectly parses data as double

Time:07-09

I hope the title makes sense. I will explain a bit here.

I am working with data that comes from a network performance monitoring tool running synthetic transactions (mimicking user activity by making timed and measurable transactions allowing for performance analysis and problem detection). Several of the output fields are capturing different values like Header Read Times, TLS Times, etc for multiple transactions in a single test. These fields have the data separated by comma. When the data is first retrieved from the API and converted from JSON to a tibble, theses fields are correctly parsed as:

metrics.HeaderReadTimes
"120,186,191,184,186,182,190,186,192"
"232,310,282,289,354,292,292,293,306"
...

I have verified also that these fields are typed as character when they are imported from the API and stored in the tibble. I even checked this during debug just before write_csv() gets called.

However, when I write this data to CSV for storage and then read it back in later, the output of read_csv() has these fields as if they were re-typed as double:

metrics.HeaderReadTimes
"1.34202209222205e 26"
"4.17947405424481e 26"
...

I used mutate() to type these fields as.character() on read, but that doesn't seem to fix the issue, it just gives me a double that has beeen coerced into a character.

I'm beginning to think that the best solution is to change the delimiter in those fields before I call write_csv(), but I'm unsure how to do this in an efficient manner. It's probably something stupidly obvious, and I'm going to keep researching, but I figured it wouldn't hurt to ask...

CodePudding user response:

csv-files does not store any information about the column type, why you'd want to specify the column type in readr (or alternatively save the data as .Rdata or .RDS).

read_csv("filename.csv",
         col_types = cols(metrics.HeaderReadTimes = col_character()))

CodePudding user response:

An alternative that is a little more agnostic to column names.

The issue is that the locale is inferring the commas as "grouping marks", i.e., thousands indicators. We can change that with readr::locale.

Failing:

readr::read_csv(I('a,b\n"1,2",3'))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
#       a     b
#   <dbl> <dbl>
# 1    12     3

Working as intended:

readr::read_csv(I('a,b\n"1,2",3'), locale = readr::locale(grouping_mark = ""))
# Rows: 1 Columns: 2
# -- Column specification -----------------------------------------------------------------------------------------------------------
# Delimiter: ","
# chr (1): a
# dbl (1): b
# i Use `spec()` to retrieve the full column specification for this data.
# i Specify the column types or set `show_col_types = FALSE` to quiet this message.
# # A tibble: 1 x 2
#   a         b
#   <chr> <dbl>
# 1 1,2       3
  • Related