Specify number of columns to read when first row is missing values-CodePudding

I have data from a logger that inserts timestamps as rows within the comma separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).

The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr is defaulting to reading only in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping that there's some option for "correcting" back to the old behaviour, or some work-around solution I haven't thought of (editing the logger files is not an option).

Example code:

library(tidyverse)

# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"

# throws warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n, ~red, ~nir, ~NDVI,
    NA,   NA, "Logger Start 12:34", NA, NA, NA,
  -112,   53, "N=1", 9, 15, ".25",
  -112,   53, "N=2",12, 17, ".17"
)

CodePudding user response：

Use the base read.csv then convert to typle if need be:

read.csv(text=txt1, header = FALSE,
     col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
   lon lat                  n red nir NDVI
1   NA  NA Logger Start 12:34  NA  NA   NA
2 -112  53                N=1   9  15 0.25
3 -112  53                N=2  12  17 0.17

CodePudding user response：

I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.

Assuming you have a file test.csv like this:

,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17

Try something like this:

library(dplyr)
library(tidyr)

read_lines("test.csv") %>% 
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (last column) is always present and ends with a digit
  # you'll need to alter the regex if not the case 
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>% 
  write_lines("test_out.csv")

test_out.csv now looks like this:

12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17

So we now have 7 columns, the first is the timestamp.

This code reads the new file, fills in the missing timestamp values and removes rows where n is NA. You may not want to do that, I've assumed that n is only missing because of the original row with the timestamp.

mydata <- read_csv("test_out.csv", 
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>% 
  fill(ts) %>% 
  filter(!is.na(n))

The final mydata:

# A tibble: 2 x 7
  ts       lon   lat n       red   nir  NDVI
  <time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34   -112    53 N=1       9    15  0.25
2 12:34   -112    53 N=2      12    17  0.17