I have data from a logger that inserts timestamps as rows within the comma-separated data. I've sorted out a way to wrangle those timestamps into a tidy data frame (thanks to the responses to this question).
The issue I'm having now is that the timestamp lines don't have the same number of comma-separated values as the data rows (3 vs 6), and readr is defaulting to reading in only 3 columns, despite me manually specifying column types and names for 6. Last summer (when I last used the logger) readr read the data in correctly, but to my dismay the current version (2.1.1) throws a warning and lumps columns 3:6 all together. I'm hoping there's some option for "correcting" back to the old behaviour, or some work-around I haven't thought of (editing the logger files is not an option).
Example code:
library(tidyverse)
# example data
txt1 <- "
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# example without timestamp header
txt2 <- "
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
"
# throws warning and reads 3 columns
read_csv(
  txt1,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)

# works correctly
read_csv(
  txt2,
  col_names = c("lon", "lat", "n", "red", "nir", "NDVI"),
  col_types = "ddcddc"
)
# this is the table that older readr versions would create
# and that I'm hoping to get back to
tribble(
  ~lon, ~lat, ~n,                   ~red, ~nir, ~NDVI,
  NA,   NA,   "Logger Start 12:34", NA,   NA,   NA,
  -112, 53,   "N=1",                9,    15,   ".25",
  -112, 53,   "N=2",                12,   17,   ".17"
)
CodePudding user response:
Use the base read.csv(), then convert to a tibble if need be:
read.csv(text = txt1, header = FALSE,
         col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))
lon lat n red nir NDVI
1 NA NA Logger Start 12:34 NA NA NA
2 -112 53 N=1 9 15 0.25
3 -112 53 N=2 12 17 0.17
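If you specifically want a tibble rather than a base data frame, a minimal sketch of that conversion (assuming the tibble package is available, which it is if you've loaded the tidyverse):
library(tibble)

df <- read.csv(text = txt1, header = FALSE,
               col.names = c("lon", "lat", "n", "red", "nir", "NDVI"))

# as_tibble() only changes the class of the object;
# the columns stay as read.csv parsed them
as_tibble(df)
If you also want readr to re-guess the types of the character columns afterwards, readr::type_convert() could be applied to the result.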
CodePudding user response:
I think I would use read_lines and write_lines to convert the "bad CSV" into "good CSV", and then read in the converted data.
Assuming you have a file test.csv like this:
,,Logger Start 12:34
-112,53,N=1,9,15,.25
-112,53,N=2,12,17,.17
Try something like this:
library(readr)
library(dplyr)
library(tidyr)

read_lines("test.csv") %>%
  # assumes all timestamp lines are the same format
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  # assumes that NDVI (the last column) is always present and ends with a digit;
  # you'll need to alter the regex if that's not the case
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE) %>%
  write_lines("test_out.csv")
test_out.csv now looks like this:
12:34,,,,,,
,-112,53,N=1,9,15,.25
,-112,53,N=2,12,17,.17
So we now have 7 columns, the first of which is the timestamp.
This code reads the new file, fills in the missing timestamp values, and removes rows where n is NA. You may not want to do that; I've assumed that n is only missing because of the original row with the timestamp.
mydata <- read_csv("test_out.csv",
                   col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")) %>%
  fill(ts) %>%
  filter(!is.na(n))
The final mydata:
# A tibble: 2 x 7
ts lon lat n red nir NDVI
<time> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 12:34 -112 53 N=1 9 15 0.25
2 12:34 -112 53 N=2 12 17 0.17
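If you'd rather not write an intermediate file, the same idea should work entirely in memory: do the gsub() cleanup on the vector of lines, then hand the result to read_csv() as literal data wrapped in I() (supported in readr >= 2.0). A sketch along those lines, reusing the regexes from above:
library(readr)
library(dplyr)
library(tidyr)

fixed <- read_lines("test.csv") %>%
  # same assumptions about the timestamp and NDVI formats as above
  gsub(",,Logger Start (.*?)$", "\\1,,,,,,", ., perl = TRUE) %>%
  gsub("^(.*?\\d)$", ",\\1", ., perl = TRUE)

# collapse back to a single string and read it as literal data
mydata <- read_csv(
  I(paste(fixed, collapse = "\n")),
  col_names = c("ts", "lon", "lat", "n", "red", "nir", "NDVI")
) %>%
  fill(ts) %>%
  filter(!is.na(n))
That should give the same tibble as the file-based version.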