Home > database >  How to read file with irregularly nested quotations?
How to read file with irregularly nested quotations?

Time:10-25

I have a file with irregular quotes like the following:

"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"

It seems irregular because the first column is only wrapped in single quotes, whereas every subsequent column is wrapped in double quotes. I'd like to read it so that every column is imported without quotes (neither in the header, nor the data).

What I've tried is the following:

# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")

And I also tried the suggestions offered here:

# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])

# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))

But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.

CodePudding user response:

I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.

library(data.table)
library(magittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)), 
    .SDcols = string_cols]

str(DT0)

Classes ‘data.table’ and 'data.frame':  1 obs. of  16 variables:
 $ INDICATOR       : chr "Spi-3"
 $ CTY_CODE        : chr "AFG"
 $ MGN_CODE        : chr ""
 $ EVENT_NR        : int 1
 $ EVENT_NR_CR     : int 1
 $ START_DATE      : chr "1952-10-01"
 $ PEAK_DATE       : chr "1952-11-01"
 $ END_DATE        : chr "1953-06-01"
 $ MAX_EXT_ON      : chr "1952-11-01"
 $ DURATION        : int 9
 $ SEVERITY        : num 6.98
 $ INTENSITY       : num 0.78
 $ AVERAGE_AREA    : num 19.8
 $ WIDEST_AREA_PERC: num 44.1
 $ SCORE           : int 5
 $ GRP_ID          : chr "1"
 - attr(*, ".internal.selfref")=<externalptr> 
  •  Tags:  
  • r csv
  • Related