I am having problems reading a .txt file into R. In the end it should contain one id column and one text column. The structure of the file is as follows:
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
There are also a lot of entries in which the text is split over several lines, like this:
"4," "some
more
text"
However, I want to avoid R turning the fourth text into three rows; basically, everything inside the quotation marks should end up in one row with the respective id. I tried read.table("mytext.txt", header = TRUE) but I don't get the desired result.
CodePudding user response:
Try this hack, starting with a text file:
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
more
text"
The code:
quux <- readLines("quux.txt")
quux2 <- stringr::str_extract_all(
  paste(
    paste0(quux, ifelse(cumsum(nchar(gsub('[^"]', '', quux))) %% 2 == 0, ",", "\n")),
    collapse = ""
  ),
  '"[^"]*"'
)[[1]]
quux2
# [1] "\"\"" "\"textOriginal\"" "\"1,\""
# [4] "\"some text\"" "\"2,\"" "\"some text\""
# [7] "\"3,\"" "\"some text\"" "\"4,\""
# [10] "\"some\n more\n text\""
data.frame(matrix(quux2, ncol = 2, byrow = TRUE))
# X1 X2
# 1 "" "textOriginal"
# 2 "1," "some text"
# 3 "2," "some text"
# 4 "3," "some text"
# 5 "4," "some\n more\n text"
The overall goal is to:
- read this in as text;
- append a comma to lines that are quote-complete and a newline (\n) to lines that are not;
- concatenate all of that into one long string;
- extract all sequences of a literal ", zero or more non-" characters, then another ";
- use a matrix to construct the data.frame. (Rename as desired.)
Walk-through:
- gsub('[^"]', '', quux) keeps only the quotation marks on each line and nchar(..) counts them; cumsum(..) makes that count cumulative, so that (for instance) the "more" line is still counted as being inside an unfinished quote.
- ifelse(..): based on whether that cumulative count is odd (incomplete quotes) or even (quote-complete), choose a newline or a comma.
- paste0(..) appends that separator to each original line.
- paste(.., collapse = "") concatenates everything into one string.
- finally, extract all stretches of string that have a double-quote, some (or none) non-quotes, then another double-quote (see the step-by-step breakdown below).
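The same logic, broken into named intermediate steps (a sketch only; the object names quote_count, sep, and one_string are mine, not part of the original one-liner):
quux <- readLines("quux.txt")
# cumulative count of quotation marks seen so far, line by line
quote_count <- cumsum(nchar(gsub('[^"]', '', quux)))
# quote-complete lines get a trailing comma, unfinished ones keep a newline
sep <- ifelse(quote_count %% 2 == 0, ",", "\n")
# glue everything into one long string, then pull out each quoted chunk
one_string <- paste(paste0(quux, sep), collapse = "")
quux2 <- stringr::str_extract_all(one_string, '"[^"]*"')[[1]]
quux3 <- data.frame(matrix(quux2, ncol = 2, byrow = TRUE))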
You can clean it up if you want with:
# quux3 <- ...above...
quux3[] <- lapply(quux3, gsub, pattern = '^"|"$', replacement = '')
quux3
# X1 X2
# 1 textOriginal
# 2 1, some text
# 3 2, some text
# 4 3, some text
# 5 4, some\n more\n text
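If you also want the first row promoted to column names and a numeric id, something like this should work (a sketch continuing from quux3 above; the column name id is my own choice):
# use the header row for column names, then drop it from the data
names(quux3) <- c("id", quux3[1, 2])
quux3 <- quux3[-1, ]
rownames(quux3) <- NULL
# strip the trailing comma and convert the id to numeric
quux3$id <- as.numeric(sub(",$", "", quux3$id))
quux3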
CodePudding user response:
To read this text file (saved as test.txt):
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
more
text"
we can use read_delim from the readr package, with a single space (" ") set as the delimiter:
library(dplyr)
library(stringr)
library(readr)
text_file <- readr::read_delim("test.txt", delim = " ") %>%
  rename(id = `,`)
text_file
# A tibble: 4 × 2
# id textOriginal
# <dbl> <chr>
# 1 1 "some text"
# 2 2 "some text"
# 3 3 "some text"
# 4 4 "some\r\n more\r\n text"
Then some string manipulation gives a clean output:
text_file %>%
mutate(
textOriginal = str_remove_all(textOriginal, pattern = "[\\r\\n]") %>%
str_squish()
)
# A tibble: 4 × 2
# id textOriginal
# <dbl> <chr>
# 1 1 some text
# 2 2 some text
# 3 3 some text
# 4 4 some more text
CodePudding user response:
An attempt using read.csv and strsplit:
dat <- read.csv(yourfile, header = TRUE)
strsplit(dat[, 1], ",")
[[1]]
[1] "1" " some text"
[[2]]
[1] "2" " some text"
[[3]]
[1] "3" " some text"
[[4]]
[1] "4" " some\n more\n text"