Read .txt file with columns separated by quotation marks with rows over several lines in R

Time:07-11

I am having problems reading a .txt file into R. The result should contain one id column and one text column. The structure of the file is as follows:

"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"

There are also many entries in which the text is split over several lines, like this:

"4," "some
      more
      text"

However, I want to prevent R from making three rows out of the fourth text, so basically I want everything inside the quotation marks to be in one row with the respective id. I tried read.table("mytext.txt", header = TRUE) but I don't get the desired result.

CodePudding user response:

Try this hack, starting with a text file:

"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
      more
      text"

The code:

quux <- readLines("quux.txt")
quux2 <- stringr::str_extract_all(paste(paste0(quux, ifelse(cumsum(nchar(gsub('[^"]', '', quux))) %% 2 == 0, ",", "\n")), collapse = ""), '"[^"]*"')[[1]]
quux2
#  [1] "\"\""                             "\"textOriginal\""                 "\"1,\""                          
#  [4] "\"some text\""                    "\"2,\""                           "\"some text\""                   
#  [7] "\"3,\""                           "\"some text\""                    "\"4,\""                          
# [10] "\"some\n      more\n      text\""

quux3 <- data.frame(matrix(quux2, ncol = 2, byrow = TRUE))
quux3
#     X1                             X2
# 1   ""                 "textOriginal"
# 2 "1,"                    "some text"
# 3 "2,"                    "some text"
# 4 "3,"                    "some text"
# 5 "4," "some\n      more\n      text"

The overall goal is to:

  1. read this in as text;
  2. add commas to lines that are quote-complete and add newlines (\n) to lines that are not;
  3. concatenate all of that into one continuous string;
  4. extract all sequences of the literal ", zero or more non-", then another ";
  5. use a matrix formation to construct the data.frame. (Rename as desired.)

Walk-through:

  • nchar(gsub(..)) counts the number of quotation marks on each line, and cumsum(..) makes that count cumulative so that, for instance, the "more" line is counted as still being inside an open quote;
  • ifelse(..) based on whether cumsum is odd (incomplete quotes) or even (quote-complete), append a newline or a comma;
  • paste0 (append) this to the original lines;
  • paste (concatenate) all things into one string;
  • then extract all substrings that consist of a double-quote, some (or no) non-quote characters, then another double-quote.
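
The walk-through above can be made concrete by splitting the one-liner into named steps. This is only a sketch and not part of the original answer: the intermediate variable names are made up for illustration, the sample lines are typed in directly instead of being read from quux.txt, and base R's regmatches/gregexpr stand in for stringr::str_extract_all.

```r
# Sample input, one element per line (what readLines("quux.txt") would return).
quux <- c(
  '"", "textOriginal"',
  '"1," "some text"',
  '"2," "some text"',
  '"3," "some text"',
  '"4," "some',
  '      more',
  '      text"'
)

# Number of double quotes on each line.
quotes_per_line <- nchar(gsub('[^"]', '', quux))

# Running total: an even total means every quote opened so far has been closed.
running_total <- cumsum(quotes_per_line)

# Quote-complete lines get a comma; lines inside an open quote get a newline.
suffix <- ifelse(running_total %% 2 == 0, ",", "\n")

# Append the suffixes and glue everything into one string.
one_string <- paste(paste0(quux, suffix), collapse = "")

# Extract every quoted field (base R stand-in for str_extract_all).
fields <- regmatches(one_string, gregexpr('"[^"]*"', one_string, perl = TRUE))[[1]]

# Two fields per record, filled row by row.
result <- data.frame(matrix(fields, ncol = 2, byrow = TRUE))
```

Printing quotes_per_line and running_total side by side makes it easy to see why the "more" line, with zero quotes of its own, still counts as incomplete.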

You can clean it up if you want with:

# quux3 <- ...above...
quux3[] <- lapply(quux3, gsub, pattern = '^"|\"$', replacement = '')
quux3
#   X1                           X2
# 1                    textOriginal
# 2 1,                    some text
# 3 2,                    some text
# 4 3,                    some text
# 5 4, some\n      more\n      text
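
If the end goal is a numeric id column and a single-line text column, the cleaned data frame above could be tidied a little further. The following is a hedged sketch rather than part of the original answer: it rebuilds the cleaned data frame by hand so that it runs on its own, and the column name "id" and the variable name "result" are assumptions chosen for illustration.

```r
# Stand-in for the cleaned data frame shown above (quotes already stripped).
quux3 <- data.frame(
  X1 = c("", "1,", "2,", "3,", "4,"),
  X2 = c("textOriginal", "some text", "some text", "some text",
         "some\n      more\n      text"),
  stringsAsFactors = FALSE
)

# The first row is the header that was read in as data, so drop it.
result <- quux3[-1, ]
names(result) <- c("id", "textOriginal")  # "id" is an assumed name
rownames(result) <- NULL

# Remove the trailing comma from each id and convert to integer.
result$id <- as.integer(sub(",$", "", result$id))

# Optional: collapse each run of whitespace (including newlines) to one space.
result$textOriginal <- gsub("\\s+", " ", result$textOriginal)
```

Skip the last step if the original line breaks inside the text should be preserved.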

CodePudding user response:

To read this text file (saved as test.txt):

"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
      more
      text"

We can use read_delim from the readr package with " " set as the delimiter:

library(dplyr)
library(stringr)
library(readr)

text_file <- readr::read_delim("test.txt", delim = " ") %>% 
  rename("id" = `,`, "textOriginal" = "textOriginal")
text_file

# A tibble: 4 × 2
#      id textOriginal
#   <dbl> <chr>
# 1     1 "some text"
# 2     2 "some text"
# 3     3 "some text"
# 4     4 "some\r\n      more\r\n      text"

Then some string manipulation gives a cleaner output:

text_file %>% 
  mutate(
    textOriginal = str_remove_all(textOriginal, pattern = "[\\r\\n]") %>% 
      str_squish()
  )

# A tibble: 4 × 2
#      id textOriginal
#   <dbl> <chr>
# 1     1 some text
# 2     2 some text
# 3     3 some text
# 4     4 some more text

CodePudding user response:

An attempt using read.csv and strsplit:

t <- read.csv(yourfile, header = TRUE)
strsplit(t[,1],"," )

[[1]]
[1] "1"          " some text"

[[2]]
[1] "2"          " some text"

[[3]]
[1] "3"          " some text"

[[4]]
[1] "4"                             " some\n      more\n      text"
