I am having problems reading a .txt file into R. In the end it should contain one id column and one text column. The structure of the file is as follows:
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
There are also a lot of entries in which the text is split over several lines, like this:
"4," "some
more
text"
However, I want to avoid R turning the fourth text into three rows; basically, everything inside the quotation marks should end up in one row with the respective id. I tried read.table("mytext.txt", header = TRUE) but I don't get the desired result.
CodePudding user response:
Try this hack, starting with a text file:
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
more
text"
The code:
quux <- readLines("quux.txt")
quux2 <- stringr::str_extract_all(
  paste(
    paste0(quux, ifelse(cumsum(nchar(gsub('[^"]', '', quux))) %% 2 == 0, ",", "\n")),
    collapse = ""
  ),
  '"[^"]*"'
)[[1]]
quux2
# [1] "\"\"" "\"textOriginal\"" "\"1,\""
# [4] "\"some text\"" "\"2,\"" "\"some text\""
# [7] "\"3,\"" "\"some text\"" "\"4,\""
# [10] "\"some\n more\n text\""
data.frame(matrix(quux2, ncol = 2, byrow = TRUE))
# X1 X2
# 1 "" "textOriginal"
# 2 "1," "some text"
# 3 "2," "some text"
# 4 "3," "some text"
# 5 "4," "some\n more\n text"
The overall goal is to:
- read this in as text;
- append a comma to lines that are quote-complete and a newline (\n) to lines that are not;
- concatenate all of that into one long string;
- extract all sequences of a literal ", zero or more non-" characters, then another ";
- use a matrix to construct the data.frame. (Rename as desired.)
Walk-through:
- gsub('[^"]', '', quux) keeps only the quotation marks on each line and nchar(..) counts them; cumsum(..) makes that count cumulative, so that (for instance) the "more" line is still counted as being inside an unfinished quote.
- ifelse(..): based on whether that cumulative count is odd (incomplete quotes) or even (quote-complete), choose a newline or a comma.
- paste0(..) appends that separator to each original line.
- paste(.., collapse = "") concatenates everything into one string.
- finally, extract all stretches of string that have a double-quote, some (or none) non-quotes, then another double-quote (see the step-by-step breakdown below).
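The same logic, broken into named intermediate steps (a sketch only; the object names quote_count, sep, and one_string are mine, not part of the original one-liner):
quux <- readLines("quux.txt")
# cumulative count of quotation marks seen so far, line by line
quote_count <- cumsum(nchar(gsub('[^"]', '', quux)))
# quote-complete lines get a trailing comma, unfinished ones keep a newline
sep <- ifelse(quote_count %% 2 == 0, ",", "\n")
# glue everything into one long string, then pull out each quoted chunk
one_string <- paste(paste0(quux, sep), collapse = "")
quux2 <- stringr::str_extract_all(one_string, '"[^"]*"')[[1]]
quux3 <- data.frame(matrix(quux2, ncol = 2, byrow = TRUE))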
You can clean it up if you want with:
# quux3 <- ...above...
quux3[] <- lapply(quux3, gsub, pattern = '^"|"$', replacement = '')
quux3
# X1 X2
# 1 textOriginal
# 2 1, some text
# 3 2, some text
# 4 3, some text
# 5 4, some\n more\n text
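If you also want the first row promoted to column names and a numeric id, something like this should work (a sketch continuing from quux3 above; the column name id is my own choice):
# use the header row for column names, then drop it from the data
names(quux3) <- c("id", quux3[1, 2])
quux3 <- quux3[-1, ]
rownames(quux3) <- NULL
# strip the trailing comma and convert the id to numeric
quux3$id <- as.numeric(sub(",$", "", quux3$id))
quux3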
CodePudding user response:
To read this text file (saved as test.txt):
"", "textOriginal"
"1," "some text"
"2," "some text"
"3," "some text"
"4," "some
more
text"
we can use read_delim from the readr package, with a single space (" ") set as the delimiter:
library(dplyr)
library(stringr)
library(readr)
text_file <- readr::read_delim("test.txt", delim = " ") %>%
  rename(id = `,`)
text_file
# A tibble: 4 × 2
# id textOriginal
# <dbl> <chr>
# 1 1 "some text"
# 2 2 "some text"
# 3 3 "some text"
# 4 4 "some\r\n more\r\n text"
Then some string manipulation gives a clean output:
text_file %>%
mutate(
textOriginal = str_remove_all(textOriginal, pattern = "[\\r\\n]") %>%
str_squish()
)
# A tibble: 4 × 2
# id textOriginal
# <dbl> <chr>
# 1 1 some text
# 2 2 some text
# 3 3 some text
# 4 4 some more text
CodePudding user response:
An attempt using read.csv and strsplit:
dat <- read.csv(yourfile, header = TRUE)
strsplit(dat[, 1], ",")
[[1]]
[1] "1" " some text"
[[2]]
[1] "2" " some text"
[[3]]
[1] "3" " some text"
[[4]]
[1] "4" " some\n more\n text"