I have a pipe-delimited file in which each record contains several embedded '\n' characters. Each record ends with a unique pattern, and I would like to treat that pattern as the record's '\n' before importing the file into R.
For example, a sample text document might look like:
COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n
I would ideally like the above to be loaded into R as a dataframe with two rows as follows:
COL1 | COL2 | COL3 | COL4 |
---|---|---|---|
ID1 | num1 | num2 | text text text |
ID2 | num3 | num4 | text2 tex2 text |
Unless [uniquepattern] is treated as the record terminator, R reads each record as several rows. My first attempt was to preprocess the file with shell commands, something like:
tr '\n' ' ' < original_file.txt > temp_file.txt
tr '[uniquepattern]' '\n' < temp_file.txt > final_file.txt
However, this doesn't seem to work. Many thanks for any suggestions!
Answer:
I think this is what you want:
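library(tidyverse)
# The sample is passed inline as literal text; for a real file, pass its path instead
file1 <- read_lines("COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n")
# Glue every line after the header together, then split on the literal marker
unsplit_df <- paste(file1[2:length(file1)], collapse = "") %>%
  str_split("\\[uniquepattern\\]") %>%
  unlist() %>%
  as_tibble_col(file1[1]) %>%              # one column, named after the header line
  filter(str_detect(.[[1]], "[:alnum:]"))  # drop the empty piece after the last marker
# Split the single column on "|", taking the column names from the header
separate(unsplit_df, col = 1, into = unlist(str_split(colnames(unsplit_df), "\\|")), sep = "\\|")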
# # A tibble: 2 × 4
# COL1 COL2 COL3 COL4
# <chr> <chr> <chr> <chr>
# 1 ID1 num1 num2 text text text
# 2 ID2 num3 num4 text2 tex2 text
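
As for why the shell attempt fails: tr translates single characters, not strings, so '[uniquepattern]' is read as a set of characters rather than as one marker. The first tr call also joins the header line to the first record. If you would rather preprocess in the shell, here is a minimal sketch using awk instead. It assumes the marker always sits at the very end of a physical line (as in your sample) and that the first line is the header. The file names are the ones from your question, and the here-document only recreates your sample so the snippet runs on its own.

```shell
#!/bin/sh
# Recreate the sample from the question, with real newlines inside each record.
cat > original_file.txt <<'EOF'
COL1|COL2|COL3|COL4
ID1|num1|num2|text
 text
 text[uniquepattern]
ID2|num3|num4|text2
 tex2
 text[uniquepattern]
EOF

# Assumption: the marker is the last thing on a physical line.
awk -v pat='[uniquepattern]' '
  NR == 1 { print; next }                 # pass the header through unchanged
  {
    rec = rec $0                          # append this physical line to the current record
    keep = length(rec) - length(pat)      # characters that precede a trailing marker
    if (keep >= 0 && substr(rec, keep + 1) == pat) {
      print substr(rec, 1, keep)          # emit the record without the marker
      rec = ""
    }
  }
  END { if (rec != "") print rec }        # flush a last record that has no marker
' original_file.txt > final_file.txt

cat final_file.txt
```

The marker is compared as a plain string with substr(), so the square brackets need no regex escaping. After that, something like read.delim("final_file.txt", sep = "|") should load the file as two rows.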