Specify unique newline character (for upload to R)

Time:07-26

I have a pipe-delimited file with several embedded '\n' characters per row. Each row ends with a unique pattern, and I would like to treat that pattern as the real '\n' (ignoring the embedded ones) before importing into R.

For example, a sample text document might look like:

COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n

I would ideally like the above to be loaded into R as a dataframe with two rows as follows:

COL1 COL2 COL3 COL4
ID1 num1 num2 text text text
ID2 num3 num4 text2 tex2 text

Unless [uniquepattern] is treated as the row terminator, R reads each record as several rows. My initial solution was to preprocess the file with a shell script, something like:

tr '\n' ' ' < original_file.txt > temp_file.txt

tr '[uniquepattern]' '\n' < temp_file.txt > final_file.txt

However, this doesn't seem to work. Many thanks for any suggestions!
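A likely reason the two tr commands fail: tr treats its arguments as sets of single characters, not as strings, so '[uniquepattern]' is read as the individual characters [, u, n, i, q and so on rather than as one marker. The first command also folds the header line into the first record, because the header has no marker of its own. Below is a minimal awk sketch of the same idea, assuming the embedded '\n' are real newlines and the marker is the literal text [uniquepattern]. The file names sample_original.txt and sample_final.txt are hypothetical and exist only for this demo.

```shell
# Demo input mirroring the sample in the question (hypothetical file name).
cat > sample_original.txt <<'EOF'
COL1|COL2|COL3|COL4
ID1|num1|num2|text
 text
 text[uniquepattern]
ID2|num3|num4|text2
 tex2
 text[uniquepattern]
EOF

awk -v marker='[uniquepattern]' '
  NR == 1 { print; next }                  # header has no marker: pass it through
  {
    sub(/^[ \t]+/, "")                     # drop leading blanks on continuation lines
    buf = (buf == "" ? $0 : buf " " $0)    # an embedded newline becomes a single space
    # index() looks for the marker as a literal string, not a regex or character set
    while ((i = index(buf, marker)) > 0) {
      print substr(buf, 1, i - 1)          # emit one complete record
      buf = substr(buf, i + length(marker))
      sub(/^[ \t]+/, "", buf)              # ignore blanks left after the marker
    }
  }
  END { if (buf != "") print buf }         # flush a trailing record with no marker
' sample_original.txt > sample_final.txt

cat sample_final.txt
```

The output file has one line per record, so it should then be readable in R with something like read.delim("sample_final.txt", sep = "|"). Joining with a space follows the original tr '\n' ' ' attempt; change the " " in the buf assignment if a different separator is wanted.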

Answer:

I think this is what you want:

library(tidyverse)

file1 <- read_lines("COL1|COL2|COL3|COL4
ID1|num1|num2|text\n text\n text[uniquepattern]\n
ID2|num3|num4|text2\n tex2\n text[uniquepattern]\n")

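# Join every line after the header into one string, then split it into
# records on the literal marker (brackets are escaped because this is a regex).
unsplit_df <- paste(file1[-1], collapse = "") %>%
  str_split("\\[uniquepattern\\]") %>%
  unlist() %>%
  as_tibble_col(file1[1]) %>%                # header line becomes the temporary column name
  filter(str_detect(.[[1]], "[:alnum:]"))    # drop the empty piece after the last marker

# Split the single column into COL1..COL4 on the pipe delimiter.
separate(unsplit_df, col = 1, into = unlist(str_split(colnames(unsplit_df), "\\|")), sep = "\\|")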

# # A tibble: 2 × 4
#   COL1  COL2  COL3  COL4           
#   <chr> <chr> <chr> <chr>          
# 1 ID1   num1  num2  text text text 
# 2 ID2   num3  num4  text2 tex2 text     