Is there a way to limit the number of separators when using read.table?

I have a piece of text data that I want to preprocess, and this data is in the form of:

[num|num|<String>]

However, the <String> itself can contain spaces, commas, and "|", so simply splitting on "|" does not work. How can I limit the number of splits when reading the data, whether with read.table or readLines? I tried flush = TRUE, but that is not usable because it discards part of the <String>.

CodePudding user response:

You probably have something like this.

123|1234|foo, bar | baz
021|3874|foo, bar | baz
123|1234|foo, bar | baz
123|1234|foo, bar | baz

You could split only on a "|" that is preceded by a digit, using the lookbehind (?<=\\d). This relies on the <String> never containing a digit directly before a "|".

readLines('tmp.txt') |>
  strsplit('(?<=\\d)\\|', perl=TRUE)
# [[1]]
# [1] "123"            "1234"           "foo, bar | baz"
# 
# [[2]]
# [1] "021"            "3874"           "foo, bar | baz"
# 
# [[3]]
# [1] "123"            "1234"           "foo, bar | baz"
# 
# [[4]]
# [1] "123"            "1234"           "foo, bar | baz"

Note: R >= 4.1 is used here (required for the native pipe |>).

Data:

write(file='tmp.txt',
'123|1234|foo, bar | baz
021|3874|foo, bar | baz
123|1234|foo, bar | baz
123|1234|foo, bar | baz'
)
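
One caveat worth noting: the lookbehind splits on every "|" that follows a digit, so it would also split inside the <String> if the text contained something like "v2|x". A minimal sketch of an alternative, assuming every line has exactly two leading fields that contain no "|", is to capture those two fields and keep the rest of the line intact. The sample lines and the names pattern, parts, fields and df are made up for illustration:

```r
# Hypothetical sample lines; the second one has a digit right before a "|"
# inside the string part, which the lookbehind approach would split on.
lines <- c("123|1234|foo, bar | baz",
           "021|3874|v2|x, y")

# Group 1 and group 2: fields without "|"; group 3: the rest of the line.
pattern <- "^([^|]*)\\|([^|]*)\\|(.*)$"
parts <- regmatches(lines, regexec(pattern, lines))

# Each element is c(full_match, group1, group2, group3); drop the full match.
fields <- do.call(rbind, lapply(parts, function(m) m[-1]))

df <- data.frame(V1 = fields[, 1], V2 = fields[, 2], V3 = fields[, 3],
                 stringsAsFactors = FALSE)
```

All three columns come back as character here, so V1 and V2 would still need as.integer() if numeric values are wanted.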

CodePudding user response:

Following a similar idea to jay.sf's answer (and borrowing the data from there as well), you can replace each delimiting "|" with a tab and pass the result to read.table:

read.table(
  text = gsub(
    "(?<=\\d)\\|", "\t",
    readLines("tmp.txt"),
    perl = TRUE
  ),
  sep = "\t"
)

which gives

   V1   V2             V3
1 123 1234 foo, bar | baz
2  21 3874 foo, bar | baz
3 123 1234 foo, bar | baz
4 123 1234 foo, bar | baz
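
In the output above, read.table converts V1 to integer, so 021 becomes 21. If the leading zeros matter, a possible variant is sketched below; it assumes the first two fields should stay as text and that quote and "#" characters in the string should be kept literally. The inline lines vector is a stand-in for readLines("tmp.txt"):

```r
# Stand-in for readLines("tmp.txt"), using two of the sample rows.
lines <- c("123|1234|foo, bar | baz",
           "021|3874|foo, bar | baz")

df <- read.table(
  text = gsub("(?<=\\d)\\|", "\t", lines, perl = TRUE),
  sep = "\t",
  colClasses = "character",  # keep "021" as text instead of the integer 21
  quote = "",                # treat quote characters in the string literally
  comment.char = ""          # do not truncate the string at "#"
)
```

Individual columns can be converted afterwards, for example with as.integer(df$V2), when numeric values are needed.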