I have some text data that I want to preprocess. Each line has the form:
[num|num|<String>]
However, the <String> itself can contain spaces, commas, and "|", so it is not possible to separate the fields simply by splitting on "|".
How can I limit the number of splits per line when reading the data in, for example with readLines?
I tried using flush = TRUE, but it can't be used because it drops part of the <String>.
CodePudding user response:
You probably have something like this.
123|1234|foo, bar | baz
021|3874|foo, bar | baz
123|1234|foo, bar | baz
123|1234|foo, bar | baz
You could use a lookbehind, (?<=\\d), so that the split only happens where there's a digit before the "|".
readLines('tmp.txt') |>
  strsplit('(?<=\\d)\\|', perl=TRUE)
# [[1]]
# [1] "123" "1234" "foo, bar | baz"
#
# [[2]]
# [1] "021" "3874" "foo, bar | baz"
#
# [[3]]
# [1] "123" "1234" "foo, bar | baz"
#
# [[4]]
# [1] "123" "1234" "foo, bar | baz"
Note: the native pipe |> requires R >= 4.1.
Data:
write(file='tmp.txt',
'123|1234|foo, bar | baz
021|3874|foo, bar | baz
123|1234|foo, bar | baz
123|1234|foo, bar | baz'
)
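One caveat with the lookbehind: if the <String> part can itself contain a digit directly followed by "|", that line gets split into more than three pieces. A minimal sketch of an alternative, assuming the first two fields never contain "|", is to capture them explicitly and keep everything after the second "|" as one field. The sample vector x and the names m and fields are made up for illustration.

```r
# Hypothetical sample lines; the second one has a digit right before a "|"
# inside the string part ("v2|final"), which the lookbehind would also split on.
x <- c("123|1234|foo, bar | baz",
       "021|3874|v2|final, draft")

# Capture two leading fields that contain no "|", then the rest of the line.
m <- regmatches(x, regexec("^([^|]*)\\|([^|]*)\\|(.*)$", x))

# Each element of m is c(full match, field 1, field 2, field 3);
# drop the full match and stack the fields into a character matrix.
fields <- do.call(rbind, lapply(m, `[`, -1))
fields
```

This effectively limits the split to the first two "|" on each line, whatever the third field contains. Lines that don't match the pattern come back as empty matches, so they would need to be checked separately.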
CodePudding user response:
Following a similar idea to jay.sf's answer (and borrowing the data from there as well), you can try
read.table(
  text = gsub(
    "(?<=\\d)\\|", "\t",
    readLines("tmp.txt"),
    perl = TRUE
  ),
  sep = "\t"
)
which gives
   V1   V2             V3
1 123 1234 foo, bar | baz
2  21 3874 foo, bar | baz
3 123 1234 foo, bar | baz
4 123 1234 foo, bar | baz
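Note that the second row shows 21 rather than 021, because read.table guesses that the first column is an integer. If the leading zeros matter, a possible variation (a sketch, not tested on the real data) is to read every column as character. The inline vector lines stands in for readLines("tmp.txt"), and quote = "" and comment.char = "" are added on the assumption that quotes or "#" inside the string should be kept literally.

```r
# Stand-in for readLines("tmp.txt"), so the example is self-contained.
lines <- c("123|1234|foo, bar | baz",
           "021|3874|foo, bar | baz")

df <- read.table(
  text = gsub("(?<=\\d)\\|", "\t", lines, perl = TRUE),
  sep = "\t",
  colClasses = "character",  # keep "021" as text instead of the integer 21
  quote = "",                # treat quote characters in the string literally
  comment.char = ""          # do not cut the line at "#"
)
df
```

The numeric columns can be converted afterwards with as.integer() where that is wanted.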