I have a data frame that looks like this
df=tibble(col1 = c("2397_A_merge",
"3779_A_merge",
"4958_BV"))
> df
# A tibble: 3 × 1
col1
<chr>
1 2397_A_merge
2 3779_A_merge
3 4958_BV
and I want to split it in a way that looks like this
col1 col2
2397_A merge
3779_A merge
4958_BV NA
Any help or recommendation is appreciated
CodePudding user response:
We may use separate
- the pattern is not clear. If it is a fixed word 'merge', then use a regex lookaround to split at _
that precedes the 'merge'
library(tidyr)
separate(df, col1, into = c('col1', 'col2'), sep = "_(?=merge)", fill = "right")
-output
# A tibble: 3 × 2
col1 col2
<chr> <chr>
1 2397_A merge
2 3779_A merge
3 4958_BV <NA>
Or use extract
to capture substrings as a group - (...)
- the first group capture one or more digits (\\d
) from the start (^
) of the string followed by a _
and one or more upper case letters ([A-Z]
), then followed by a _
(if present) and capture the rest of characters (.*
) in second group if present
extract(df, col1, into = c("col1", "col2"), "^(\\d _[A-Z] )_?(.*)?")
# A tibble: 3 × 2
col1 col2
<chr> <chr>
1 2397_A "merge"
2 3779_A "merge"
3 4958_BV ""
Or with base R
, create a unique delimiter by replacing the _
before the lower case letters with ,
and then use read.csv
to read the column into two columns splitting at the ,
read.csv(text = sub("_([a-z] )", ",\\1", df$col1),
header = FALSE, col.names = c("col1", "col2"), na.strings = "")
col1 col2
1 2397_A merge
2 3779_A merge
3 4958_BV <NA>