I was hoping to get some help with this problem. I have a column with two types of strings, and I need to split the strings into multiple columns using two different conditions. I can figure out how to split each type individually, but I am struggling to add something like an IF statement to my code. This is the example dataset:
data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
For the first type of string (with a single _), I would like to split at the _. So I used the following code for that:
strsplit(data$string, "_")
For strings that have .docx in them, I would like to split after the docx. I cannot split on "_" because it appears multiple times in this string. So I used the following code:
strsplit(data$string, "x_")
My question is: both types of strings appear in the same column. Is there a way to tell R that if "docx" is in the string it should split after x_, and otherwise split on the _?
Any help would be appreciated - Thank you guys!
CodePudding user response:
Here's a tidyr solution:
library(tidyr)
data %>%
  extract(string,
          into = c("1", "2"),        # choose your own column labels
          regex = "(.*?)_([^_]+)$")  # split at the last underscore
                                1        2
1                    HFUFN-087836      661
2 207465-125 - IK_6 Mar 2009.docx 37484956
How the regex works:
The regex partitions the strings into two "capture groups" plus an underscore in-between:
(.*?) : the first capture group, matching any character (.) zero or more times (*) non-greedily (?)
_ : a literal underscore
([^_]+)$ : the second capture group, matching any character that is not an underscore ([^_]) one or more times (+), up to the end of the string ($)
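To see the two capture groups in isolation, the same pattern can be tried with base R alone. This is only a sketch for checking the regex; the names strings, pattern, parts, "before" and "after" are made up for illustration and are not part of the tidyr solution.

```r
# Same pattern as in the extract() call, checked with base R only.
strings <- c("HFUFN-087836_661",
             "207465-125 - IK_6 Mar 2009.docx_37484956")
pattern <- "(.*?)_([^_]+)$"

# regexec() locates the full match and each capture group;
# regmatches() pulls out the matched substrings.
m <- regmatches(strings, regexec(pattern, strings, perl = TRUE))

# In each result, element 1 is the full match and elements 2 and 3
# are the first and second capture groups.
parts <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(parts) <- c("before", "after")
print(parts)
```

Because the second group cannot contain an underscore and is anchored to the end of the string, the split always lands on the last underscore, which is why the earlier underscore in "IK_6" is left alone.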
Data:
data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
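If you would rather keep the explicit "if docx, else" logic from the question, something along these lines might work in base R. It is a rough sketch that assumes every string contains at least one underscore and that "docx_" marks the split point whenever "docx" is present; has_docx, before, after and result are illustrative names.

```r
data <- data.frame(string = c("HFUFN-087836_661",
                              "207465-125 - IK_6 Mar 2009.docx_37484956"),
                   stringsAsFactors = FALSE)

# TRUE for rows whose string contains "docx"
has_docx <- grepl("docx", data$string, fixed = TRUE)

# Part before the split: keep "docx" for docx rows,
# otherwise keep everything before the first underscore.
before <- ifelse(has_docx,
                 sub("(docx)_.*$", "\\1", data$string),
                 sub("_.*$", "", data$string))

# Part after the split: drop everything up to "docx_" for docx rows,
# otherwise drop everything up to the first underscore.
after <- ifelse(has_docx,
                sub("^.*docx_", "", data$string),
                sub("^[^_]*_", "", data$string))

result <- data.frame(before = before, after = after,
                     stringsAsFactors = FALSE)
print(result)
```

The single-regex extract() call is shorter, but this version makes the two conditions visible, which may be easier to adapt if a third string format turns up.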