Split a column of strings (with different patterns) based on two different conditions

Time:11-25

I was hoping to get some help with this problem. I have a column with two types of strings, and I need to split the strings into multiple columns using two different conditions. I can figure out how to split each type individually, but I am struggling to add something like an IF statement to my code. This is the example dataset:

data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))

For the first type of string (with a single _), I would like to split at the _. So I used the following code for that:

strsplit(data$string, "_")

For strings that have .docx in them, I would like to split after the docx. I cannot split on "_" because it appears multiple times in this string. So I used the following code:

strsplit(data$string, "x_")

My question is: both of these types of strings appear in the same column. Is there a way to tell R that if "docx" is in the string it should split after x_, but if it is not, it should split on the _?

Any help would be appreciated - Thank you guys!

CodePudding user response:

Here's a tidyr solution:

library(tidyr)
data %>%
extract(string,
        into = c("1","2"),    # choose your own column labels
        "(.*?)_([^_]+)$")
                                1        2
1                    HFUFN-087836      661
2 207465-125 - IK_6 Mar 2009.docx 37484956

How the regex works:

The regex partitions the strings into two "capture groups" plus an underscore in-between:

  • (.*?): first capture group, matching any character (.) zero or more times (*) non-greedily (?)
  • _: a literal underscore
  • ([^_]+)$: the second capture group, matching any character that is not an underscore ([^_]) one or more times (+) at the very end of the string ($)
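
To see the two capture groups in isolation, here is a minimal base R sketch that applies the same pattern with sub(). The vector name x and the result names left and right are made up for illustration; perl = TRUE is set so that the non-greedy *? is handled by the PCRE engine.

```r
# Same example strings as in the question
x <- c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956")

# Same pattern as in the tidyr call: lazy group, underscore, non-underscores to the end
pattern <- "(.*?)_([^_]+)$"

# Replace the whole match with the first or the second capture group
left  <- sub(pattern, "\\1", x, perl = TRUE)
right <- sub(pattern, "\\2", x, perl = TRUE)

print(left)
print(right)
```

Because the second group cannot contain an underscore and is anchored at the end of the string, the split always lands on the last underscore, whether or not "docx" is present.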

Data:

data = data.frame(string=c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"))
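
If you do want the literal "if docx, else" logic from the question, a rough base R sketch could look like the one below. It is only an illustration, not part of the tidyr answer: the helper names has_docx and split_at and the column names left and right are invented here, and it assumes every string contains at least one underscore.

```r
data <- data.frame(
  string = c("HFUFN-087836_661", "207465-125 - IK_6 Mar 2009.docx_37484956"),
  stringsAsFactors = FALSE  # keep plain character strings on older R versions too
)

# TRUE for rows that contain the literal text "docx"
has_docx <- grepl("docx", data$string, fixed = TRUE)

# Position of the underscore to split at:
# the one right after "docx" if present, otherwise the first underscore
split_at <- as.integer(ifelse(
  has_docx,
  regexpr("docx_", data$string, fixed = TRUE) + 4L,
  regexpr("_", data$string, fixed = TRUE)
))

# Everything before the chosen underscore, and everything after it
data$left  <- substr(data$string, 1, split_at - 1)
data$right <- substr(data$string, split_at + 1, nchar(data$string))

print(data[, c("left", "right")])
```

For the two example strings this gives the same split as the regex above; the single-regex version is shorter and does not need the explicit condition.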