splitting text to create new variable

Time:01-10

I have this first dataset, and I want to create the desired dataset by splitting the text in its Name column. How can I do this?

Basically, the new variables should be split right after "XYZ-1" or "AAA-2". I appreciate any help. Thanks!

1st dataset:

Name <- c("A B XYZ-1 Where","C AAA-2 When","ABC R SS XYZ-1 Where")
x <- data.frame(Name)

desired dataset:

Name <- c("A B XYZ-1 Where","C AAA-2 When","ABC R SS XYZ-1 Where")
Study <- c("A B XYZ-1","C AAA-2","ABC R SS XYZ-1")
Question <- c("Where","When","Where")
x <- data.frame(Name,Study,Question)

Name                      Study             Question

A B XYZ-1 Where           A B XYZ-1         Where       
C AAA-2 When              C AAA-2           When        
ABC R SS XYZ-1 Where      ABC R SS XYZ-1    Where

CodePudding user response:

Use separate and pass a regex with lookarounds as sep, so that it matches one or more whitespace characters (\\s+) that follow three uppercase letters, a - and a digit ([A-Z]{3}-\\d) and that precede an uppercase letter ([A-Z]):

library(tidyr)
separate(x, Name, into = c("Study", "Question"), 
     sep = "(?<=[A-Z]{3}-\\d)\\s+(?=[A-Z])", remove = FALSE)

-output

                  Name          Study Question
1      A B XYZ-1 Where      A B XYZ-1    Where
2         C AAA-2 When        C AAA-2     When
3 ABC R SS XYZ-1 Where ABC R SS XYZ-1    Where
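
To see what the lookarounds do on their own, the same pattern can be tried with base R's strsplit and perl = TRUE. This is only a small sketch on the sample data from the question. The names pattern and parts are made up for illustration, and it assumes every study code ends in three uppercase letters, a hyphen and one digit.

```r
# Sample data from the question
Name <- c("A B XYZ-1 Where", "C AAA-2 When", "ABC R SS XYZ-1 Where")

# Lookbehind: the whitespace must come right after e.g. "XYZ-1"
# Lookahead:  the whitespace must be followed by an uppercase letter
pattern <- "(?<=[A-Z]{3}-\\d)\\s+(?=[A-Z])"

# Only the whitespace itself is consumed by the split;
# the text matched inside the lookarounds stays in the pieces
parts <- strsplit(Name, pattern, perl = TRUE)
parts
```

Because the earlier spaces (for example between "A" and "B") are not preceded by a code like "XYZ-1", they are left alone and each string is cut in exactly one place.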

CodePudding user response:

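Here is a base R solution using strsplit with a regex that splits on the last space, i.e. the space followed only by non-space characters up to the end of the string:

# as.character() guards against Name being a factor in R < 4.0
df <- data.frame(do.call(rbind, strsplit(as.character(x$Name), ' (?=[^ ]+$)', perl = TRUE)))
colnames(df) <- c("Study", "Question")
cbind(x[1], df)

-output
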
                  Name          Study Question
1      A B XYZ-1 Where      A B XYZ-1    Where
2         C AAA-2 When        C AAA-2     When
3 ABC R SS XYZ-1 Where ABC R SS XYZ-1    Where
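
If you would rather avoid the lookahead, the same "split at the last space" idea can be sketched with sub and a capture group. This is a variant, not part of the answer above, and it assumes Question is always a single word with no spaces in it.

```r
# Sample data from the question
Name <- c("A B XYZ-1 Where", "C AAA-2 When", "ABC R SS XYZ-1 Where")
x <- data.frame(Name)

# "^(.*) " is greedy, so it runs up to the LAST space;
# "[^ ]+$" is the final word with no spaces in it
x$Study    <- sub("^(.*) [^ ]+$", "\\1", x$Name)
x$Question <- sub("^.* ([^ ]+)$", "\\1", x$Name)
x
```

This adds the two columns directly to x, so no rbind or cbind step is needed.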