Subset text and fill out results in cols based on multiple conditions in R-CodePudding

Let us say the dataset I have is like this:

label<-c(" "," ", " ", " ", " "," "," ")
def<-c("54.abc","55a.yy?(choice=agd)","55a.yy?(choice=sbs)","55a.yy?(choice=was)","55a.yy?(choice=bbs)","56b.zz?(choice=ksx)","56b.zz?(coice=ixi")
term<-c("","","","","","","")
parentlabel<-c("","","","","","","")
df<-data.frame(label, def, term, parentlabel)

So the initial dataset looks like this:

label     def                    term   parentlabel
          54.abc
          55a.yy?(choice=agd)
          55a.yy?(choice=sbs)
          55a.yy?(choice=was)
          55a.yy?(choice=bbs)
          56b.zz?(choice=ksx)
          56b.zz?(choice=ixi)

What I want to do is:

Step1: Decide whether the col "def" has same repeated beginning name (like 55a. 56b.)
If Step1(they have same beginning name) is true, I need to add a new row: a) For the new row, the col of "label" is filled out the same part of name for each cluster (like 55a.yy? 56bzz?); b) The col of "term" is filled out as "multifilter"; c) The col of "def" is write by any text (like here I redefine it as "question55a");

For the existing rows: a) the col of label is filled out the different part of the name (like agd, sbs, was, bbs, ksx, ixi); b) The col of "term" will filled out "value"; c) the col of "parentlabel" is filled out the same as col "def" for added row ;
If step1 is False, then pass col "def" to col "label" (like:label=def);

So the ideal output would look like:

label     def                    term                  parentlabel
          54.abc
55a.yy    question55a            multifilter
agd       55a.yy?(choice=agd)    value                  question55a
sbs       55a.yy?(choice=sbs)    value                  question55a
was       55a.yy?(choice=was)    value                  question55a
bbs       55a.yy?(choice=bbs)    value                  question55a
56b.zz    question56b           multifilter
ksx       56b.zz?(choice=ksx)    value                  question56b
ixi       56b.zz?(choice=ixi)    value                  question56b

It is a little bit complicated process,I can only write the process but fail to convert them as a chunk of code. Hence, I appreciate it if someone could help. Thanks a lot~~!

CodePudding user response：

I'm not convinced that this approach is very stable applied to a larger dataset, but it could give you some clues how to handle this:

library(dplyr)
library(stringr)
library(purrr)

df %>% 
  group_by(grp = str_extract(def, "^\\d{2}\\w*.\\w{2}")) %>% 
  group_split() %>% 
  map_dfr(
    ~add_row(
      .x, 
      label = unique(.$grp), 
      def = paste0("question", str_extract(label, ".*(?=\\.)")),
      term = "multifilter",
      .before = 1) %>% 
      mutate(
        label = ifelse(!is.na(grp), str_extract(def, "(?<=\\w=)\\w*"), label),
        term = ifelse(!is.na(grp), "value", term),
        parentlabel = ifelse(!is.na(grp), first(def), parentlabel)
        )
    ) %>% 
  select(-grp) %>% 
  filter(!str_detect(def, "\\d$")) %>% 
  mutate(across(c(term, parentlabel), 
                ~ ifelse(str_detect(def, "\\d \\."), NA, .x))
  )

By using a lot of tidyverse functions, you get

# A tibble: 9 x 4
  label  def                 term        parentlabel
  <chr>  <chr>               <chr>       <chr>      
1 NA     54.abc              NA          NA         
2 55a.yy question55a         multifilter NA         
3 agd    55a.yy?(choice=agd) value       question55a
4 sbs    55a.yy?(choice=sbs) value       question55a
5 was    55a.yy?(choice=was) value       question55a
6 bbs    55a.yy?(choice=bbs) value       question55a
7 56b.zz question56b         multifilter NA         
8 ksx    56b.zz?(choice=ksx) value       question56b
9 ixi    56b.zz?(coice=ixi)  value       question56b