Let us say the dataset I have is like this:
label<-c(" "," ", " ", " ", " "," "," ")
def<-c("54.abc","55a.yy?(choice=agd)","55a.yy?(choice=sbs)","55a.yy?(choice=was)","55a.yy?(choice=bbs)","56b.zz?(choice=ksx)","56b.zz?(coice=ixi")
term<-c("","","","","","","")
parentlabel<-c("","","","","","","")
df<-data.frame(label, def, term, parentlabel)
So the initial dataset looks like this:
label def term parentlabel
54.abc
55a.yy?(choice=agd)
55a.yy?(choice=sbs)
55a.yy?(choice=was)
55a.yy?(choice=bbs)
56b.zz?(choice=ksx)
56b.zz?(choice=ixi)
What I want to do is:
Step1: Decide whether the col "def" has same repeated beginning name (like 55a. 56b.)
If Step1(they have same beginning name) is true, I need to add a new row: a) For the new row, the col of "label" is filled out the same part of name for each cluster (like 55a.yy? 56bzz?); b) The col of "term" is filled out as "multifilter"; c) The col of "def" is write by any text (like here I redefine it as "question55a");
For the existing rows: a) the col of label is filled out the different part of the name (like agd, sbs, was, bbs, ksx, ixi); b) The col of "term" will filled out "value"; c) the col of "parentlabel" is filled out the same as col "def" for added row ;
If step1 is False, then pass col "def" to col "label" (like:label=def);
So the ideal output would look like:
label def term parentlabel
54.abc
55a.yy question55a multifilter
agd 55a.yy?(choice=agd) value question55a
sbs 55a.yy?(choice=sbs) value question55a
was 55a.yy?(choice=was) value question55a
bbs 55a.yy?(choice=bbs) value question55a
56b.zz question56b multifilter
ksx 56b.zz?(choice=ksx) value question56b
ixi 56b.zz?(choice=ixi) value question56b
It is a little bit complicated process,I can only write the process but fail to convert them as a chunk of code. Hence, I appreciate it if someone could help. Thanks a lot~~!
CodePudding user response:
I'm not convinced that this approach is very stable applied to a larger dataset, but it could give you some clues how to handle this:
library(dplyr)
library(stringr)
library(purrr)
df %>%
group_by(grp = str_extract(def, "^\\d{2}\\w*.\\w{2}")) %>%
group_split() %>%
map_dfr(
~add_row(
.x,
label = unique(.$grp),
def = paste0("question", str_extract(label, ".*(?=\\.)")),
term = "multifilter",
.before = 1) %>%
mutate(
label = ifelse(!is.na(grp), str_extract(def, "(?<=\\w=)\\w*"), label),
term = ifelse(!is.na(grp), "value", term),
parentlabel = ifelse(!is.na(grp), first(def), parentlabel)
)
) %>%
select(-grp) %>%
filter(!str_detect(def, "\\d$")) %>%
mutate(across(c(term, parentlabel),
~ ifelse(str_detect(def, "\\d \\."), NA, .x))
)
By using a lot of tidyverse
functions, you get
# A tibble: 9 x 4
label def term parentlabel
<chr> <chr> <chr> <chr>
1 NA 54.abc NA NA
2 55a.yy question55a multifilter NA
3 agd 55a.yy?(choice=agd) value question55a
4 sbs 55a.yy?(choice=sbs) value question55a
5 was 55a.yy?(choice=was) value question55a
6 bbs 55a.yy?(choice=bbs) value question55a
7 56b.zz question56b multifilter NA
8 ksx 56b.zz?(choice=ksx) value question56b
9 ixi 56b.zz?(coice=ixi) value question56b