how reshape one part of the data in R-CodePudding

I have a dataset that looks like this:

df <- tibble::tribble(
  ~subcateg, ~names,
  "A00", "Kidney failure",
  "A001", "Kidney failure reason1",
  "A002", "Kidney failure reason2",
  "A003", "Kidney failure reason3",
  "B00", "Heart failure",
  "B001", "Heart failure reason1",
  "B002", "Heart failure reason2",
  "B003", "Heart failure reason3",
  "B00", "Lung failure",
  "B001", "Lung failure reason1",
  "B002", "Lung failure reason2",
  "B003", "Lung failure reason3",
)

It has categories (3 characters) and subcategories (4 characters) in the same variable, and I need another variable with the category of 3 characters. I would like it to look like this:

df2 <- tibble::tribble(
  ~subcateg, ~names, ~categ, ~names2,
  "A001", "Kidney failure reason1", "A00", "Kidney failure",
  "A002", "Kidney failure reason2","A00", "Kidney failure",
  "A003", "Kidney failure reason3","A00", "Kidney failure",
  "B001", "Heart failure reason1",  "B00", "Heart failure",
  "B002", "Heart failure reason2",  "B00", "Heart failure",
  "B003", "Heart failure reason3",  "B00", "Heart failure",
  "B001", "Lung failure reason1",  "B00", "Lung failure",
  "B002", "Lung failure reason2",  "B00", "Lung failure",
  "B003", "Lung failure reason3",  "B00", "Lung failure",
)

Any ideas? Thank you very much!

CodePudding user response：

We create a grouping variable based on the occurrence of 3 characters (nchar) from 'subcateg'), create the 'categ' as the first element of 'subcateg', remove the first row (slice), and create the 'names2' by removing the reason followed by digits substring from the 'names' column

library(dplyr)
library(stringr)
df %>% 
   group_by(grp = cumsum(nchar(subcateg) == 3)) %>%  
   mutate(categ = first(subcateg)) %>% 
   slice(-1) %>% 
   ungroup %>% 
   select(-grp) %>%
   mutate(names2 = str_remove(names, "\\s reason\\d "))

-output

# A tibble: 9 × 4
  subcateg names                  categ names2        
  <chr>    <chr>                  <chr> <chr>         
1 A001     Kidney failure reason1 A00   Kidney failure
2 A002     Kidney failure reason2 A00   Kidney failure
3 A003     Kidney failure reason3 A00   Kidney failure
4 B001     Heart failure reason1  B00   Heart failure 
5 B002     Heart failure reason2  B00   Heart failure 
6 B003     Heart failure reason3  B00   Heart failure 
7 B001     Lung failure reason1   B00   Lung failure  
8 B002     Lung failure reason2   B00   Lung failure  
9 B003     Lung failure reason3   B00   Lung failure

CodePudding user response：

If the lung failure category begins with C (and not with B) -- is that a mistake? --, another solution is the following:

library(tidyr)
library(dplyr)

df %>% 
  separate(subcateg,"categ",sep = "[1-9]", extra = "drop", remove = F) %>% 
  inner_join(df,by=c("categ" = "subcateg"),suffix = c("", "2")) %>% 
  filter(!stringr::str_ends(subcateg,"00")) %>% 
  relocate(categ, .after = names)