Home > OS >  tidyr::extract with Regex
tidyr::extract with Regex

Time:03-18

I have the following data frame:

df <- data.frame(
  Name= c('AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk','ARTHROTEC 50 Bitabs 50 0.2mg 50Stk','GLUCOPHAGE Filmtabl 850mg 100Stk'),
  Aug20Cu= c(1000,1831,7430),
  Sep20Cu= c(899,822, 1000)
) 
 
  Name                                                 Aug20Cu Sep20C
1  AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk    1000    899     
2  ARTHROTEC 50 Bitabs 50 0.2mg 50Stk                  1831    822    
3  GLUCOPHAGE Filmtabl 850mg 100Stk                    7430    1000    

I would like to extract the different numbers of the first column "Name" into separate columns, achieving therefore the following result:

   Name                                              a      b     c    Aug20Cu Sep20C
1  AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk  10     160   25   1000    899     
2  ARTHROTEC 50 Bitabs 50 0.2mg 50Stk                50     0.2   NA   1831    822    
3  GLUCOPHAGE Filmtabl 850mg 100Stk                  850    NA    NA   7430    1000    

I have tried the following code:

df<-df %>% tidyr::extract(Name,c("a", "b", "c"),'(\\d (?=\\ ))(\\d (?=\\ ))(\\d (?=mg))',convert=TRUE, remove=FALSE)

or

df<-df %>% tidyr::extract(Name,c("a", "b", "c"),'(\d (?=\ |mg))',convert=TRUE, remove=FALSE)

I don't really understand Regex and I have therefore no idea of what I'm doing wrong. I have tried to create the last Regex code in "regex101.com" and there it seems to work but as soon I try it in R I have a weird result (the first letter of the strings Name).

CodePudding user response:

library(data.table)
# Make df a data.table
setDT(df)
# Extract the part ending on "mg", split on   sign
temp <- tstrsplit(gsub(".* ([0-9 .] ?)mg.*", "\\1", df$Name), "\\ ", perl = TRUE)
df[, letters[1:length(temp)] := temp][]
#                                                Name Aug20Cu Sep20Cu   a    b    c
# 1: AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk    1000     899  10  160   25
# 2:               ARTHROTEC 50 Bitabs 50 0.2mg 50Stk    1831     822  50  0.2 <NA>
# 3:                 GLUCOPHAGE Filmtabl 850mg 100Stk    7430    1000 850 <NA> <NA>

CodePudding user response:

An option is to extract the part with str_extract_all and unnest

library(dplyr)
library(stringr)
library(tidyr)
df %>% 
  mutate(col1 = str_extract_all(Name, "(?<=\\ )[0-9.] |[0-9.] (?=\\ |mg)")) %>% 
  unnest_wider(col1) %>%
  rename_with(~ c("a", "b", "c"), starts_with("...")) %>% 
  type.convert(as.is = TRUE)

-output

# A tibble: 3 × 6
  Name                                             Aug20Cu Sep20Cu     a     b     c
  <chr>                                              <int>   <int> <int> <dbl> <int>
1 AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk    1000     899    10 160      25
2 ARTHROTEC 50 Bitabs 50 0.2mg 50Stk                  1831     822    50   0.2    NA
3 GLUCOPHAGE Filmtabl 850mg 100Stk                    7430    1000   850  NA      NA
  • Related