I have the following data frame:
df <- data.frame(
Name= c('AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk','ARTHROTEC 50 Bitabs 50 0.2mg 50Stk','GLUCOPHAGE Filmtabl 850mg 100Stk'),
Aug20Cu= c(1000,1831,7430),
Sep20Cu= c(899,822, 1000)
)
Name Aug20Cu Sep20C
1 AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk 1000 899
2 ARTHROTEC 50 Bitabs 50 0.2mg 50Stk 1831 822
3 GLUCOPHAGE Filmtabl 850mg 100Stk 7430 1000
I would like to extract the different numbers of the first column "Name" into separate columns, achieving therefore the following result:
Name a b c Aug20Cu Sep20C
1 AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk 10 160 25 1000 899
2 ARTHROTEC 50 Bitabs 50 0.2mg 50Stk 50 0.2 NA 1831 822
3 GLUCOPHAGE Filmtabl 850mg 100Stk 850 NA NA 7430 1000
I have tried the following code:
df<-df %>% tidyr::extract(Name,c("a", "b", "c"),'(\\d (?=\\ ))(\\d (?=\\ ))(\\d (?=mg))',convert=TRUE, remove=FALSE)
or
df<-df %>% tidyr::extract(Name,c("a", "b", "c"),'(\d (?=\ |mg))',convert=TRUE, remove=FALSE)
I don't really understand Regex and I have therefore no idea of what I'm doing wrong. I have tried to create the last Regex code in "regex101.com" and there it seems to work but as soon I try it in R I have a weird result (the first letter of the strings Name).
CodePudding user response:
library(data.table)
# Make df a data.table
setDT(df)
# Extract the part ending on "mg", split on sign
temp <- tstrsplit(gsub(".* ([0-9 .] ?)mg.*", "\\1", df$Name), "\\ ", perl = TRUE)
df[, letters[1:length(temp)] := temp][]
# Name Aug20Cu Sep20Cu a b c
# 1: AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk 1000 899 10 160 25
# 2: ARTHROTEC 50 Bitabs 50 0.2mg 50Stk 1831 822 50 0.2 <NA>
# 3: GLUCOPHAGE Filmtabl 850mg 100Stk 7430 1000 850 <NA> <NA>
CodePudding user response:
An option is to extract the part with str_extract_all
and unnest
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(col1 = str_extract_all(Name, "(?<=\\ )[0-9.] |[0-9.] (?=\\ |mg)")) %>%
unnest_wider(col1) %>%
rename_with(~ c("a", "b", "c"), starts_with("...")) %>%
type.convert(as.is = TRUE)
-output
# A tibble: 3 × 6
Name Aug20Cu Sep20Cu a b c
<chr> <int> <int> <int> <dbl> <int>
1 AMLOD VALSAR HCT MPH Filmtabl 10 160 25mg 100Stk 1000 899 10 160 25
2 ARTHROTEC 50 Bitabs 50 0.2mg 50Stk 1831 822 50 0.2 NA
3 GLUCOPHAGE Filmtabl 850mg 100Stk 7430 1000 850 NA NA