Regex to catch similar matching word until it hits a number

I have this df:

data1 <- structure(list(attr = c("kind1", "kind2", "kind3", "price1", 
"price2", "packing1", "weight1", "weight2", "calorie1"), coef = c(-1.08908045977012, 
-0.732758620689656, -0.922413793103449, -0.570881226053641, 0.118773946360153, 
-0.0287356321839081, -0.168582375478927, 0.173371647509578, -0.646551724137931
), pval = c(0.0000000461586619475345, 0.000225855110699109, 0.00000354973103147522, 
0.000189625500287816, 0.506777189443937, 0.801713589134903, 0.269271977099465, 
0.33257496253009, 0.0000000192904668116847)), row.names = c(NA, 
-9L), class = "data.frame")

#      attr        coef             pval
#1    kind1 -1.08908046 0.00000004615866
#2    kind2 -0.73275862 0.00022585511070
#3    kind3 -0.92241379 0.00000354973103
#4   price1 -0.57088123 0.00018962550029
#5   price2  0.11877395 0.50677718944394
#6 packing1 -0.02873563 0.80171358913490
#7  weight1 -0.16858238 0.26927197709946
#8  weight2  0.17337165 0.33257496253009
#9 calorie1 -0.64655172 0.00000001929047

I'm trying to sum by group, where the groups are identified by a regex that matches the attribute names up to a certain point; in this case, until a number appears.

For example, in the case of my variables, there would be 5 groups:

kind
Total = kind sum
price
Total = price sum
packing
Total = packing sum
weight
Total = weight sum
calorie
Total = calorie sum

I wrote the code below, but I don't know how to build the regex or where to put it. I tried stringr but couldn't get what I want:

data1 %>%
  dplyr::arrange(attr) %>%
  split(f = .[,"attr"]) %>%
  purrr::map_df(., janitor::adorn_totals)

#     attr        coef             pval
# calorie1 -0.64655172 0.00000001929047
#    Total -0.64655172 0.00000001929047
#    kind1 -1.08908046 0.00000004615866
#    Total -1.08908046 0.00000004615866
#    kind2 -0.73275862 0.00022585511070
#    Total -0.73275862 0.00022585511070
#    kind3 -0.92241379 0.00000354973103
#    Total -0.92241379 0.00000354973103
# packing1 -0.02873563 0.80171358913490
#    Total -0.02873563 0.80171358913490
#   price1 -0.57088123 0.00018962550029
#    Total -0.57088123 0.00018962550029
#   price2  0.11877395 0.50677718944394
#    Total  0.11877395 0.50677718944394
#  weight1 -0.16858238 0.26927197709946
#    Total -0.16858238 0.26927197709946
#  weight2  0.17337165 0.33257496253009
#    Total  0.17337165 0.33257496253009

It totals each row separately, because the group names differ by their trailing number. I need a regex that captures this:

kind
price
packing
weight
calorie

That is, capture the letters up to the first number.

CodePudding user response：

You can create a grouping variable by removing the digits from the attr variable, and then use group_modify:

library(dplyr)    # group_by(), group_modify(), select()
library(stringr)  # str_remove_all()

data1 %>% 
  group_by(grp = str_remove_all(attr, "[0-9]")) %>% 
  group_modify(janitor::adorn_totals, where = "row") %>%
  ungroup() %>% 
  select(-grp)

#  # A tibble: 14 × 3
#  attr         coef           pval
#  <chr>       <dbl>          <dbl>
#  1 calorie1 -0.647   0.0000000193
#  2 Total    -0.647   0.0000000193
#  3 kind1    -1.09    0.0000000462
#  4 kind2    -0.733   0.000226    
#  5 kind3    -0.922   0.00000355  
#  6 Total    -2.74    0.000229    
#  7 packing1 -0.0287  0.802       
#  8 Total    -0.0287  0.802       
#  9 price1   -0.571   0.000190    
# 10 price2    0.119   0.507       
# 11 Total    -0.452   0.507       
# 12 weight1  -0.169   0.269       
# 13 weight2   0.173   0.333       
# 14 Total     0.00479 0.602
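If you want to check the grouping key before summing, here is a minimal base R sketch comparing three ways of dropping the number. The attr vector is a shortened copy of yours, and "v2kind1" is a made-up name (not in your data) added only to show where the patterns disagree. gsub("[0-9]", "", x) should behave like str_remove_all(x, "[0-9]") for these inputs.

```r
# Shortened copy of the attr column; "v2kind1" is hypothetical, added for illustration
attr <- c("kind1", "kind2", "price1", "calorie1", "v2kind1")

# Remove every digit, wherever it occurs
all_digits <- gsub("[0-9]", "", attr)

# Remove only the run of digits at the end of the string
trailing <- sub("[0-9]+$", "", attr)

# Keep only what comes before the first digit
before_first <- sub("[0-9].*$", "", attr)

# Compare the three candidate grouping keys side by side
print(data.frame(attr, all_digits, trailing, before_first))
```

For names shaped like yours (letters followed by one number) all three give the same key, so which one to use only matters if digits can show up elsewhere in the name.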

CodePudding user response：

Something like this: we can extract the leading letters with str_extract() and pass them to group_split(), which returns a list of data frames. We can then iterate over that list with map_dfr(), applying adorn_totals() to each element:

library(tidyverse)
library(janitor)

data1 %>% 
  group_split(id=str_extract(attr, '[A-Za-z]+')) %>% 
  map_dfr(., adorn_totals) %>% 
  select(-id) %>% 
  as_tibble()

   attr         coef         pval
   <chr>       <dbl>        <dbl>
 1 calorie1 -0.647   0.0000000193
 2 Total    -0.647   0.0000000193
 3 kind1    -1.09    0.0000000462
 4 kind2    -0.733   0.000226    
 5 kind3    -0.922   0.00000355  
 6 Total    -2.74    0.000229    
 7 packing1 -0.0287  0.802       
 8 Total    -0.0287  0.802       
 9 price1   -0.571   0.000190    
10 price2    0.119   0.507       
11 Total    -0.452   0.507       
12 weight1  -0.169   0.269       
13 weight2   0.173   0.333       
14 Total     0.00479 0.602
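If only the per-group totals are needed, without the original rows in between, a base R sketch along these lines might be enough. It assumes the coef values rounded to two decimals as printed in the question (so the sums are approximate), and the names small, grp and totals are just illustrative.

```r
# coef values rounded to two decimals from the question's table (approximation)
small <- data.frame(
  attr = c("kind1", "kind2", "kind3", "price1", "price2",
           "packing1", "weight1", "weight2", "calorie1"),
  coef = c(-1.09, -0.73, -0.92, -0.57, 0.12, -0.03, -0.17, 0.17, -0.65)
)

# Grouping key: the name with its trailing digits removed
small$grp <- sub("[0-9]+$", "", small$attr)

# One row per group with the summed coef; groups come back sorted by name
totals <- aggregate(coef ~ grp, data = small, FUN = sum)
print(totals)
```

This skips janitor entirely, so it does not produce the interleaved "Total" rows; it just returns one summary row per group.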