Find the maximum number in an R dataframe column of strings-CodePudding

For each cell in a particuar column of a dataframe (which here we will simply name as df), I want to find the value of maximum and minimum number that is originally represented as a string, embedded in a string. Any commas present in the cell have no special significance. These numbers should not be a percentage, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

So two new columns should be created with title 'maximum_number' and 'minimum number, and in the case of the first row the former should say 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.

I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr 'mutate' operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.

Any help on this would be most appreciated please.

CodePudding user response：

library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9] %") %>% str_extract_all("[0-9] ") %>% simplify() %>% as.numeric()
    }),
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA

^{Created on 2022-02-22 by the reprex package (v2.0.0)}

CodePudding user response：

We could use str_extract_all in combination with sapply:

library(stringr)

df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) max(as.integer(x)))

  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    70
2 Second_Row_50%, number40. Number 4. number_15                                  4    50