Home > Software design >  Find the maximum number in an R dataframe column of strings
Find the maximum number in an R dataframe column of strings

Time:02-22

For each cell in a particuar column of a dataframe (which here we will simply name as df), I want to find the value of maximum and minimum number that is originally represented as a string, embedded in a string. Any commas present in the cell have no special significance. These numbers should not be a percentage, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

So two new columns should be created with title 'maximum_number' and 'minimum number, and in the case of the first row the former should say 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.

I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr 'mutate' operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.

Any help on this would be most appreciated please.

CodePudding user response:

library(tidyverse)

tibble(
  particular_col_name = c(
    "First Row String10. This is also a string_5, and so is this 20, exclude70%",
    "Second_Row_50%, number40. Number 4. number_15",
    "20% 30%"
  )
) %>%
  mutate(
    numbers = particular_col_name %>% map(~ {
      .x %>% str_remove_all("[0-9] %") %>% str_extract_all("[0-9] ") %>% simplify() %>% as.numeric()
    }),
    min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
    max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
  ) %>%
  select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#>   particular_col_name                                                  min   max
#>   <chr>                                                              <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
#> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
#> 3 20% 30%                                                               NA    NA

Created on 2022-02-22 by the reprex package (v2.0.0)

CodePudding user response:

We could use str_extract_all in combination with sapply:

library(stringr)

df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) max(as.integer(x)))
  particular_col_name                                                          min   max
  <chr>                                                                      <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70%     5    70
2 Second_Row_50%, number40. Number 4. number_15                                  4    50
  • Related