For each cell in a particuar column of a dataframe (which here we will simply name as df), I want to find the value of maximum and minimum number that is originally represented as a string, embedded in a string. Any commas present in the cell have no special significance. These numbers should not be a percentage, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:
| particular_col_name |
| ------------------- |
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|
So two new columns should be created with title 'maximum_number' and 'minimum number, and in the case of the first row the former should say 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.
I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr 'mutate' operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.
Any help on this would be most appreciated please.
CodePudding user response:
library(tidyverse)
tibble(
particular_col_name = c(
"First Row String10. This is also a string_5, and so is this 20, exclude70%",
"Second_Row_50%, number40. Number 4. number_15",
"20% 30%"
)
) %>%
mutate(
numbers = particular_col_name %>% map(~ {
.x %>% str_remove_all("[0-9] %") %>% str_extract_all("[0-9] ") %>% simplify() %>% as.numeric()
}),
min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
) %>%
select(-numbers)
#> Warning in min(.): no non-missing arguments to min; returning Inf
#> Warning in max(.): no non-missing arguments to max; returning -Inf
#> # A tibble: 3 x 3
#> particular_col_name min max
#> <chr> <dbl> <dbl>
#> 1 First Row String10. This is also a string_5, and so is this 20, e… 5 20
#> 2 Second_Row_50%, number40. Number 4. number_15 4 40
#> 3 20% 30% NA NA
Created on 2022-02-22 by the reprex package (v2.0.0)
CodePudding user response:
We could use str_extract_all
in combination with sapply
:
library(stringr)
df$min <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) min(as.integer(x)))
df$max <- sapply(str_extract_all(df$particular_col_name, "[0-9] "), function(x) max(as.integer(x)))
particular_col_name min max
<chr> <int> <int>
1 First Row String10. This is also a string_5, and so is this 20, exclude70% 5 70
2 Second_Row_50%, number40. Number 4. number_15 4 50