I have a number of strings (CIGARs), and for each one I am trying to sum the numbers that occur before the number preceding "I". The position at which "I" occurs is highly variable, but it always has a number before it.
Here is a sample df:
df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))
My desired output looks like:
String Sum_prior
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
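I have a partial solution, but it can't handle numbers of more than one digit prior to "I", which is problematic.
library(dplyr)
library(stringr)
sum_fun <- function(x) {
# match runs of digits that are not directly followed by "I"
str_match_all(x, "\\d+(?!I)") %>%
unlist() %>%
as.numeric() %>%
sum()
}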
then applying to df:
df <- df %>% rowwise() %>% mutate(output = sum_fun(String))
df
String output
<chr> <dbl>
1 220M1I 220 #Good
2 10I200M 201 #The 1 in 10 is being included
3 5M2D1I20M 27 #Don't want last 20 included
4 22M5D2M3I5M 34 #Don't want last 5 included
But I can't figure out how to adapt the regex to ignore the number immediately prior to "I" and sum all other numbers before "I".
A more advanced example I need (but less important) is to calculate the cumulative number when there is more than one "I": the first occurrence is handled as above (Output_1), but the second (or later) occurrence (Output_2) also includes the number preceding the earlier "I".
df2 <- data.frame(String = c("5M10I200M20I","100M2D3I105M1I10M"))
String Output_1 Output_2
1 5M10I200M20I 5 215
2 100M2D3I105M1I10M 102 210
Any help is appreciated.
CodePudding user response:
Here is a base R approach:
df <- data.frame(String = c("220M1I","10I200M","5M2D1I20M","22M5D2M3I5M"))
x <- sub("\\d+I.*$", "", df$String)
df$Sum_prior <- sapply(strsplit(x, "\\D"), function(y) sum(as.numeric(y)))
df
String Sum_prior
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
The strategy here is to first strip off everything from the first number followed by I to the end of the string. Then we split the remainder on non-digit characters to generate a vector of number strings. Finally, we sum those numbers to get the result.
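To make the three steps concrete, here is a small sketch that traces them for a single string and for the edge case where "I" comes first. The variable names (s, trimmed, parts, total) are only for illustration.

```r
# Trace the strategy on one string from the sample data
s <- "22M5D2M3I5M"

# Step 1: drop the first number followed by "I" and everything after it
trimmed <- sub("\\d+I.*$", "", s)          # "22M5D2M"

# Step 2: split on non-digit characters to get the number strings
parts <- strsplit(trimmed, "\\D")[[1]]     # "22" "5" "2"

# Step 3: convert to numeric and sum
total <- sum(as.numeric(parts))            # 29

# Edge case: "I" is the first operation, so nothing is left to sum
empty_total <- sum(as.numeric(strsplit(sub("\\d+I.*$", "", "10I200M"), "\\D")[[1]]))  # 0
```

Because sub() only replaces the first match, strings with several "I" operations are cut at the first one.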
CodePudding user response:
Another approach is to extract all the numbers followed by letters and sum the numbers before the occurrence of 'I'.
library(dplyr)
library(stringr)
sum_fun <- function(x) {
tmp <- str_match_all(x, "(\\d+)[A-Z]+")[[1]]
sum(as.numeric(tmp[, 2])[seq_len(grep('I', tmp[, 1]) - 1)])
}
df %>%
rowwise() %>%
mutate(output = sum_fun(String)) %>%
ungroup
# String output
# <chr> <dbl>
#1 220M1I 220
#2 10I200M 0
#3 5M2D1I20M 7
#4 22M5D2M3I5M 29
CodePudding user response:
Another base R plus stringr approach in one line:
library(stringr)
df$Sum <- lapply(str_extract_all(sub("\\d+I.*$", "", df$String), "\\d+"), function(x) sum(as.numeric(x)))
String Sum
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
This works in three steps:
- first, the sub operation removes the first number followed by I, plus the rest of the string
- second, str_extract_all extracts all remaining numbers into a list
- third, lapply converts each list element to numeric and sums it
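One thing to be aware of: lapply returns a list, so the Sum column above is a list column rather than a numeric vector. If a plain numeric column is wanted, a sketch along these lines should work. Note that it swaps str_extract_all for base R's regmatches/gregexpr (my substitution, so the example runs without extra packages) and uses vapply in place of lapply.

```r
df <- data.frame(String = c("220M1I", "10I200M", "5M2D1I20M", "22M5D2M3I5M"))

# Step 1: remove the first number followed by "I" and the rest of the string
trimmed <- sub("\\d+I.*$", "", df$String)

# Step 2: extract the remaining numbers (base R stand-in for str_extract_all)
nums <- regmatches(trimmed, gregexpr("\\d+", trimmed))

# Step 3: sum per string; vapply guarantees one numeric value per element
df$Sum <- vapply(nums, function(x) sum(as.numeric(x)), numeric(1))
df
```

The numeric(1) template makes vapply fail loudly if any element does not reduce to a single number, which lapply would silently allow.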
CodePudding user response:
Another option is using dplyr, stringr, and purrr:
library(dplyr)
library(purrr)
library(stringr)
df %>%
# Steps 1 (remove irrelevant part of string) and 2 (extract numbers):
mutate(String_new = str_extract_all(sub("\\d+I.*$", "", String), "\\d+")) %>%
# Step 3: convert to numeric and perform calculation:
mutate(String_new = map_dbl(String_new, function(x) sum(as.numeric(x))))
String String_new
1 220M1I 220
2 10I200M 0
3 5M2D1I20M 7
4 22M5D2M3I5M 29
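The answers above stop at the first "I", so they do not cover the multi-"I" case from the question (Output_1, Output_2). A minimal base R sketch of one way to approach it follows. It assumes every operation is a number followed by a single uppercase letter, and sum_before_each_I is a hypothetical helper name, not something from the answers. The idea is that the value wanted for each "I" is the running total of all numbers that come before the number attached to that "I".

```r
# Hypothetical helper: for each "I" in a CIGAR string, return the sum of
# every number that appears before the number attached to that "I".
# Assumes operations are single uppercase letters (M, I, D, ...).
sum_before_each_I <- function(cigar) {
  lens <- as.numeric(regmatches(cigar, gregexpr("\\d+", cigar))[[1]])
  ops  <- regmatches(cigar, gregexpr("[A-Z]", cigar))[[1]]
  before <- cumsum(lens) - lens   # running total excluding the current number
  before[ops == "I"]
}

df2 <- data.frame(String = c("5M10I200M20I", "100M2D3I105M1I10M"))
res <- lapply(df2$String, sum_before_each_I)
res
```

For the two strings in df2 this yields 5 and 215, then 102 and 210, matching the desired table in the question. On a string with a single "I" it reduces to the Sum_prior value computed by the other answers.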