I have a dataframe with two columns of interest, one which is a binary variable (so either 1 or 0) and the other column is a collection of multiple strings. For instance, there are around 1000 different strings throughout 700k observations. What I would like to do now: I would basically want to create a ratio WITHIN EACH group that computes the share of the sum of observations where the binary variable =1 divided by the sum of observations where the binary variable is either 1 or 0. In addition, i would like to do it with dplyr (in case thats possible).
my base code looks the following, where isFraud is the binary variable and merchantName the variable that represents different groups.
dt_nm %>%
group_by(merchantName) %>%
summarise(ratio_fraud = sum(isFraud == 1) / nrow(merchantName))
I know the code I inserted above does not work, however I would just like to emphasize the way i would want to code to be constructed.
I appreciate any help!!
Many thanks in advance!
CodePudding user response:
Does this give you the expected answer?
library(dplyr)
# example dataset
df <- data.frame(
isFraud = c(1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1),
merchantName = c(rep("merch1", 4), rep("merch2", 4), rep("merch3", 4))
)
df
#> isFraud merchantName
#> 1 1 merch1
#> 2 1 merch1
#> 3 0 merch1
#> 4 0 merch1
#> 5 1 merch2
#> 6 1 merch2
#> 7 1 merch2
#> 8 0 merch2
#> 9 0 merch3
#> 10 0 merch3
#> 11 0 merch3
#> 12 1 merch3
df %>%
group_by(merchantName, .drop = FALSE) %>%
summarise(ratio_fraud = sum(isFraud == 1) / n())
#> # A tibble: 3 × 2
#> merchantName ratio_fraud
#> <chr> <dbl>
#> 1 merch1 0.5
#> 2 merch2 0.75
#> 3 merch3 0.25
Created on 2022-11-10 by the reprex package (v2.0.1)
CodePudding user response:
We could just use mean
to get the proportion as it is a binary column
library(dplyr)
dt_nm %>%
group_by(merchantName) %>%
summarise(ratio_fraud = mean(isFraud), .groups = "drop")
NOTE: nrow(merchantName)
doesn't work as it is a column i.e. vector
in the 'dt_nm' which doesn't have any dim
attributes
> nrow(dt_nm$merchantName)
NULL
In case, if we meant the nrow(dt_nm)
, that would be number of rows of the whole dataset instead of the group size which would be n()