I am working with the R programming language.
In the following link (https://www.geeksforgeeks.org/how-to-find-the-percentage-of-missing-values-in-a-dataframe-in-r/), I found a method to calculate the overall percentage of NAs in a data frame:
# declaring a data frame in R
data_frame = data.frame(C1= c(1, 2, NA, 0),
C2= c( NA, NA, 3, 8),
C3= c("A", "V", "j", "y"),
C4=c(NA,NA,NA,NA))
percentage = mean(is.na(data_frame)) * 100
[1] 43.75
My Question: Is there a way to extend this to count the percentage of "any element" in the data frame?
For example, can this be used to calculate the percentage of 0's in the data set? Or the percentage of times "j" appears in the data? Or the percentage of times "2" appears in the data set?
I can do this manually:
# count percentage of "j" in the data
v1 = nrow(subset(data_frame, C1 == "j"))
v2 = nrow(subset(data_frame, C2 == "j"))
v3 = nrow(subset(data_frame, C3== "j"))
v4 = nrow(subset(data_frame, C4 == "j"))
percentage = ((v1 + v2 + v3 + v4) / ((nrow(data_frame) * ncol(data_frame)))) * 100
[1] 6.25
# count percentage of 0 in the data (the test must use "==", not "="; with "=" subset() treats "C1 = 0" as a named argument, filters nothing, and returns every row)
v1 = nrow(subset(data_frame, C1 == 0))
v2 = nrow(subset(data_frame, C2 == 0))
v3 = nrow(subset(data_frame, C3 == 0))
v4 = nrow(subset(data_frame, C4 == 0))
percentage = ((v1 + v2 + v3 + v4) / ((nrow(data_frame) * ncol(data_frame)))) * 100
But is there a faster way to do this?
Thanks!
CodePudding user response:
You can try to unlist() your data frame into a vector and then test membership with %in%:
vec = unlist(data_frame)
mean(vec %in% "j") * 100 # 6.25
mean(vec %in% "0") * 100 # 6.25
mean(vec %in% NA) * 100 # 43.75
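One thing to keep in mind with this approach: because the columns have different types, unlist() coerces everything to a single character vector, which is why the numeric 0 is matched by "0". If the percentage of every distinct value is wanted at once, the same vector can be passed to table(). This is only a sketch built on the example data above; the names counts and percentages are just illustrative.

```r
data_frame <- data.frame(C1 = c(1, 2, NA, 0),
                         C2 = c(NA, NA, 3, 8),
                         C3 = c("A", "V", "j", "y"),
                         C4 = c(NA, NA, NA, NA),
                         stringsAsFactors = FALSE)

# flatten all columns; mixed types are coerced to character
vec <- unlist(data_frame, use.names = FALSE)

# count every distinct value, keeping NA as its own category
counts <- table(vec, useNA = "ifany")

# turn the counts into percentages of all cells
percentages <- prop.table(counts) * 100

percentages        # one entry per distinct value, including NA
percentages["j"]   # share of "j" only
```

Since the NA count is included in the total, the percentages add up to 100.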
CodePudding user response:
Here is a solution that combines tidyverse and base R.
library(tidyverse)
data_frame %>%
  mutate(across(everything(), ~ .x %in% "j")) %>%
  unlist() %>%
  mean() * 100
Output
[1] 6.25
This can easily be turned into a function:
calc <- function(df, val) {
  df %>%
    mutate(across(everything(), ~ .x %in% val)) %>%
    unlist() %>%
    mean() * 100
}
Output
calc(data_frame, "j") # 6.25
calc(data_frame, "0") # 6.25
calc(data_frame, NA) # 43.75
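If you would rather avoid loading tidyverse, roughly the same idea can be written in base R with sapply(). The sketch below also gives a per-column breakdown, which the versions above do not. The names pct_by_column and pct_overall are made up for this example, and it assumes every column is a plain atomic vector.

```r
data_frame <- data.frame(C1 = c(1, 2, NA, 0),
                         C2 = c(NA, NA, 3, 8),
                         C3 = c("A", "V", "j", "y"),
                         C4 = c(NA, NA, NA, NA),
                         stringsAsFactors = FALSE)

# hypothetical helper: percentage of cells equal to `val`, per column
pct_by_column <- function(df, val) {
  # sapply() returns a logical matrix with one column per data frame column
  hits <- sapply(df, function(col) col %in% val)
  colMeans(hits) * 100
}

# hypothetical helper: percentage of cells equal to `val`, whole data frame
pct_overall <- function(df, val) {
  hits <- sapply(df, function(col) col %in% val)
  mean(hits) * 100
}

pct_by_column(data_frame, "j")   # only C3 contains "j"
pct_overall(data_frame, "j")     # same result as calc(data_frame, "j")
pct_overall(data_frame, NA)
```

Here the comparison is done column by column, so each column is matched against val on its own rather than after being merged into one vector.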