I am analyzing several variables that I'm creating from a big database. They are mostly dummies or categorical, and they usually come in PAIRS, and they are part of a much larger data frame.
For each pair of variables, I want to print a clean summary:
- Two tables: each one with the frequency of each value (which includes NA even when it's 0);
- A summary with the mean of both
Something like this:
Var01:
0 1 <NA>
50395 40292 0
Var02:
0 1 <NA>
13757 76930 0
Means:
Var01 Var02
1 68.39% 96.39%
I just need to see these results once, not to save them.
The names of the variables are actually complicated (for instance: dm_idade_0a17_pre), and I didn't want to copy and paste them too many times as I was doing before.
I tried to do it by creating temporary variables and using the functions table() and summary(). I used a custom function (which I called percent()) to show the means as percentages.
The problem is just that the table function isn't showing me the NAME of the variable.
So, my coding is something like this:
###########
# CUSTOM FUNCTION
percent <- function(x, digits = 3, format = "f", ...) {
  paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
}
# ORIGINAL DATA FRAME
df <- data.frame(
  ch_name = letters[1:5],
  ch_key = c(1:5))
# 1st new variable =
df$ab_cd <- sample(0:1, 5, replace = TRUE)
# 2nd new variable =
df$ab_cd_e <- sample(0:1, 5, replace = TRUE)
# CREATING TEMPORARY VARIABLES
{
  x1 <- df$ab_cd
  x2 <- df$ab_cd_e
  y1 <- table(x1, useNA = 'always')
  y2 <- table(x2, useNA = 'always')
  z1 <- data.frame(
    "ab_cd" = percent(mean(x1)),
    "ab_cd_e" = percent(mean(x2)))
  # PRINTING THEM
  cat("\014")
  print(y1)
  print(y2)
  z1
}
###########
The result I would get is this:
x1
0 1 <NA>
2 3 0
x2
0 1 <NA>
3 2 0
ab_cd ab_cd_e
1 60.000% 40.000%
If the names of the variables x1 and x2 were the original names of the columns I used, my problem would be solved (it's ugly, but better than nothing).
Thank you all for your attention!
(Please: this might look like laziness, but bear in mind that I still need to do this over 80 times. The variable names are long and very similar to one another, which makes CTRL+F or double-clicking too slow. Hope you all understand!)
CodePudding user response:
You could do something like this:
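f <- function(s1, s2) {
  # Uses the global df and percent() defined in the question
  # Label and tabulate the first variable
  cat(s1)
  print(table(df[[s1]], useNA = 'always', deparse.level = 0))
  # Label and tabulate the second variable
  cat(s2)
  print(table(df[[s2]], useNA = 'always', deparse.level = 0))
  # One-row data frame of means, named after the original columns
  setNames(
    data.frame(percent(mean(df[[s1]], na.rm = TRUE)), percent(mean(df[[s2]], na.rm = TRUE))),
    c(s1, s2)
  )
}
Usage:
f("ab_cd", "ab_cd_e")
Output:
ab_cd
0 1 <NA>
1 4 0
ab_cd_e
0 1 <NA>
3 2 0
ab_cd ab_cd_e
1 80.000% 40.000%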
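Since you mention having to repeat this more than 80 times, here is a possible variant, offered as a rough sketch rather than a definitive solution. It passes the data frame as an argument instead of relying on a global df, and it uses the dnn argument of table() so that the printed header is the real column name. The names show_pair and pairs, and the toy data, are made up for illustration; percent() is the helper from the question.

```r
# Helper from the question: format a proportion as a percentage string.
percent <- function(x, digits = 3, format = "f", ...) {
  paste0(formatC(x * 100, format = format, digits = digits, ...), "%")
}

# Hypothetical helper (not part of the original answer): print the two
# frequency tables and the means for one pair of 0/1 columns of `data`.
show_pair <- function(data, s1, s2) {
  cols <- c(s1, s2)
  for (s in cols) {
    # dnn sets the label that print() shows above the counts
    print(table(data[[s]], useNA = "always", dnn = s))
  }
  # Named character vector of formatted means (NA values are ignored)
  means <- vapply(cols,
                  function(s) percent(mean(data[[s]], na.rm = TRUE)),
                  character(1))
  # One-row data frame; check.names = FALSE leaves the column names as they are
  print(data.frame(as.list(means), check.names = FALSE))
  invisible(means)
}

# Toy data, invented for this example
df <- data.frame(
  ab_cd   = c(0, 1, 1, 1, NA),
  ab_cd_e = c(0, 0, 1, 0, 1)
)

# Each pair of column names is typed once; add more pairs to the list as needed
pairs <- list(c("ab_cd", "ab_cd_e"))
for (p in pairs) show_pair(df, p[1], p[2])
```

With this layout each variable name is written only once, in the pairs list, and the loop prints the same three pieces of output for every pair.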