For different analyses I use different samples, and I need to make it transparent how each sample came about. Stata shows "XX observations dropped" after each drop command. Is there a way to get R to report the number of observations dropped in the console during a tidyverse-style sample selection (see below)? In this example I would like to see in the console how many observations were dropped by the filter command and by the drop_na command. I tried summarise_all(~sum(is.na(.))), but it was unsuccessful.
capmkt_df <- stata_df %>%
  filter(change != 1 & reg_mkt == 1) %>%
  select(any_of(capmkt_vars)) %>%
  mutate_at(vars(country, year), factor) %>%
  drop_na()
CodePudding user response:
Since you're using tidyverse packages, a good resource is tidylog, a package that provides additional information for a lot of tidyverse (including dplyr and tidyr) functions.
For example, using drop_na, you'll get a message drop_na: removed X rows. An illustration with the base R airquality dataset:
library(tidyverse)
library(tidylog, warn.conflicts = FALSE)
# The built-in dataset is named airquality (no underscore)
airquality %>%
  drop_na()
# drop_na: removed 42 rows (27%), 111 rows remaining
#    Ozone Solar.R Wind Temp Month Day
# 1     41     190  7.4   67     5   1
# 2     36     118  8.0   72     5   2
# 3     12     149 12.6   74     5   3
# 4     18     313 11.5   62     5   4
# 5     23     299  8.6   65     5   7
# 6     19      99 13.8   59     5   8
# 7      8      19 20.1   61     5   9
# 8     16     256  9.7   69     5  12
# 9     11     290  9.2   66     5  13
# 10    14     274 10.9   68     5  14
# ...
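tidylog wraps filter as well, so the same approach should also report the rows removed by the filter step in the question. A minimal sketch on airquality, assuming tidylog is installed and loaded after dplyr and tidyr so that its wrappers mask theirs; the Month == 5 condition is only an illustrative stand-in for the asker's filter:

```r
library(dplyr)
library(tidyr)
library(tidylog, warn.conflicts = FALSE)  # load last so its wrappers take precedence

# Each step should print its own "removed ... rows" message to the console
may_complete <- airquality %>%
  filter(Month == 5) %>%   # illustrative condition, not from the question
  drop_na()
```

Because the messages come from the wrapped functions themselves, the pipeline in the question would not need to change beyond loading tidylog.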
CodePudding user response:
One option is to print the number of incomplete rows, i.e. sum(!complete.cases(.)), before dropping the NA values. Here, we can use the tee pipe (%T>%) from magrittr to print the result along the way.
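library(tidyverse)
library(magrittr)  # %T>% is not attached by library(tidyverse)
df %>%
  filter(x %in% c(1, 2, NA)) %T>%
  {print(sum(!complete.cases(.)))} %>%  # print the count, pass the data through unchanged
  drop_na()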
Output
The printed 2 shows that two rows were dropped, as each of them contains an NA.
[1] 2
# A tibble: 1 × 2
      x y
  <dbl> <chr>
1     1 a
So, for your code, you could write:
capmkt_df <- stata_df %>%
  filter(change != 1 & reg_mkt == 1) %>%
  select(any_of(capmkt_vars)) %>%
  mutate_at(vars(country, year), factor) %T>%
  {print(sum(!complete.cases(.)))} %>%
  drop_na()
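The tee-pipe check counts only the rows that drop_na will remove; it does not report what filter dropped. One way to cover both steps is to compare row counts before and after each step. A minimal sketch, where report_dropped is a hypothetical helper defined here (not a tidyverse function) and the small tibble mirrors the example data:

```r
library(dplyr)
library(tidyr)

# Hypothetical helper (not part of any package): applies `step` to `data`
# and reports how many rows that step removed.
report_dropped <- function(data, step, label = "step") {
  n_before <- nrow(data)
  result <- step(data)
  n_dropped <- n_before - nrow(result)
  message(sprintf("%s: %d observations dropped, %d remaining",
                  label, n_dropped, nrow(result)))
  result
}

# Same values as the example data
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

cleaned <- df %>%
  report_dropped(function(d) filter(d, x %in% c(1, 2)), "filter") %>%
  report_dropped(drop_na, "drop_na")
```

Each step is passed as a function, so the helper can wrap filter, drop_na, or any other row-removing step without knowing its arguments.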
Data
df <- structure(list(x = c(1, 2, NA), y = c("a", NA, "b")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -3L))