Suppose I have a dataframe df as below:
date CPI:YoY PPI:MoM
1 2018-01-15 NA NA
2 2018-01-31 1.2 0.2
3 2018-01-15 NA NA
4 2018-02-28 0.9 -0.1
5 2018-03-15 NA NA
6 2018-03-31 1.0 0.3
I need to implement the following two steps with dplyr: select the columns whose names do not contain date, and then drop the rows where all values in those columns are NA.
Selecting the columns whose names do not contain date can be done with dplyr::select(df, -grep("date", names(df))) or dplyr::select(df, -contains("date", ignore.case = TRUE)); dropping the all-NA rows in those columns can be done with filter_at(vars(cols), any_vars(!is.na(.))) or tidyr::drop_na(cols).
Now my question is: how can I combine these two steps into one pipeline? Thanks.
Note: for this example, we could simply use drop_na(`CPI:YoY`, `PPI:MoM`) (the backticks are required because the names are non-syntactic) to get the expected result. But real data usually has many columns, and the date column is not always the first one, so I'd prefer to subset the columns by excluding the date column.
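As a side note on that point, drop_na() itself accepts tidyselect helpers, so the date column can be excluded without naming the others. A sketch (it rebuilds the df so it runs standalone; mind the caveat in the comment):

```r
library(tidyr)
library(dplyr)

# The df from the question (check.names = FALSE keeps the colons in names)
df <- read.table(text =
"  date       CPI:YoY PPI:MoM
1  2018-01-15      NA      NA
2  2018-01-31     1.2     0.2
3  2018-01-15      NA      NA
4  2018-02-28     0.9    -0.1
5  2018-03-15      NA      NA
6  2018-03-31     1.0     0.3", check.names = FALSE)

# Caveat: drop_na() removes rows with an NA in *any* selected column,
# not only the all-NA rows; here the two coincide because the NA rows
# are NA in every non-date column.
out <- df %>% drop_na(-contains("date"))
out
```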
The expected result:
date CPI:YoY PPI:MoM
1 2018-01-31 1.2 0.2
2 2018-02-28 0.9 -0.1
3 2018-03-31 1.0 0.3
CodePudding user response:
You could use filter with if_any/if_all:
df %>%
filter( !if_all(-contains("date"), is.na) )
df %>%
filter( if_any(-contains("date"), ~ !is.na(.)) )
df %>%
filter( if_any(-contains("date"), Negate(is.na)) )
Note: Negate(is.na) is equivalent to function(x) !is.na(x) and to the lambda ~ !is.na(.).
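For comparison, the same row filter can be written in base R without dplyr. A sketch (it rebuilds the df from the Data section below so it runs standalone):

```r
# The df from the question (check.names = FALSE keeps the colons in names)
df <- read.table(text =
"  date       CPI:YoY PPI:MoM
1  2018-01-15      NA      NA
2  2018-01-31     1.2     0.2
3  2018-01-15      NA      NA
4  2018-02-28     0.9    -0.1
5  2018-03-15      NA      NA
6  2018-03-31     1.0     0.3", check.names = FALSE)

# Keep rows where at least one non-date column is non-NA
non_date <- df[, -grep("date", names(df)), drop = FALSE]
res <- df[rowSums(!is.na(non_date)) > 0, ]
res
```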
Output
date CPI:YoY PPI:MoM
2 2018-01-31 1.2 0.2
4 2018-02-28 0.9 -0.1
6 2018-03-31 1.0 0.3
Data
df <- read.table(text =
" date CPI:YoY PPI:MoM
1 2018-01-15 NA NA
2 2018-01-31 1.2 0.2
3 2018-01-15 NA NA
4 2018-02-28 0.9 -0.1
5 2018-03-15 NA NA
6 2018-03-31 1.0 0.3", check.names = FALSE)
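Putting it together, a quick standalone check that the if_any pipeline keeps exactly the three non-NA rows:

```r
library(dplyr)

df <- read.table(text =
"  date       CPI:YoY PPI:MoM
1  2018-01-15      NA      NA
2  2018-01-31     1.2     0.2
3  2018-01-15      NA      NA
4  2018-02-28     0.9    -0.1
5  2018-03-15      NA      NA
6  2018-03-31     1.0     0.3", check.names = FALSE)

# Keep rows where any non-date column is non-NA
res <- df %>%
  filter(if_any(-contains("date"), ~ !is.na(.)))

stopifnot(nrow(res) == 3)  # rows 2, 4 and 6 survive
res
```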