I have a dataframe that looks like
country | sector | data1 | data2 |
---|---|---|---|
France | 1 | 7 | . |
France | 2 | 10 | . |
belgium | 1 | 12 | 7 |
belgium | 2 | 14 | 8 |
I want to subset columns that are missing for a country in all sectors. In this example I would like to drop/exclude column two because it is missing for sector 1 and 2 for france. To be clear I would also be throwing out the values of data2 for belgium in this example.
My expected output would look like
country | sector | data1 |
---|---|---|
France | 1 | 7 |
France | 2 | 10 |
belgium | 1 | 12 |
belgium | 2 | 14 |
data 2 is now excluded because it had a complete set of missing values for all sectors in France
CodePudding user response:
We may group by country, create logical columns where the count of NA
elements are equal to group size, ungroup, replace the corresponding columns to NA based on the logical column and remove those columns in select
library(dplyr)
library(stringr)
df1 %>%
group_by(country) %>%
mutate(across(everything(), ~ sum(is.na(.x)) == n(),
.names = "{.col}_lgl")) %>%
ungroup %>%
mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(),
"_lgl")) )) NA else .x)) %>%
select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))
-output
# A tibble: 4 × 3
country sector data1
<chr> <int> <int>
1 France 1 7
2 France 2 10
3 belgium 1 12
4 belgium 2 14
If we don't use group_by, the steps can be simplified as showed in Maël's
post i.e. do the grouping with a base R function within select
i.e. either tapply
or ave
can work
df1 %>%
select(where(~ !any(tapply(is.na(.x), df1[["country"]],
FUN = all))))
data
df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, NA, 14L), data2 = c(NA,
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")
CodePudding user response:
In base R:
df1 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 2
belgium 1 12 7
belgium 2 14 8")
df2 <- read.table(header = T, text = "country sector data1 data2
France 1 7 NA
France 2 10 NA
belgium 1 12 7
belgium 2 14 8")
df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]
# country sector data1 data2
# 1 France 1 7 NA
# 2 France 2 10 2
# 3 belgium 1 12 7
# 4 belgium 2 14 8
df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
# country sector data1
# 1 France 1 7
# 2 France 2 10
# 3 belgium 1 12
# 4 belgium 2 14
Note: \
replaces function
.
CodePudding user response:
For a base R solution, you can use the apply family on column names and detect if there's any NA in the values of all columns:
keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]])))
data <- data[, keep_remove]