Home > Blockchain >  R subset dataframe where no observations of certain variables
R subset dataframe where no observations of certain variables

Time:10-18

I have a dataframe that looks like

country sector data1 data2
France 1 7 .
France 2 10 .
belgium 1 12 7
belgium 2 14 8

I want to subset columns that are missing for a country in all sectors. In this example I would like to drop/exclude column two because it is missing for sector 1 and 2 for france. To be clear I would also be throwing out the values of data2 for belgium in this example.

My expected output would look like

country sector data1
France 1 7
France 2 10
belgium 1 12
belgium 2 14

data 2 is now excluded because it had a complete set of missing values for all sectors in France

CodePudding user response:

We may group by country, create logical columns where the count of NA elements are equal to group size, ungroup, replace the corresponding columns to NA based on the logical column and remove those columns in select

library(dplyr)
library(stringr)
df1 %>% 
  group_by(country) %>% 
  mutate(across(everything(), ~ sum(is.na(.x)) == n(),
       .names = "{.col}_lgl")) %>% 
  ungroup %>% 
  mutate(across(names(df1)[-1], ~ if(any(get(str_c(cur_column(), 
     "_lgl")) )) NA else .x)) %>% 
   select(c(where(~ !is.logical(.x) && any(complete.cases(.x)))))

-output

# A tibble: 4 × 3
  country sector data1
  <chr>    <int> <int>
1 France       1     7
2 France       2    10
3 belgium      1    12
4 belgium      2    14

If we don't use group_by, the steps can be simplified as showed in Maël's post i.e. do the grouping with a base R function within select i.e. either tapply or ave can work

df1 %>%
   select(where(~ !any(tapply(is.na(.x), df1[["country"]], 
     FUN = all))))

data

df1 <- structure(list(country = c("France", "France", "belgium", "belgium"
), sector = c(1L, 2L, 1L, 2L), data1 = c(7L, 10L, NA, 14L), data2 = c(NA, 
NA, 7L, 8L)), row.names = c(NA, -4L), class = "data.frame")

CodePudding user response:

In base R:

df1 <- read.table(header = T, text = "country   sector  data1   data2
France  1   7   NA
France  2   10  2
belgium 1   12  7
belgium 2   14  8")
df2 <- read.table(header = T, text = "country   sector  data1   data2
France  1   7   NA
France  2   10  NA
belgium 1   12  7
belgium 2   14  8")

df1[!sapply(df1, \(x) any(ave(x, df1$country, FUN = \(y) all(is.na(y)))))]

#   country sector data1 data2
# 1  France      1     7    NA
# 2  France      2    10     2
# 3 belgium      1    12     7
# 4 belgium      2    14     8

df2[!sapply(df2, \(x) any(ave(x, df2$country, FUN = \(y) all(is.na(y)))))]
#   country sector data1
# 1  France      1     7
# 2  France      2    10
# 3 belgium      1    12
# 4 belgium      2    14

Note: \ replaces function.

CodePudding user response:

For a base R solution, you can use the apply family on column names and detect if there's any NA in the values of all columns:

keep_remove <- sapply(names(data), \(x) all(!is.na(data[[x]]))) 
data <- data[, keep_remove]
  •  Tags:  
  • r
  • Related