Home > database >  Given a nested dataframe, how do I subset the said dataframe based on contents of one column within
Given a nested dataframe, how do I subset the said dataframe based on contents of one column within

Time:08-04

For instance, given the following nested dataframe:

library(tidyverse)

mpg_nested = mpg %>% 
  group_by(manufacturer) %>% 
  nest()

How can I use the data within the nested structure to subset the overall nested object? In this example, say I wanted to only keep manufacturers within the nested dataframe that had the category "suv", that is to say the following manufacturers:

library(tidyverse)

mpg %>% 
  group_by(manufacturer) %>% 
  filter(class == "suv") %>% 
  distinct(manufacturer)
#> # A tibble: 10 x 1
#> # Groups:   manufacturer [10]
#>    manufacturer
#>    <chr>       
#>  1 chevrolet   
#>  2 dodge       
#>  3 ford        
#>  4 jeep        
#>  5 land rover  
#>  6 lincoln     
#>  7 mercury     
#>  8 nissan      
#>  9 subaru      
#> 10 toyota

I know how to filter the actual data within the nested structure...

mpg_nested %>% 
  mutate(data = map(data, function(.x){
    .x %>% filter(class == "suv")
  }))

But how do I only keep the nested rows which have suvs?

Cheers

CodePudding user response:

Using purrr::map_lgl() to produce a logical vector input to filter():

mpg_nested %>% 
  filter(
    map_lgl(data, ~any(.x$class == "suv"))
  )
#> # A tibble: 10 x 2
#> # Groups:   manufacturer [10]
#>    manufacturer data              
#>    <chr>        <list>            
#>  1 chevrolet    <tibble [19 × 10]>
#>  2 dodge        <tibble [37 × 10]>
#>  3 ford         <tibble [25 × 10]>
#>  4 jeep         <tibble [8 × 10]> 
#>  5 land rover   <tibble [4 × 10]> 
#>  6 lincoln      <tibble [3 × 10]> 
#>  7 mercury      <tibble [4 × 10]> 
#>  8 nissan       <tibble [13 × 10]>
#>  9 subaru       <tibble [14 × 10]>
#> 10 toyota       <tibble [34 × 10]>

CodePudding user response:

If you've done the fitering of the nested data frames and you want to discard the rows where the nested data frames have 0 rows:

mpg_nested %>% 
  mutate(data = map(data, function(.x){
    .x %>% filter(class == "suv")
  })) %>%
  filter(map_int(data, nrow) > 0)
# # A tibble: 10 × 2
# # Groups:   manufacturer [10]
#    manufacturer data             
#    <chr>        <list>           
#  1 chevrolet    <tibble [9 × 10]>
#  2 dodge        <tibble [7 × 10]>
#  3 ford         <tibble [9 × 10]>
#  4 jeep         <tibble [8 × 10]>
#  5 land rover   <tibble [4 × 10]>
#  6 lincoln      <tibble [3 × 10]>
#  7 mercury      <tibble [4 × 10]>
#  8 nissan       <tibble [4 × 10]>
#  9 subaru       <tibble [6 × 10]>
# 10 toyota       <tibble [8 × 10]>

If you want to keep the nested data frames intact but filter out rows of the top level data frame that where the nested ones don't have SUVs:

mpg_nested %>% 
  filter(map_lgl(data, \(x) "suv" %in% x[["class"]]))
# # A tibble: 10 × 2
# # Groups:   manufacturer [10]
#    manufacturer data              
#    <chr>        <list>            
#  1 chevrolet    <tibble [19 × 10]>
#  2 dodge        <tibble [37 × 10]>
#  3 ford         <tibble [25 × 10]>
#  4 jeep         <tibble [8 × 10]> 
#  5 land rover   <tibble [4 × 10]> 
#  6 lincoln      <tibble [3 × 10]> 
#  7 mercury      <tibble [4 × 10]> 
#  8 nissan       <tibble [13 × 10]>
#  9 subaru       <tibble [14 × 10]>
# 10 toyota       <tibble [34 × 10]>

CodePudding user response:

An alternative idea via group_split()

library(tidyverse)
mpg_nested <- mpg %>%
  group_split(manufacturer) %>%
  map(~ .x %>% filter(class == "suv")) %>%
  purrr::discard(~ nrow(.x) == 0)
  • Related