Home > database >  Filter across all columns after a specific string
Filter across all columns after a specific string

Time:07-02

I am aiming to filter my data if I can find the string Code in any of the columns. It should filter at the first occurrence of this word.

I am working with a list and the columns containing Code change by their arrangement. So I need a general way to do this like the attempt.

Some example data:

# A tibble: 11 × 10
   `Title:`            `A&E weekly activity statistics, NHS…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
   <chr>               <chr>                                  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 Contact:            Paul Steele - [email protected]     NA    NA    NA    NA    NA    NA    NA    NA   
 2 NA                  NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
 3 Provider Level Data NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
 4 NA                  NA                                     NA    A&E … NA    NA    NA    A&E … NA    NA   
 5 SHA                 Code                                   Name  Type… Type… Type… Tota… Type… Type… Type…
 6 -                   -                                      Engl… 2831… 12906 1369… 4330… 15347 34    172  
 7 NA                  NA                                     NA    NA    NA    NA    NA    NA    NA    NA   
 8 Q30                 RLN                                    City… 1423  669   297   2389  202   0     18   
 9 Q30                 RXP                                    Coun… 2473  0     2088  4561  89    0     0    
10 Q30                 5J9                                    Darl… 0     0     0     0     0     0     0    
11 Q30                 RR7                                    Gate… 1251  0     0     1251  24    0     0  

My attempt:

vf %>% filter(row_number() >= across(everything(), ~ .x %>% which(. == 'Code')))

I get these error:

Error in `filter()`:
! Problem while computing `..1 = row_number() >= ...`.
Caused by error in `across()`:
! Problem while computing column `Title:`.
Caused by error in `which()`:
! argument to 'which' is not logical
Run `rlang::last_error()` to see where the error occurred.

Expected output:

# A tibble: 7 × 10
  `Title:` `A&E weekly activity statistics, NHS and indepen…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
  <chr>    <chr>                                              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                               Name  Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                  Engl… 2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                 NA    NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                City… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                Coun… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                Darl… 0     0     0     0     0     0     0    
7 Q30      RR7                                                Gate… 1251  0     0     1251  24    0     0    

Reproducible code:

structure(list(`Title:` = c("Contact:", NA, "Provider Level Data", 
NA, "SHA", "-", NA, "Q30", "Q30", "Q30", "Q30"), `A&E weekly activity statistics, NHS and independent sector organisations in England` = c("Paul Steele - [email protected]", 
NA, NA, NA, "Code", "-", NA, "RLN", "RXP", "5J9", "RR7"), ...3 = c(NA, 
NA, NA, NA, "Name", "England", NA, "City Hospitals Sunderland NHS Foundation Trust", 
"County Durham And Darlington NHS Foundation Trust", "Darlington PCT", 
"Gateshead Health NHS Foundation Trust"), ...4 = c(NA, NA, NA, 
"A&E attendances", "Type 1 Departments - Major A&E", "283175", 
NA, "1423", "2473", "0", "1251"), ...5 = c(NA, NA, NA, NA, "Type 2 Departments - Single Specialty", 
"12906", NA, "669", "0", "0", "0"), ...6 = c(NA, NA, NA, NA, 
"Type 3 Departments - Other A&E/Minor Injury Unit", "136985", 
NA, "297", "2088", "0", "0"), ...7 = c(NA, NA, NA, NA, "Total attendances", 
"433066", NA, "2389", "4561", "0", "1251"), ...8 = c(NA, NA, 
NA, "A&E attendances > 4 hours from arrival to admission, transfer or discharge", 
"Type 1 Departments - Major A&E", "15347", NA, "202", "89", "0", 
"24"), ...9 = c(NA, NA, NA, NA, "Type 2 Departments - Single Specialty", 
"34", NA, "0", "0", "0", "0"), ...10 = c(NA, NA, NA, NA, "Type 3 Departments - Other A&E/Minor Injury Unit", 
"172", NA, "18", "0", "0", "0")), row.names = c(NA, -11L), class = c("tbl_df", 
"tbl", "data.frame"))

CodePudding user response:

The following attempt seems to get what I am after:

data %>% filter(row_number() >= which(keep(.,~any(which(.=='Code'))) == 'Code'))

however, I am sure there are cleaner attempts than this!

What I have done:

  1. filter for row numbers greater than or equal to the logical vector containing Code in any of the columns.
  2. keep retains the column that has Code in it, so filter can do its work.

Output:

# A tibble: 7 × 10
  `Title:` `A&E weekly activity statistics, NHS and indepen…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
  <chr>    <chr>                                              <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                               Name  Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                  Engl… 2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                 NA    NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                City… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                Coun… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                Darl… 0     0     0     0     0     0     0    
7 Q30      RR7                                                Gate… 1251  0     0     1251  24    0     0   

CodePudding user response:

There is a function cumany that can accomplish your task for you.

It returns TRUE for all cases after the first appearance of "Code"

df |> 
  filter(if_any(everything(), ~ cumany(.x %in% "Code")))
`Title:` `A&E weekly activity statistics, NHS and independent sector organisations in England` ...3          ...4  ...5  ...6  ...7  ...8  ...9  ...10
  <chr>    <chr>                                                                                 <chr>         <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 SHA      Code                                                                                  Name          Type… Type… Type… Tota… Type… Type… Type…
2 -        -                                                                                     England       2831… 12906 1369… 4330… 15347 34    172  
3 NA       NA                                                                                    NA            NA    NA    NA    NA    NA    NA    NA   
4 Q30      RLN                                                                                   City Hospita… 1423  669   297   2389  202   0     18   
5 Q30      RXP                                                                                   County Durha… 2473  0     2088  4561  89    0     0    
6 Q30      5J9                                                                                   Darlington P… 0     0     0     0     0     0     0    
7 Q30      RR7                                                                                   Gateshead He… 1251  0     0     1251  24    0     0 

CodePudding user response:

You can use str_detect to detect "Code" across all columns in combination with if_any to see if there is any column with the string and remove all rows before that specific row like this (Thanks to @Darren Tsai):

library(stringr)
library(dplyr)
vf %>% 
  filter(row_number() >= which.max(if_any(everything(), str_detect, "Code")))
#> # A tibble: 7 × 10
#>   `Title:` `A&E weekly activit…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
#>   <chr>    <chr>                 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SHA      Code                  Name  Type… Type… Type… Tota… Type… Type… Type…
#> 2 -        -                     Engl… 2831… 12906 1369… 4330… 15347 34    172  
#> 3 <NA>     <NA>                  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
#> 4 Q30      RLN                   City… 1423  669   297   2389  202   0     18   
#> 5 Q30      RXP                   Coun… 2473  0     2088  4561  89    0     0    
#> 6 Q30      5J9                   Darl… 0     0     0     0     0     0     0    
#> 7 Q30      RR7                   Gate… 1251  0     0     1251  24    0     0

Created on 2022-07-01 by the reprex package (v2.0.1)

Old answer:

You can use str_detect to detect "Code" across all columns and remove all rows before that specific row like this:

library(stringr)
library(dplyr)
vf %>% 
  filter(row_number() > if_all(everything(), ~!str_detect(., "Code")))
#> # A tibble: 6 × 10
#>   `Title:` `A&E weekly activit…` ...3  ...4  ...5  ...6  ...7  ...8  ...9  ...10
#>   <chr>    <chr>                 <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 SHA      Code                  Name  Type… Type… Type… Tota… Type… Type… Type…
#> 2 -        -                     Engl… 2831… 12906 1369… 4330… 15347 34    172  
#> 3 Q30      RLN                   City… 1423  669   297   2389  202   0     18   
#> 4 Q30      RXP                   Coun… 2473  0     2088  4561  89    0     0    
#> 5 Q30      5J9                   Darl… 0     0     0     0     0     0     0    
#> 6 Q30      RR7                   Gate… 1251  0     0     1251  24    0     0

Created on 2022-07-01 by the reprex package (v2.0.1)

  •  Tags:  
  • r
  • Related