R: Is there a better way to apply grepl() for a vector with multiple elements across multiple column-CodePudding

I'm working with a dataset that shows which variables exist in each year. I need a function that will allow me to specify certain years and then return the variables that appear in ALL of those years. In other words, if specify years "70" and "90", I want the function to find all rows with variables in both of those years (based on the variable names, not the column names).

This post gave me the idea to paste all of the year columns together so I only have to apply grepl() to one column. As you can see below, the only way I could figure to match ALL years was to run the for loop twice, once to identify any matches on year and a second time to remove any non-matches. Is there a simpler way to do this?
I'm relatively new to R (and this is my first post here!) so I appreciate any ideas.

# testing data frame (actual data has many more items and years)
item <- c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU")
x1970 <- c("abc70", "70def", "gh70i", "jkl70", "", "70pqr", "stu70x")
x1980 <- c("abc80", "80def", "", "", "", "80pqr", "stu80x")
x1990 <- c("abc90", "90def", "", "", "90mno90", "90pqr", "")
x2000 <- c("", "00def", "gh00i", "jkl00", "00mno00", "00pqr", "")
df <- data.frame(item, x1970, x1980, x1990, x2000)

df
  item  x1970  x1980   x1990   x2000
1  ABC  abc70  abc80   abc90        
2  DEF  70def  80def   90def   00def
3  GHI  gh70i                  gh00i
4  JKL  jkl70                  jkl00
5  MNO               90mno90 00mno00
6  PQR  70pqr  80pqr   90pqr   00pqr
7  STU stu70x stu80x 

year_filter_test <- function(data, year)
{
  # paste all year columns together in "search_columns" so I can apply grepl() to that one column
  data2 <- data %>% mutate(search_columns = paste(x1970, x1980, x1990, x2000) )
  
  # create "include" column which will be TRUE if I want to include this row
  data2$include <- rep( "", nrow(data2))
  
  # loop through each year, making include column TRUE if there's a match on ANY year
  for (i in year)
  { data2$include [ grepl(i, data2$search_columns) ] <- TRUE
  }
  
  # loop through each year again, making include column FALSE if there's NOT a match on ANY year
  # the effect of these two loops is to only include columns that match ALL years
  for (i in year)
  { data2$include [ !grepl(i, data2$search_columns) ] <- FALSE
  }
      
  selected.years <- data2 %>% filter(include==TRUE)
  return(selected.years)
}

CodePudding user response：

Something like this?

library(dplyr)
library(tidyr)

my_function <- function(df, x){
  df %>% 
    pivot_longer(
      cols=-item
    ) %>% 
    mutate(year = parse_number(value)) %>% 
    pivot_wider(
      names_from = name
    ) %>% 
    filter(year == {{x}})
}

my_function(df, c("70", "90"))

  item   year x1970 x1980 x1990   x2000
  <chr> <dbl> <chr> <chr> <chr>   <chr>
1 ABC      70 abc70 NA    NA      NA   
2 DEF      70 70def NA    NA      NA   
3 GHI      70 gh70i NA    NA      NA   
4 MNO      90 NA    NA    90mno90 NA   
5 PQR      90 NA    NA    90pqr   NA

CodePudding user response：

Using across

library(dplyr)
library(stringr)
my_func <- function(dat, years) {
dat %>%
  filter(across(num_range('x', years), ~ str_remove_all(., "\\d ") == tolower(item)))
 
}

-testing

> my_func(df, c('1970', "1990"))
  item x1970 x1980 x1990 x2000
1  ABC abc70 abc80 abc90      
2  DEF 70def 80def 90def 00def
3  PQR 70pqr 80pqr 90pqr 00pqr

CodePudding user response：

We can use dplyr with rowwise, if_all, and stringr::str_detect:

library(dplyr)
library(stringr)

my_func<-function(df, years){
  df%>%
    rowwise()%>%
    mutate(index=if_all(num_range('x', years), ~str_detect(tolower(.x), tolower(item))))
}

my_func(df, c(1970, 1990))

# A tibble: 7 × 6
# Rowwise: 
  item  x1970    x1980    x1990     x2000     index
  <chr> <chr>    <chr>    <chr>     <chr>     <lgl>
1 ABC   "abc70"  "abc80"  "abc90"   ""        TRUE 
2 DEF   "70def"  "80def"  "90def"   "00def"   TRUE 
3 GHI   "gh70i"  ""       ""        "gh00i"   FALSE
4 JKL   "jkl70"  ""       ""        "jkl00"   FALSE
5 MNO   ""       ""       "90mno90" "00mno00" FALSE
6 PQR   "70pqr"  "80pqr"  "90pqr"   "00pqr"   TRUE 
7 STU   "stu70x" "stu80x" ""        ""        FALSE

We can easily use the new logical column to perform filtering operations.

If we want to filter directly, we can modify the function:

my_filter<-function(df, years){
  df%>%
    rowwise()%>%
    filter(if_all(num_range('x', years), ~str_detect(tolower(.x), tolower(item))))
}

my_filter(df=df, years=c(1970, 1990))

# A tibble: 3 × 5
# Rowwise: 
  item  x1970 x1980 x1990 x2000  
  <chr> <chr> <chr> <chr> <chr>  
1 ABC   abc70 abc80 abc90 ""     
2 DEF   70def 80def 90def "00def"
3 PQR   70pqr 80pqr 90pqr "00pqr"

CodePudding user response：

This was a bit unclear, but it sounds like you're looking to match the values in your data.frame with your function, which contain the last 2 digits of years, and not look at the column name which also contains years.

If that is correct, you could use rowSums to add up the partial matches found for character values in a given row. If the sum is the same as the number of "years" that you are searching for, then that row will be retained in your filter.

library(tidyverse)

year_filter <- function(data, years) {
  data %>%
    filter(rowSums(
      sapply(select(., starts_with("x")),
             function(x) grepl(paste(years, collapse = "|"), x))
    ) == length(years))
}
  
year_filter(df, c("70", "90"))

Output

  item x1970 x1980 x1990 x2000
1  ABC abc70 abc80 abc90      
2  DEF 70def 80def 90def 00def
3  PQR 70pqr 80pqr 90pqr 00pqr

Edit: Given that there may be more than one column per year, the column year should be considered in the function.

Here is an alternative. First, allow the function to use full 4 digit years, and create a variable to store the last 2 digits for searching.

Put your data into long format, and extract the years from the column names (so that 1970.a and 1970.b would be considered for 1970).

Then, grouping by each row of the data.frame, filter first by including matches with the last 2 digits of the year, and then making sure all years are found in the row. In the end, you will have row numbers of data to return based on what remains after the filter.

year_filter <- function(data, years) {
  last2dig_yrs <- substr(years, nchar(years) - 1, nchar(years))
  match_rows <- data %>%
    mutate(rn = row_number()) %>%
    pivot_longer(cols = -c(item, rn)) %>%
    mutate(year = parse_number(name)) %>%
    group_by(rn) %>%
    filter(grepl(paste(last2dig_yrs, collapse = "|"), value)) %>%
    filter(all(years %in% year)) %>%
    distinct(rn) %>%
    pull(rn)
  return(data[match_rows,])
}

I created example data with 1970.a and 1970.b columns.

Example Data

df <- structure(list(item = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", 
"STU"), x1970.a = c("abc70", "70def", "gh70i", "jkl70", "", "70pqr", 
"stu70x"), x1970.b = c("zxc70", "def", "gh70i", "70d", "70x", 
"", "stu70x"), x1980 = c("abc80", "80def", "", "", "", "80pqr", 
"stu80x"), x1990 = c("abc90", "90def", "", "", "90mno90", "90pqr", 
""), x2000 = c("", "00def", "gh00i", "jkl00", "00mno00", "00pqr", 
"")), class = "data.frame", row.names = c(NA, -7L))

year_filter(df, c("1970", "1990"))

Output

  item x1970.a x1970.b x1980   x1990   x2000
1  ABC   abc70   zxc70 abc80   abc90        
2  DEF   70def     def 80def   90def   00def
5  MNO             70x       90mno90 00mno00
6  PQR   70pqr         80pqr   90pqr   00pqr