I'm working with a dataset that shows which variables exist in each year. I need a function that will allow me to specify certain years and then return the variables that appear in ALL of those years. In other words, if specify years "70" and "90", I want the function to find all rows with variables in both of those years (based on the variable names, not the column names).
This post gave me the idea to paste all of the year columns together so I only have to apply grepl() to one column. As you can see below, the only way I could figure to match ALL years was to run the for
loop twice, once to identify any matches on year and a second time to remove any non-matches. Is there a simpler way to do this?
I'm relatively new to R (and this is my first post here!) so I appreciate any ideas.
# testing data frame (actual data has many more items and years)
item <- c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR", "STU")
x1970 <- c("abc70", "70def", "gh70i", "jkl70", "", "70pqr", "stu70x")
x1980 <- c("abc80", "80def", "", "", "", "80pqr", "stu80x")
x1990 <- c("abc90", "90def", "", "", "90mno90", "90pqr", "")
x2000 <- c("", "00def", "gh00i", "jkl00", "00mno00", "00pqr", "")
df <- data.frame(item, x1970, x1980, x1990, x2000)
df
item x1970 x1980 x1990 x2000
1 ABC abc70 abc80 abc90
2 DEF 70def 80def 90def 00def
3 GHI gh70i gh00i
4 JKL jkl70 jkl00
5 MNO 90mno90 00mno00
6 PQR 70pqr 80pqr 90pqr 00pqr
7 STU stu70x stu80x
year_filter_test <- function(data, year)
{
# paste all year columns together in "search_columns" so I can apply grepl() to that one column
data2 <- data %>% mutate(search_columns = paste(x1970, x1980, x1990, x2000) )
# create "include" column which will be TRUE if I want to include this row
data2$include <- rep( "", nrow(data2))
# loop through each year, making include column TRUE if there's a match on ANY year
for (i in year)
{ data2$include [ grepl(i, data2$search_columns) ] <- TRUE
}
# loop through each year again, making include column FALSE if there's NOT a match on ANY year
# the effect of these two loops is to only include columns that match ALL years
for (i in year)
{ data2$include [ !grepl(i, data2$search_columns) ] <- FALSE
}
selected.years <- data2 %>% filter(include==TRUE)
return(selected.years)
}
CodePudding user response:
Something like this?
library(dplyr)
library(tidyr)
my_function <- function(df, x){
df %>%
pivot_longer(
cols=-item
) %>%
mutate(year = parse_number(value)) %>%
pivot_wider(
names_from = name
) %>%
filter(year == {{x}})
}
my_function(df, c("70", "90"))
item year x1970 x1980 x1990 x2000
<chr> <dbl> <chr> <chr> <chr> <chr>
1 ABC 70 abc70 NA NA NA
2 DEF 70 70def NA NA NA
3 GHI 70 gh70i NA NA NA
4 MNO 90 NA NA 90mno90 NA
5 PQR 90 NA NA 90pqr NA
CodePudding user response:
Using across
library(dplyr)
library(stringr)
my_func <- function(dat, years) {
dat %>%
filter(across(num_range('x', years), ~ str_remove_all(., "\\d ") == tolower(item)))
}
-testing
> my_func(df, c('1970', "1990"))
item x1970 x1980 x1990 x2000
1 ABC abc70 abc80 abc90
2 DEF 70def 80def 90def 00def
3 PQR 70pqr 80pqr 90pqr 00pqr
CodePudding user response:
We can use dplyr with rowwise
, if_all
, and stringr::str_detect
:
library(dplyr)
library(stringr)
my_func<-function(df, years){
df%>%
rowwise()%>%
mutate(index=if_all(num_range('x', years), ~str_detect(tolower(.x), tolower(item))))
}
my_func(df, c(1970, 1990))
# A tibble: 7 × 6
# Rowwise:
item x1970 x1980 x1990 x2000 index
<chr> <chr> <chr> <chr> <chr> <lgl>
1 ABC "abc70" "abc80" "abc90" "" TRUE
2 DEF "70def" "80def" "90def" "00def" TRUE
3 GHI "gh70i" "" "" "gh00i" FALSE
4 JKL "jkl70" "" "" "jkl00" FALSE
5 MNO "" "" "90mno90" "00mno00" FALSE
6 PQR "70pqr" "80pqr" "90pqr" "00pqr" TRUE
7 STU "stu70x" "stu80x" "" "" FALSE
We can easily use the new logical column to perform filtering operations.
If we want to filter directly, we can modify the function:
my_filter<-function(df, years){
df%>%
rowwise()%>%
filter(if_all(num_range('x', years), ~str_detect(tolower(.x), tolower(item))))
}
my_filter(df=df, years=c(1970, 1990))
# A tibble: 3 × 5
# Rowwise:
item x1970 x1980 x1990 x2000
<chr> <chr> <chr> <chr> <chr>
1 ABC abc70 abc80 abc90 ""
2 DEF 70def 80def 90def "00def"
3 PQR 70pqr 80pqr 90pqr "00pqr"
CodePudding user response:
This was a bit unclear, but it sounds like you're looking to match the values in your data.frame with your function, which contain the last 2 digits of years, and not look at the column name which also contains years.
If that is correct, you could use rowSums
to add up the partial matches found for character values in a given row. If the sum is the same as the number of "years" that you are searching for, then that row will be retained in your filter.
library(tidyverse)
year_filter <- function(data, years) {
data %>%
filter(rowSums(
sapply(select(., starts_with("x")),
function(x) grepl(paste(years, collapse = "|"), x))
) == length(years))
}
year_filter(df, c("70", "90"))
Output
item x1970 x1980 x1990 x2000
1 ABC abc70 abc80 abc90
2 DEF 70def 80def 90def 00def
3 PQR 70pqr 80pqr 90pqr 00pqr
Edit: Given that there may be more than one column per year, the column year should be considered in the function.
Here is an alternative. First, allow the function to use full 4 digit years, and create a variable to store the last 2 digits for searching.
Put your data into long format, and extract the years from the column names (so that 1970.a and 1970.b would be considered for 1970).
Then, grouping by each row of the data.frame, filter first by including matches with the last 2 digits of the year, and then making sure all years are found in the row. In the end, you will have row numbers of data to return based on what remains after the filter.
year_filter <- function(data, years) {
last2dig_yrs <- substr(years, nchar(years) - 1, nchar(years))
match_rows <- data %>%
mutate(rn = row_number()) %>%
pivot_longer(cols = -c(item, rn)) %>%
mutate(year = parse_number(name)) %>%
group_by(rn) %>%
filter(grepl(paste(last2dig_yrs, collapse = "|"), value)) %>%
filter(all(years %in% year)) %>%
distinct(rn) %>%
pull(rn)
return(data[match_rows,])
}
I created example data with 1970.a and 1970.b columns.
Example Data
df <- structure(list(item = c("ABC", "DEF", "GHI", "JKL", "MNO", "PQR",
"STU"), x1970.a = c("abc70", "70def", "gh70i", "jkl70", "", "70pqr",
"stu70x"), x1970.b = c("zxc70", "def", "gh70i", "70d", "70x",
"", "stu70x"), x1980 = c("abc80", "80def", "", "", "", "80pqr",
"stu80x"), x1990 = c("abc90", "90def", "", "", "90mno90", "90pqr",
""), x2000 = c("", "00def", "gh00i", "jkl00", "00mno00", "00pqr",
"")), class = "data.frame", row.names = c(NA, -7L))
year_filter(df, c("1970", "1990"))
Output
item x1970.a x1970.b x1980 x1990 x2000
1 ABC abc70 zxc70 abc80 abc90
2 DEF 70def def 80def 90def 00def
5 MNO 70x 90mno90 00mno00
6 PQR 70pqr 80pqr 90pqr 00pqr