I am hoping someone can help me with this problem I am having working with a large dataset in R. I have a dataframe consisting of thousands of tree IDs, year of measurement, and diameter in the given measurement year. I would like to filter this dataframe so that I retain only rows predating the death of the neighboring tree. I have a second dataframe that contains tree IDs and the year the neighboring tree died, to provide the year to filter by.
Small bit of original df (4 trees only):
tree_id year diameter
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49.0
4 T1 2002 49.6
5 T1 2008 50.3
6 T1 2012 50.4
7 T1 2017 50.6
8 T2 1978 76.3
9 T2 1984 76.8
10 T2 1990 77.3
11 T2 1995 78.7
12 T2 2002 79.5
13 T2 2008 80.6
14 T2 2012 81.1
15 T2 2017 81.6
16 T3 1978 15.7
17 T3 1984 16.5
18 T3 1990 17.7
19 T3 1995 18.3
20 T3 2002 19.3
21 T3 2008 20.0
22 T3 2012 20.0
23 T3 2017 20.2
24 T4 1978 50.5
25 T4 1984 51.2
26 T4 1990 51.9
27 T4 1995 52.5
28 T4 2002 53.2
29 T4 2008 54.8
30 T4 2012 53.7
31 T4 2017 54.0
And here is the dataframe I would like to filter the above by:
tree_id neb_death
1 T1 2002
2 T2 2008
3 T3 1995
4 T4 2012
For example, I would like to only retain rows in the larger dataframe for tree_id = T1 with measurement years before 2002. I'd really appreciate any help either with base R or dplyr approaches. Thanks!
CodePudding user response:
You can use data.table to join on matching tree_id
and year < neb_death
. If the first table is df
and the second is df2
:
library(data.table)
setDT(df)
setDT(df2)
df[df2, on = .(tree_id, year < neb_death)]
#> tree_id year diameter
#> <char> <int> <num>
#> 1: T1 2002 48.2
#> 2: T1 2002 48.6
#> 3: T1 2002 49.0
#> 4: T2 2008 76.3
#> 5: T2 2008 76.8
#> 6: T2 2008 77.3
#> 7: T2 2008 78.7
#> 8: T2 2008 79.5
#> 9: T3 1995 15.7
#> 10: T3 1995 16.5
#> 11: T3 1995 17.7
#> 12: T4 2012 50.5
#> 13: T4 2012 51.2
#> 14: T4 2012 51.9
#> 15: T4 2012 52.5
#> 16: T4 2012 53.2
#> 17: T4 2012 54.8
Created on 2021-12-20 by the reprex package (v2.0.1)
Data used
df <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), row.names = c(NA,
-31L), class = "data.frame")
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), row.names = c(NA, -4L), class = "data.frame")
CodePudding user response:
We could first left_join
by tree_id
and then filter
:
library(dplyr)
left_join(df, df1, by="tree_id") %>%
filter(year < neb_death) %>%
select(-neb_death)
output:
tree_id year diameter
<chr> <int> <dbl>
1 T1 1978 48.2
2 T1 1990 48.6
3 T1 1995 49
4 T2 1978 76.3
5 T2 1984 76.8
6 T2 1990 77.3
7 T2 1995 78.7
8 T2 2002 79.5
9 T3 1978 15.7
10 T3 1984 16.5
11 T3 1990 17.7
12 T4 1978 50.5
13 T4 1984 51.2
14 T4 1990 51.9
15 T4 1995 52.5
16 T4 2002 53.2
17 T4 2008 54.8
CodePudding user response:
Using base R we could loop row-wise over df2
, identify unwanted observations in df1
and remove them.
df1[-unlist(apply(df2, 1, \(x) which(df1$tree_id == x[1] & df1$year >= x[2]))), ]
# tree_id year diameter
# 1 T1 1978 48.2
# 2 T1 1990 48.6
# 3 T1 1995 49.0
# 8 T2 1978 76.3
# 9 T2 1984 76.8
# 10 T2 1990 77.3
# 11 T2 1995 78.7
# 12 T2 2002 79.5
# 16 T3 1978 15.7
# 17 T3 1984 16.5
# 18 T3 1990 17.7
# 24 T4 1978 50.5
# 25 T4 1984 51.2
# 26 T4 1990 51.9
# 27 T4 1995 52.5
# 28 T4 2002 53.2
# 29 T4 2008 54.8
Data:
df1 <- structure(list(tree_id = c("T1", "T1", "T1", "T1", "T1", "T1",
"T1", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T2", "T3", "T3",
"T3", "T3", "T3", "T3", "T3", "T3", "T4", "T4", "T4", "T4", "T4",
"T4", "T4", "T4"), year = c(1978L, 1990L, 1995L, 2002L, 2008L,
2012L, 2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L,
2017L, 1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L,
1978L, 1984L, 1990L, 1995L, 2002L, 2008L, 2012L, 2017L), diameter = c(48.2,
48.6, 49, 49.6, 50.3, 50.4, 50.6, 76.3, 76.8, 77.3, 78.7, 79.5,
80.6, 81.1, 81.6, 15.7, 16.5, 17.7, 18.3, 19.3, 20, 20, 20.2,
50.5, 51.2, 51.9, 52.5, 53.2, 54.8, 53.7, 54)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24",
"25", "26", "27", "28", "29", "30", "31"))
df2 <- structure(list(tree_id = c("T1", "T2", "T3", "T4"), neb_death = c(2002L,
2008L, 1995L, 2012L)), class = "data.frame", row.names = c("1",
"2", "3", "4"))