Home > Enterprise >  How to select only those observations that are available in both measurements?
How to select only those observations that are available in both measurements?

Time:09-16

I have a dataset which has tree measurements at time t1 and t2. These trees are identified by state, county, plot and tree number. There are some trees that have died and some have grown in the time interval between t1 and t2.

State   County   Plot    Tree     Meas_yr
1       9        1       1        t1 
1       9        1       2        t1
1       9        1       3        t1
1       9        1       1        t2
1       9        1       2        t2
1       9        1       4        t2
1       9        1       5        t2    

I am trying to separate out the trees the new grown trees from the trees that were present in both t1 and t2 and the trees that have died in t2. I am trying to create something like below.

State   County   Plot    Tree     Meas_yr  tree_survival
1       9        1       1        t1       1
1       9        1       2        t1       1
1       9        1       3        t1       0
1       9        1       1        t2       1
1       9        1       2        t2       1

I would really appreciate the help. Thanks in advance.

CodePudding user response:

You may try this -

library(dplyr)

df %>%
  group_by(State, County, Plot, Tree) %>%
  dplyr::filter(n() > 1 | Meas_yr == "t1") %>%
  mutate(tree_survival = as.integer(n() > 1)) %>%
  ungroup()

#   State County  Plot  Tree Meas_yr tree_survival
#  <int>  <int> <int> <int> <chr>           <int>
#1     1      9     1     1 t1                  1
#2     1      9     1     2 t1                  1
#3     1      9     1     3 t1                  0
#4     1      9     1     1 t2                  1
#5     1      9     1     2 t2                  1

This keeps all the rows for "t1" year and also the trees that are present in both the year. tree_survival is assigned 1 if the tree is present in both the years or 0 otherwise.

data

df <- structure(list(State = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), County = c(9L, 
9L, 9L, 9L, 9L, 9L, 9L), Plot = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    Tree = c(1L, 2L, 3L, 1L, 2L, 4L, 5L), Meas_yr = c("t1", "t1", 
    "t1", "t2", "t2", "t2", "t2")), class = "data.frame", row.names = c(NA, -7L))

CodePudding user response:

library(tidyverse)

data <- tribble(
~State,   ~County,   ~Plot,    ~Tree,     ~Meas_yr,
1,       9,        1,       1,        't1', 
1,       9,        1,       2,        't1',
1,       9,        1,       3,        't1',
1,       9,        1,       1,        't2',
1,       9,        1,       2,        't2',
1,       9,        1,       4,        't2',
1,       9,        1,       5,        't2'  
)

new_df <- data %>%
    group_by(Tree) %>%
    mutate(tree_survival = n() - 1) %>%
    ungroup() %>%
    filter(!(Meas_yr == 't2' & tree_survival == 0))
  • Related