Home > Enterprise >  R: new variable with values from another variable based on lowest value in a third variable, grouped
R: new variable with values from another variable based on lowest value in a third variable, grouped

Time:09-23

i am working with quite a large longitudinal dataset in which the same individuals are tested at more than one timepoint (in long format so that one individual has more than one row). i would like to create new variables with the prefix "first_" (for example on the variable "age", making the new variable "first_age"). i would like this first-variable to have the age from the first timepoint this individual was tested (ie the lowest value on "timepoint"). also, i want this new variable to be non-timevarying, so that the same "first age" is the value on each row for that specific individual.

hopefully i can make this clearer with some example data:


    ID <- c(1010, 1010, 1011, 1011, 1012, 1012, 1013, 1013)
    timepoint <- c(2, 5, 3, 6, 5, 6, 3, 5)
    age <- c(55, 59, 60, 67, 71, 73, 58, 64)
    first_age <- c(55, 55, 60, 60, 71, 71, 58, 58)
    cell_prop <- c(0.7, 0.1, 0.5, 0.6, 0.7, 0.4, 0.8, 0.1)
    first_cell <- c(0.7, 0.7, 0.5, 0.5, 0.7, 0.7, 0.8, 0.8)
    
    df<-data.frame(matrix(ncol = 0, nrow = length(ID)))
    df$ID <- ID
    
    df <- df %>% 
      mutate(timepoint = timepoint,
             age = age,
             cell_prop = cell_prop)

based on this, i want to create two new variables, called "first_age" and "first_cell". i want these to have the values from "age" and "cell_prop", respectively, from the lowest timepoint-value for each ID. so for ID 1010 i want the "first_age" to be 55 in both 1010-rows, and the "first_cell" to be 0.7 in both.

basically, i want it to look like this (with additional variables "first_age" and "first_cell"):

    df <- df %>%
  mutate(first_age = first_age,
         first_cell = first_cell)

does someone know how i would do this, preferably using tidyverse?

it works to do it like this:


    df <- df %>%
      group_by(ID) %>%
      mutate(first_age = min(age))

the problem here is that the new value in "first_age" is based on the lowest value in "age". it is fine for this variable (since age doesn't decrease with time), BUT, in the variable "cell_prop" there are cases when the lowest value is measured at a later timepoint, which won't work. i still want the values from the first timepoint.

i have also tried to do it like this:


    df <- df %>%
      group_by(ID) %>%
      mutate(first_age = age==min(timepoint))

but here, the new variable "first_age" only has "FALSE" as values for every row.

what i would like to do is to write it like this (in words):

df <- df %>% group_by(ID) %>% mutate(first_age = age at min(timepoint))

it feels like this shouldn't be that hard, but i can't figure out how to code this "at", indicating that i want the value in the new variable to be the same as in "age" at the lowest value in "timepoint".

can someone please help me? i am really new to R programming so just let me know if there is something missing in my question! super grateful for help

CodePudding user response:

Using fill() from the tidyr-package:

library(dplyr)
library(tidyr)

df %>% group_by(ID) %>% 
  mutate(first_age = ifelse(timepoint == min(timepoint),age,NA),
         first_cell_prop = ifelse(timepoint == min(timepoint),cell_prop,NA)) %>% 
  fill(first_age,first_cell_prop)

Which gives

# A tibble: 8 x 6
# Groups:   ID [4]
     ID timepoint   age cell_prop first_age first_cell_prop
  <dbl>     <dbl> <dbl>     <dbl>     <dbl>           <dbl>
1  1010         2    55       0.7        55             0.7
2  1010         5    59       0.1        55             0.7
3  1011         3    60       0.5        60             0.5
4  1011         6    67       0.6        60             0.5
5  1012         5    71       0.7        71             0.7
6  1012         6    73       0.4        71             0.7
7  1013         3    58       0.8        58             0.8
8  1013         5    64       0.1        58             0.8

This should work, assuming you don't have duplicate timepoints with different age and cell_prop values

CodePudding user response:

Another approach is to make a summary table of the minimum values and join them back in. I find this approach a bit safer than filling because you don't have to worry about something going wrong with the sort order of your data (easily corrected by sorting, of course).

library(dplyr)

group_mins <- df %>% 
  group_by(ID) %>% 
  slice(which.min(timepoint)) %>% 
  rename(first_age = age,
         first_cell_prop = cell_prop) %>% 
  select(-timepoint)

df %>% 
  left_join(group_mins, by = c("ID"))

#     ID timepoint age cell_prop first_age first_cell_prop
# 1 1010         2  55       0.7        55             0.7
# 2 1010         5  59       0.1        55             0.7
# 3 1011         3  60       0.5        60             0.5
# 4 1011         6  67       0.6        60             0.5
# 5 1012         5  71       0.7        71             0.7
# 6 1012         6  73       0.4        71             0.7
# 7 1013         3  58       0.8        58             0.8
# 8 1013         5  64       0.1        58             0.8
  •  Tags:  
  • r
  • Related