Home > Software engineering >  Assign values to a df$column with values calculated in another df in R
Assign values to a df$column with values calculated in another df in R

Time:06-14

Codes for the dfs are at the end.

I have two dataframes. The first df is meteo data from 3 different stations :

  site     date        temp
   X    2021-01-01      14
   X    2021-01-02      NA
   X    2021-01-03      10
   X    2021-01-04      14
   X    2021-01-05      10
   X    2021-01-06      10
   X    2021-01-07      13
   X    2021-01-08      12
   X    2021-01-09      13
   X    2021-01-10       7
   X    2021-01-11       9
   X    2021-01-12       6
   X    2021-01-13       8
   Y    2021-01-01      10
   Y    2021-01-02      14
   Y    2021-01-03       5
   Y    2021-01-04       7
   Y    2021-01-05       7
   Y    2021-01-06       9
   Y    2021-01-07       6
   Y    2021-01-08      12
   Y    2021-01-09      10
   Y    2021-01-10       9
   Y    2021-01-11      13
   Y    2021-01-12      13
   Y    2021-01-13      NA
   Y    2021-01-14       8
   Y    2021-01-15      11
   Y    2021-01-16       5
   Y    2021-01-17      11
   Y    2021-01-18      13
   Y    2021-01-19      11
   Y    2021-01-20       9
   Y    2021-01-21       9
   Y    2021-01-22       5
   Y    2021-01-23       6
   Y    2021-01-24      14
   Y    2021-01-25      10
   Y    2021-01-26       7
   Z    2021-01-01       9
   Z    2021-01-02      NA
   Z    2021-01-03      12
   Z    2021-01-04       6
   Z    2021-01-05       5
   Z    2021-01-06       7
   Z    2021-01-07       7
   Z    2021-01-08       5
   Z    2021-01-09       7
   Z    2021-01-10       7
   Z    2021-01-11      15
   Z    2021-01-12       8
   Z    2021-01-13       5
   Z    2021-01-14       6
   Z    2021-01-15       5
   Z    2021-01-16      12
   Z    2021-01-17       8
   Z    2021-01-18       7
   Z    2021-01-19       6
   Z    2021-01-20      13
   Z    2021-01-21      14
   Z    2021-01-22       8
   Z    2021-01-23      11
   Z    2021-01-24       7

The second df consists of observations made on the same site than the meteo stations. There is a trap at each station. Every couple days, the trap is emptied and the different species that were trapped are counted separately. For each site in df2, the date of pose is always the day after the date of withdrawal of the precedent entree (row). In this exemple, the species are in the obs column. They are named A, B, C, D, F and G. freq is the number of individuals that were trapped for that specie.

  site    pose        withdrawal    obs    freq
   X    2021-01-01    2021-01-03      A      31
   X    2021-01-01    2021-01-03      B      42
   X    2021-01-04    2021-01-05      A      14
   X    2021-01-06    2021-01-13      D      16
   X    2021-01-06    2021-01-13      F      36
   Y    2021-01-01    2021-01-04      G      49
   Y    2021-01-01    2021-01-04      A      29
   Y    2021-01-01    2021-01-04      C      45
   Y    2021-01-05    2021-01-14      D      25
   Y    2021-01-05    2021-01-14      A      50
   Y    2021-01-15    2021-01-14      B      40
   Y    2021-01-19    2021-01-26      B      39
   Z    2021-01-01    2021-01-03      C      25
   Z    2021-01-04    2021-01-05      F       3
   Z    2021-01-04    2021-01-05      B      16
   Z    2021-01-06    2021-01-14      C      19
   Z    2021-01-15    2021-01-19      A      12
   Z    2021-01-15    2021-01-19      B      26
   Z    2021-01-15    2021-01-19      F       2
   Z    2021-01-20    2021-01-24      A      24

I want to add a mean_T column to df2 where I would store the mean temperature for each entree in df2.

For ID = 1, the mean temperature would be calculated with the entrees 2021-01-01, 2021-01-02 and 2021-01-03 in df1, where site = 'X'.

With simpler dfs, I used this code the get the mean temperature. It works if I only have one entree per date, per site in df2, which is not the case.

df1 <- split(df1, with(df1, site), subset(df1, select = -site) )
df1 <- lapply(df1, function(x) x[(names(x) %in% c("ID", "date", "temp"))])

df2 <- split(df2, with(df2, site), subset(df2, select = -site) )
df2 <- lapply(df2, function(x) x[(names(x) %in% c("ID", "pose", "withdrawal"))])

 library(dplyr)
 library(tidyr)

Then, this code gave me the mean temperature. Credits go to @TarJae :

 mean_X <- df2$X %>% 
      pivot_longer(-ID, values_to = "date") %>% 
       full_join(df1$X, by= "date") %>% 
      arrange(date) %>% 
      fill(ID, .direction = "down") %>% 
       group_by(ID) %>% 
      summarise(mean_T = mean(temp, na.rm = TRUE)) %>% 
      left_join(df2$X, by="ID")

This chunk of code also worked credits go to @Jon Spring :

df2 %>%
    mutate(days = (withdrawal - pose   1) %>% as.integer) %>%
    tidyr::uncount(days, .id = "row") %>%
    transmute(ID, date = pose   row - 1) %>%
    left_join(df1) %>%
    group_by(ID) %>%
    summarize(mean_T = mean(temp)) %>% 
    right_join(df2)

Here is the code to generate the dfs :
df1 <- data.frame( site = c(rep('X', 13), rep('Y', 26), rep('Z', 24) ) ,
                     date = c( seq( as.Date("2021-01-01"), by="day", length.out=13 ),
                               seq( as.Date("2021-01-01"), by="day", length.out=26 ),
                               seq( as.Date("2021-01-01"), by="day", length.out=24 )) , 
                     temp = c(14, NA,   10, 14, 10, 10, 13, 12, 13, 7,  9,  6,  8,  10, 14, 5,  7,  7,  9,  6,  12,
                              10,   9,  13, 13, NA, 8,  11, 5,  11, 13, 11, 9,  9,  5,  6,  14, 10, 7,  9,  NA, 12, 
                               6,   5,  7,  7,  5,  7,  7,  15, 8,  5,  6, 5,   12, 8,  7,  6,  13, 14, 8,  11, 7) ) 

df2 <- data.frame( site = c( rep('X', 5), rep('Y', 7), rep('Z', 8) ) , 
                   pose = as.Date( c("2021-01-01", "2021-01-01", "2021-01-04", "2021-01-06", 
                                     "2021-01-06", "2021-01-01", "2021-01-01", "2021-01-01", 
                                     "2021-01-05", "2021-01-05", "2021-01-15", "2021-01-19" ,
                                     "2021-01-01", "2021-01-04", "2021-01-04", "2021-01-06",
                                     "2021-01-15", "2021-01-15", "2021-01-15", "2021-01-20") ) ,
                   withdrawal = as.Date( c( "2021-01-03", "2021-01-03", "2021-01-05", "2021-01-13", 
                                            "2021-01-13", "2021-01-04", "2021-01-04", "2021-01-04", 
                                            "2021-01-14", "2021-01-14", "2021-01-14", "2021-01-26" ,
                                            "2021-01-03", "2021-01-05", "2021-01-05", "2021-01-14",
                                            "2021-01-19", "2021-01-19", "2021-01-19", "2021-01-24" ) ) , 
                   obs = c( 'A', 'B', 'A', 'D', 'F', 'G', 'A', 'C', 'D', 'A', 'B', 'B' , 
                            'C', 'F', 'B', 'C', 'A', 'B', 'F', 'A') ,
                   freq = c(31, 42, 14, 16, 36, 49, 29, 45, 25, 50, 40, 39, 25, 3, 16, 19, 12, 26, 2, 24) ) 
df2 <- cbind(ID = 1:nrow(df2), df2)
English is not my first language. If something doesn't make sense, fell free to let me know in the comments.

CodePudding user response:

First I expand df2 to make a dataset with one row per day

df3 <- do.call(rbind,by(df2, 
   list(df2$ID), 
   function(d) data.frame(d,dates=d$pose:d$withdrawal)))

Now I merge df1 into this new dataset. I first need to convert the date to a numeric to match df3

df1$dates <- as.numeric(df1$date)
df4 <- merge(df1, df3,by=c("site", "dates"))

Now I can aggregate the new dataset by taking the mean temp over each day

aggregate(data=df4, temp ~ freq   site   obs   pose   withdrawal  ID, mean)      


   freq site obs       pose withdrawal ID      temp
1    31    X   A 2021-01-01 2021-01-03  1 12.000000
2    42    X   B 2021-01-01 2021-01-03  2 12.000000
3    14    X   A 2021-01-04 2021-01-05  3 12.000000
4    16    X   D 2021-01-06 2021-01-13  4  9.750000
5    36    X   F 2021-01-06 2021-01-13  5  9.750000
6    49    Y   G 2021-01-01 2021-01-04  6  9.000000
7    29    Y   A 2021-01-01 2021-01-04  7  9.000000
8    45    Y   C 2021-01-01 2021-01-04  8  9.000000
9    25    Y   D 2021-01-05 2021-01-14  9  9.666667
10   50    Y   A 2021-01-05 2021-01-14 10  9.666667
11   40    Y   B 2021-01-15 2021-01-14 11  9.500000
12   39    Y   B 2021-01-19 2021-01-26 12  8.875000
13   25    Z   C 2021-01-01 2021-01-03 13 10.500000
14    3    Z   F 2021-01-04 2021-01-05 14  5.500000
15   16    Z   B 2021-01-04 2021-01-05 15  5.500000
16   19    Z   C 2021-01-06 2021-01-14 16  7.444444
17   12    Z   A 2021-01-15 2021-01-19 17  7.600000
18   26    Z   B 2021-01-15 2021-01-19 18  7.600000
19    2    Z   F 2021-01-15 2021-01-19 19  7.600000
20   24    Z   A 2021-01-20 2021-01-24 20 10.600000
  • Related