Home > Back-end >  Row-wise sum of paired values in 2 identically labelled lists
Row-wise sum of paired values in 2 identically labelled lists

Time:07-25

I have two lists Sis1 and Sis2 with identical column names and identical first column values but different values in columns 2:n:

 Time    Nuc_11    Nuc_15    Nuc_16
 835.20  71.00670 101.13133  89.51763
 839.84 103.49544 146.28099  77.43385
 844.48 142.79152  85.98028  83.36374
 849.12  82.49453  90.12871  69.49881
 853.76  63.62815  77.91290 104.42857
 858.40  66.04051 110.48225 108.12739

What I want is to have a row-wise sum of Sis1 and Sis2 for each identically named pair of columns in the two lists (ie. sum of Sis1$Nuc_11 and Sis2$Nuc_11 for each row defined by the shared Time value). I'm missing something obvious. My best guess is:

NucID <- colnames(Sis1) 
SisSum <- bind_cols(Sis1, Sis2) %>%
      group_by(!!!NucID) %>%
      summarise(across(everything(),list("SisSum" = sum)))

but that's not correct. I'm not sworn to dplyr for the answer, so if there's a nicer approach I'll take it.

Thanks in advance for the help :)

CodePudding user response:

  1. Since you say that they should be added "by the shared Time value", we cannot use bind_cols, which would require perfect alignment and shared-existence of all Time values. (If this is always the case, then just ignore it ... and pray that your assumption is always true.) I suggest a merge/join operation.

    For discussion on merge/join operations, see How to join (merge) data frames (inner, outer, left, right) and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.

  2. Further, joining on a floating-point is subject to a fundamental problem in programming (not just R), testing for perfect equality in floating-point numbers. To work around this, we need to determine a precision level at which two numbers are assumed to be effectively the same. (Note that often R can get floating-point intuitively correct, but there is no clear indication when it does not work.)

    For more discussion on issues regarding floating-point equality, test (0.10 0.05) == 0.15 (returns false), and then read Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754.

  3. Doing a merge/join based on common columns will be much simpler if we first pivot the data from the current wide format to a long format. (This relies on all of them being the same class, numeric in this case. If there is mixed-class, then this does not work.)

    For discussion on pivoting, see Reshaping data.frame from wide to long format (wide-to-long) and the reverse, Reshape multiple value columns to wide format (long to wide).

I'll create a similarly-shaped Sis2, changing the numbers, and changing one of the Time values just a little to show what happens.

set.seed(42)
Sis2 <- Sis1 %>%
  mutate(across(starts_with("Nuc"), ~ runif(n(), 50, 100)))
Sis2$Time[3] <- Sis2$Time[3]   1e-5
Sis2
#     Time   Nuc_11   Nuc_15   Nuc_16
# 1 835.20 95.74030 86.82942 96.73361
# 2 839.84 96.85377 56.73333 62.77144
# 3 844.48 64.30698 82.84961 73.11464
# 4 849.12 91.52238 85.25324 97.00073
# 5 853.76 82.08728 72.88709 98.91132
# 6 858.40 75.95480 85.95561 55.87437

Despite the fact that they look similar, one of the Time values is in fact different, and will not join properly.

Sis1$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis2$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40

Sis1$Time == Sis2$Time
# [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

Assuming that 0.001 is sufficient precision to determine as "effectively identical", here is working code that will work:

library(dplyr)
# library(tidyr) # pivot_*
full_join(
    Sis1 %>%
      mutate(Time_chr = sprintf("%0.03f", Time)) %>%
      tidyr::pivot_longer(-c(Time, Time_chr)),
    Sis2 %>%
      mutate(Time_chr = sprintf("%0.03f", Time)) %>%
      tidyr::pivot_longer(-c(Time, Time_chr)),
    by = c("Time_chr", "name")) %>%
  transmute(
    Time = coalesce(Time.x, Time.y),
    name,
    values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
  ) %>%
  tidyr::pivot_wider(Time, names_from = "name", values_from = "values") %>%
  as.data.frame()
#     Time   Nuc_11   Nuc_15   Nuc_16
# 1 835.20 166.7470 187.9607 186.2512
# 2 839.84 200.3492 203.0143 140.2053
# 3 844.48 207.0985 168.8299 156.4784
# 4 849.12 174.0169 175.3819 166.4995
# 5 853.76 145.7154 150.8000 203.3399
# 6 858.40 141.9953 196.4379 164.0018

(The use of %>% as.data.frame() is purely to show all digits and rows without tibble's convenient brief-view. It is not required. Also ... I'm generally not a fan of nesting %>%-pipes inside function calls like that, but it shows what's being done well, I think; feel free to use temporary variables for pivoted storage if you like.)

Because we're doing a full-join, if there were any mismatches then we would see NA values. Such as this, without the use of converting Time to a string:

full_join(
    tidyr::pivot_longer(Sis1, -Time),
    tidyr::pivot_longer(Sis2, -Time),
    by = c("Time", "name")
  ) %>%
  transmute(
    Time, 
    name, 
    values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
  ) %>%
  tidyr::pivot_wider(Time, names_from = "name", values_from = "values") %>%
  as.data.frame()
#     Time    Nuc_11    Nuc_15    Nuc_16
# 1 835.20 166.74700 187.96075 186.25124
# 2 839.84 200.34921 203.01432 140.20529
# 3 844.48 142.79152  85.98028  83.36374
# 4 849.12 174.01691 175.38195 166.49954
# 5 853.76 145.71543 150.79999 203.33989
# 6 858.40 141.99531 196.43786 164.00176
# 7 844.48  64.30698  82.84961  73.11464

Notice that 844.48 in row 3 and its similar-looking "844.48" in row 7 have not been added, due to the 1e-5 difference I added. You can get R to show them with more precision (see ?options for the 'digits' and 'scipen' arguments).

CodePudding user response:

library(tidyverse)

list <- list(
  read_table(
    " Time    Nuc_11    Nuc_15    Nuc_16
 835.20  71.00670 101.13133  89.51763
 839.84 103.49544 146.28099  77.43385
 844.48 142.79152  85.98028  83.36374
 849.12  82.49453  90.12871  69.49881
 853.76  63.62815  77.91290 104.42857
 858.40  66.04051 110.48225 108.12739"
  ),
  read_table(
    " Time    Nuc_11    Nuc_15    Nuc_16
 835.20  71.00670 101.13133  89.51763
 839.84 103.49544 146.28099  77.43385
 844.48 142.79152  85.98028  83.36374
 849.12  82.49453  90.12871  69.49881
 853.76  63.62815  77.91290 104.42857
 858.40  66.04051 110.48225 108.12739"
  )
)

do.call(rbind, list) %>% 
  pivot_longer(-Time) %>% 
  group_by(Time, name) %>% 
  summarise(sum = sum(value))

# A tibble: 18 × 3
# Groups:   Time [6]
    Time name     sum
   <dbl> <chr>  <dbl>
 1  835. Nuc_11  142.
 2  835. Nuc_15  202.
 3  835. Nuc_16  179.
 4  840. Nuc_11  207.
 5  840. Nuc_15  293.
 6  840. Nuc_16  155.
 7  844. Nuc_11  286.
 8  844. Nuc_15  172.
 9  844. Nuc_16  167.
10  849. Nuc_11  165.
11  849. Nuc_15  180.
12  849. Nuc_16  139.
13  854. Nuc_11  127.
14  854. Nuc_15  156.
15  854. Nuc_16  209.
16  858. Nuc_11  132.
17  858. Nuc_15  221.
18  858. Nuc_16  216.
  • Related