I have two lists Sis1 and Sis2 with identical column names and identical first column values but different values in columns 2:n:
Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739
What I want is to have a row-wise sum of Sis1 and Sis2 for each identically named pair of columns in the two lists (ie. sum of Sis1$Nuc_11 and Sis2$Nuc_11 for each row defined by the shared Time value). I'm missing something obvious. My best guess is:
NucID <- colnames(Sis1)
SisSum <- bind_cols(Sis1, Sis2) %>%
group_by(!!!NucID) %>%
summarise(across(everything(),list("SisSum" = sum)))
but that's not correct. I'm not sworn to dplyr for the answer, so if there's a nicer approach I'll take it.
Thanks in advance for the help :)
CodePudding user response:
Since you say that they should be added "by the shared Time value", we cannot use
bind_cols
, which would require perfect alignment and shared-existence of allTime
values. (If this is always the case, then just ignore it ... and pray that your assumption is always true.) I suggest a merge/join operation.For discussion on merge/join operations, see How to join (merge) data frames (inner, outer, left, right) and What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?.
Further, joining on a floating-point is subject to a fundamental problem in programming (not just R), testing for perfect equality in floating-point numbers. To work around this, we need to determine a precision level at which two numbers are assumed to be effectively the same. (Note that often R can get floating-point intuitively correct, but there is no clear indication when it does not work.)
For more discussion on issues regarding floating-point equality, test
(0.10 0.05) == 0.15
(returns false), and then read Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754.Doing a merge/join based on common columns will be much simpler if we first pivot the data from the current wide format to a long format. (This relies on all of them being the same class,
numeric
in this case. If there is mixed-class, then this does not work.)For discussion on pivoting, see Reshaping data.frame from wide to long format (wide-to-long) and the reverse, Reshape multiple value columns to wide format (long to wide).
I'll create a similarly-shaped Sis2
, changing the numbers, and changing one of the Time
values just a little to show what happens.
set.seed(42)
Sis2 <- Sis1 %>%
mutate(across(starts_with("Nuc"), ~ runif(n(), 50, 100)))
Sis2$Time[3] <- Sis2$Time[3] 1e-5
Sis2
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 95.74030 86.82942 96.73361
# 2 839.84 96.85377 56.73333 62.77144
# 3 844.48 64.30698 82.84961 73.11464
# 4 849.12 91.52238 85.25324 97.00073
# 5 853.76 82.08728 72.88709 98.91132
# 6 858.40 75.95480 85.95561 55.87437
Despite the fact that they look similar, one of the Time
values is in fact different, and will not join properly.
Sis1$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis2$Time
# [1] 835.20 839.84 844.48 849.12 853.76 858.40
Sis1$Time == Sis2$Time
# [1] TRUE TRUE FALSE TRUE TRUE TRUE
Assuming that 0.001
is sufficient precision to determine as "effectively identical", here is working code that will work:
library(dplyr)
# library(tidyr) # pivot_*
full_join(
Sis1 %>%
mutate(Time_chr = sprintf("%0.03f", Time)) %>%
tidyr::pivot_longer(-c(Time, Time_chr)),
Sis2 %>%
mutate(Time_chr = sprintf("%0.03f", Time)) %>%
tidyr::pivot_longer(-c(Time, Time_chr)),
by = c("Time_chr", "name")) %>%
transmute(
Time = coalesce(Time.x, Time.y),
name,
values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
) %>%
tidyr::pivot_wider(Time, names_from = "name", values_from = "values") %>%
as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.7470 187.9607 186.2512
# 2 839.84 200.3492 203.0143 140.2053
# 3 844.48 207.0985 168.8299 156.4784
# 4 849.12 174.0169 175.3819 166.4995
# 5 853.76 145.7154 150.8000 203.3399
# 6 858.40 141.9953 196.4379 164.0018
(The use of %>% as.data.frame()
is purely to show all digits and rows without tibble
's convenient brief-view. It is not required. Also ... I'm generally not a fan of nesting %>%
-pipes inside function calls like that, but it shows what's being done well, I think; feel free to use temporary variables for pivoted storage if you like.)
Because we're doing a full-join, if there were any mismatches then we would see NA
values. Such as this, without the use of converting Time
to a string:
full_join(
tidyr::pivot_longer(Sis1, -Time),
tidyr::pivot_longer(Sis2, -Time),
by = c("Time", "name")
) %>%
transmute(
Time,
name,
values = rowSums(cbind(value.x, value.y), na.rm = TRUE)
) %>%
tidyr::pivot_wider(Time, names_from = "name", values_from = "values") %>%
as.data.frame()
# Time Nuc_11 Nuc_15 Nuc_16
# 1 835.20 166.74700 187.96075 186.25124
# 2 839.84 200.34921 203.01432 140.20529
# 3 844.48 142.79152 85.98028 83.36374
# 4 849.12 174.01691 175.38195 166.49954
# 5 853.76 145.71543 150.79999 203.33989
# 6 858.40 141.99531 196.43786 164.00176
# 7 844.48 64.30698 82.84961 73.11464
Notice that 844.48
in row 3 and its similar-looking "844.48" in row 7 have not been added, due to the 1e-5
difference I added. You can get R to show them with more precision (see ?options
for the 'digits'
and 'scipen'
arguments).
CodePudding user response:
library(tidyverse)
list <- list(
read_table(
" Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739"
),
read_table(
" Time Nuc_11 Nuc_15 Nuc_16
835.20 71.00670 101.13133 89.51763
839.84 103.49544 146.28099 77.43385
844.48 142.79152 85.98028 83.36374
849.12 82.49453 90.12871 69.49881
853.76 63.62815 77.91290 104.42857
858.40 66.04051 110.48225 108.12739"
)
)
do.call(rbind, list) %>%
pivot_longer(-Time) %>%
group_by(Time, name) %>%
summarise(sum = sum(value))
# A tibble: 18 × 3
# Groups: Time [6]
Time name sum
<dbl> <chr> <dbl>
1 835. Nuc_11 142.
2 835. Nuc_15 202.
3 835. Nuc_16 179.
4 840. Nuc_11 207.
5 840. Nuc_15 293.
6 840. Nuc_16 155.
7 844. Nuc_11 286.
8 844. Nuc_15 172.
9 844. Nuc_16 167.
10 849. Nuc_11 165.
11 849. Nuc_15 180.
12 849. Nuc_16 139.
13 854. Nuc_11 127.
14 854. Nuc_15 156.
15 854. Nuc_16 209.
16 858. Nuc_11 132.
17 858. Nuc_15 221.
18 858. Nuc_16 216.