Home > Software engineering >  Duplicate values when using inner_join()
Duplicate values when using inner_join()

Time:10-11

I am trying to use a t.test() to compare means for multiple columns in a dataframe where the values to compare are within each column. I have several columns of metadata (Date,Assay,Timing) for each row. My data looks like df below where the data collected is not paired and meas1 and meas2 are different measured outcomes that are not related. The comparison I am trying to make is between meas1[Timing=="Start"] and meas1[Timing == "End"] on each date, for each assay, and each test. My actual data has ~10 columns of measurement data which affects my syntax for some of the subsetting.

library(tidyverse)

df <- data.frame(Date=rep(c("2022-01-01","2022-01-02"), each = 18),
    Assay = rep(c("Gly", "Asp", "Con"), each = 3, times = 4),
    Timing = c(rep("Start",9),rep("End",9)),
    meas1=round(rnorm(36,5,3),0),
    meas2=round(rnorm(36,8,9),0))

I have tried a couple of different approaches. One is to attempt to use inner_join() and pivot_longer() to join the data together using a separate dataframe of the metadata, but I am not getting the desired outcome.

comp <- list(Assay = c("Gly","Asp","Con"),
    first = "Start",
    last = "End",
    test = names(df %>% select(-Date,-Assay,-Timing))) %>%
    cross_df()

df_pivot <- df %>%
    pivot_longer(c(-Date,-Assay,-Timing), names_to = "test")

t_tests <- comp %>%
    inner_join(df_pivot, by = c("Assay", "test", "first"="Timing")) %>%
    rename(initial = value) %>%
    inner_join(df_pivot, by = c("Date", "Assay", "test", "last"="Timing")) %>%
    rename(final = value)

t_tests

# A tibble: 108 × 7
   Assay first last  test  Date       initial final
   <chr> <chr> <chr> <chr> <chr>        <dbl> <dbl>
 1 Gly   Start End   meas1 2022-01-01       8     8
 2 Gly   Start End   meas1 2022-01-01       8     9
 3 Gly   Start End   meas1 2022-01-01       8     4
 4 Gly   Start End   meas1 2022-01-01       4     8
 5 Gly   Start End   meas1 2022-01-01       4     9
 6 Gly   Start End   meas1 2022-01-01       4     4
 7 Gly   Start End   meas1 2022-01-01      -1     8
 8 Gly   Start End   meas1 2022-01-01      -1     9
 9 Gly   Start End   meas1 2022-01-01      -1     4
10 Gly   Start End   meas1 2022-01-02       6     1
# … with 98 more rows
# ℹ Use `print(n = ...)` to see more rows

The initial values are repeated for each distinct final value, which is not what I want since the data aren't paired. I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns). In other words, rows 1:9 should be condensed to 3 rows (row 1, 5, and 9) containing only the unique initial and final values. This is where I need help. The 1,5,9 pattern should repeat but I was hoping to avoid slicing the data after the fact.

Assuming that part was done correctly, I'd proceed as follows, which gives me the summary tibble of t.test() outcomes that I want:


t_tests <- t_tests %>%
    mutate(first = NULL, last = NULL) %>%
    group_by(Date,Assay,test) %>%
    group_modify(~broom::tidy(t.test(.x$initial,.x$final))) %>% ungroup()

t_tests
# A tibble: 12 × 13
   Date       Assay test    estimate estimate1 estimate2 statistic   p.value parameter conf.low conf.high method                  alternative
   <chr>      <chr> <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl> <chr>                   <chr>      
 1 2022-01-01 Asp   values1   -2.33       5.67     8        -1.79  0.0989        11.7   -5.18       0.511 Welch Two Sample t-test two.sided  
 2 2022-01-01 Asp   values2    9.33       6.67    -2.67      2.14  0.0643         8.16  -0.698     19.4   Welch Two Sample t-test two.sided  
 3 2022-01-01 Con   values1    3.67       6.67     3         2.17  0.0552        10.1   -0.0984     7.43  Welch Two Sample t-test two.sided  
 4 2022-01-01 Con   values2   -8.33       2.67    11        -2.93  0.0110        14.1  -14.4       -2.23  Welch Two Sample t-test two.sided  
 5 2022-01-01 Gly   values1    0.333      5        4.67      0.343 0.737         13.1   -1.76       2.43  Welch Two Sample t-test two.sided  
 6 2022-01-01 Gly   values2   -0.333      5.67     6        -0.100 0.922         11.5   -7.63       6.96  Welch Two Sample t-test two.sided  
 7 2022-01-02 Asp   values1    2          6        4         1.36  0.193         16     -1.12       5.12  Welch Two Sample t-test two.sided  
 8 2022-01-02 Asp   values2   11         11.7      0.667     2.02  0.0731         9.27  -1.26      23.3   Welch Two Sample t-test two.sided  
 9 2022-01-02 Con   values1   -2          4.33     6.33     -1.75  0.0999        15.4   -4.43       0.429 Welch Two Sample t-test two.sided  
10 2022-01-02 Con   values2   11         11.3      0.333     5.64  0.0000761     13.2    6.79      15.2   Welch Two Sample t-test two.sided  
11 2022-01-02 Gly   values1   -2.33       3        5.33     -4.43  0.000594      13.8   -3.47      -1.20  Welch Two Sample t-test two.sided  
12 2022-01-02 Gly   values2    1          6        5         0.267 0.793         14.5   -7.00       9.00  Welch Two Sample t-test two.sided  


Thanks in advance!

CodePudding user response:

You need to add a run_id within each date/assay/timing group so that you can match use that as a join criteria to avoid the duplication.

There's clue, when you say

I am trying to get only 36 rows: 2 dates, 3 assays, 2 tests, 6 values for each test (3 values by 2 columns)

You have a date column with 2 unique dates, an assay column with 3 unique assays, a test column with 2 unique tests.... you also need a column with the 3 unique values for your "3 values by 2 columns". I'll call the column run_id.

I'll also skip the comp data frame, and essentially do a self-join:

pivot2 = df %>%
  group_by(Date, Assay, Timing) %>%
  mutate(run_id = row_number()) %>%
  ungroup() %>%
  pivot_longer(starts_with("meas"), names_to = "test") 

t_tests = 
  full_join(
    filter(pivot2, Timing == "Start") %>% select(-Timing, initial = value),
    filter(pivot2, Timing == "End") %>% select(-Timing, final = value),
    by = c("Date", "Assay", "run_id", "test")
  )
# # A tibble: 36 × 6
#    Date       Assay run_id test  initial final
#    <chr>      <chr>  <int> <chr>   <dbl> <dbl>
#  1 2022-01-01 Gly        1 meas1       1    -1
#  2 2022-01-01 Gly        1 meas2       4     7
#  3 2022-01-01 Gly        2 meas1       0     1
#  4 2022-01-01 Gly        2 meas2      10     8
#  5 2022-01-01 Gly        3 meas1       8     5
#  6 2022-01-01 Gly        3 meas2     -16     4
#  7 2022-01-01 Asp        1 meas1       6     7
#  8 2022-01-01 Asp        1 meas2      28    -5
#  9 2022-01-01 Asp        2 meas1       4     6
# 10 2022-01-01 Asp        2 meas2       9     9
# # … with 26 more rows
# # ℹ Use `print(n = ...)` to see more rows

I use a full_join so that even if one data/assay/timing combo has a different number of runs, everything will still be included.

  • Related