Hopefully this is straightforward and I'm just overthinking it. I have a matrix of peak counts from mass spec (MS), where rows are peaks and columns are samples. Each sampling location has several sampling sites, and I would like to sum the counts across sites within each location.
For example, one location with three replicates is identified as "S19S_0010_Sed_Field_ICR.D_p2", "S19S_0010_Sed_Field_ICR.M_p2", and "S19S_0010_Sed_Field_ICR.U_p2": the same location, sampled downstream (D), midstream (M), and upstream (U). The first two samples each have one count of a specific peak, so I would like to merge the three samples into a single column, "S19S_0010_Sed_Field_ICR.all_p2", with two counts of that peak. Example dataset:
> dput(data.sed.ex)
structure(list(S19S_0004_Sed_Field_ICR.M_p15 = c(0, 0, 0, 0,
0, 0, 0, 0, 0, 0), S19S_0006_Sed_Field_ICR.D_p2 = c(0, 0, 0,
0, 0, 0, 1, 1, 0, 0), S19S_0006_Sed_Field_ICR.M_p2 = c(0, 0,
0, 0, 0, 0, 1, 0, 0, 0), S19S_0006_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 1, 0, 0), S19S_0008_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 0, 1, 0, 0), S19S_0009_Sed_Field_ICR.M_p2 = c(0,
0, 1, 0, 0, 0, 1, 0, 0, 0), S19S_0009_Sed_Field_ICR.U_p2 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.D_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.M_p15 = c(0,
0, 0, 0, 0, 0, 1, 0, 0, 0), S19S_0010_Sed_Field_ICR.U_p15 = c(0,
0, 0, 0, 0, 0, 0, 0, 0, 0)), row.names = c("200.002276", "200.015107",
"200.0564158", "200.0565393", "200.0578394", "200.0677581", "200.092796",
"200.1291723", "200.1292836", "200.9238455"), class = "data.frame")
TIA
Answer:
Maybe wrangling to a long format can help. In that format you can summarize by group (e.g. by sample, or by sample and location) using sum(), mean(), sd(), among others.
Hope this helps.
Convert to long format
## dd is the `data.sed.ex` object above
> library(tidyverse)
> ddLong <- dd %>%
    rownames_to_column(var = "peak") %>%                          ## rownames to column
    pivot_longer(cols = matches("^S")) %>%                        ## pivot longer
    mutate(sample = gsub("(.*)\\.(.*)", "\\1", name),             ## pull sample info
           location = gsub("(.*)\\.([DMU])_(.*)", "\\2", name),   ## pull D M U
           p = gsub("(.*)\\.([DMU])_(p.*)", "\\3", name),         ## get p2, p15
           peak = as.numeric(peak))                               ## coerce peak to numeric
> ddLong
# A tibble: 100 × 6
    peak name                          value sample               location p
   <dbl> <chr>                         <dbl> <chr>                <chr>    <chr>
 1  200. S19S_0004_Sed_Field_ICR.M_p15     0 S19S_0004_Sed_Field… M        p15
 2  200. S19S_0006_Sed_Field_ICR.D_p2      0 S19S_0006_Sed_Field… D        p2
 3  200. S19S_0006_Sed_Field_ICR.M_p2      0 S19S_0006_Sed_Field… M        p2
 4  200. S19S_0006_Sed_Field_ICR.U_p2      0 S19S_0006_Sed_Field… U        p2
 5  200. S19S_0008_Sed_Field_ICR.M_p15     0 S19S_0008_Sed_Field… M        p15
 6  200. S19S_0009_Sed_Field_ICR.M_p2      0 S19S_0009_Sed_Field… M        p2
 7  200. S19S_0009_Sed_Field_ICR.U_p2      0 S19S_0009_Sed_Field… U        p2
 8  200. S19S_0010_Sed_Field_ICR.D_p15     0 S19S_0010_Sed_Field… D        p15
 9  200. S19S_0010_Sed_Field_ICR.M_p15     0 S19S_0010_Sed_Field… M        p15
10  200. S19S_0010_Sed_Field_ICR.U_p15     0 S19S_0010_Sed_Field… U        p15
# … with 90 more rows
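To see what the three gsub() patterns above pull out of a column name, here is a small base R sketch run on two names from the example plus one made-up name without a dot (that third name is hypothetical, added only to show the failure mode). One thing to watch: when a pattern does not match, gsub() returns the input unchanged, so a malformed name would silently end up in every derived column. The grepl() check is my own suggestion, not something the pipeline above does.

```r
# Two real column names from the example, plus one hypothetical malformed name
name <- c("S19S_0006_Sed_Field_ICR.D_p2",
          "S19S_0010_Sed_Field_ICR.U_p15",
          "S19S_0011_Sed_Field_ICR_p2")   # hypothetical: no ".D_" / ".M_" / ".U_" part

# Same patterns as in the mutate() call above
sample_id <- gsub("(.*)\\.(.*)", "\\1", name)           # everything before the dot
location  <- gsub("(.*)\\.([DMU])_(.*)", "\\2", name)   # D, M or U
p         <- gsub("(.*)\\.([DMU])_(p.*)", "\\3", name)  # p2, p15, ...

# Flag names that do not follow the expected "<sample>.<D|M|U>_<p...>" layout
matched <- grepl("(.*)\\.([DMU])_(p.*)", name)

parts <- data.frame(name, sample_id, location, p, matched,
                    stringsAsFactors = FALSE)
print(parts)
```

For the first two names this gives sample_id "S19S_0006_Sed_Field_ICR" / "S19S_0010_Sed_Field_ICR", location "D" / "U", and p "p2" / "p15". For the malformed name all three columns just repeat the full name and matched is FALSE, so filtering on matched before summarising may be worth it.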
Summarize by one or more groups
> ## summarise using group_by verbs
> ddLong %>%
    group_by(sample, location) %>%
    summarise(n = n(),
              sum.value = sum(value),
              mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 10 × 5
# Groups:   sample [5]
   sample                  location     n sum.value mean.peak
   <chr>                   <chr>    <int>     <dbl>     <dbl>
 1 S19S_0004_Sed_Field_ICR M           10         0      200.
 2 S19S_0006_Sed_Field_ICR D           10         2      200.
 3 S19S_0006_Sed_Field_ICR M           10         1      200.
 4 S19S_0006_Sed_Field_ICR U           10         2      200.
 5 S19S_0008_Sed_Field_ICR M           10         1      200.
 6 S19S_0009_Sed_Field_ICR M           10         2      200.
 7 S19S_0009_Sed_Field_ICR U           10         1      200.
 8 S19S_0010_Sed_Field_ICR D           10         1      200.
 9 S19S_0010_Sed_Field_ICR M           10         1      200.
10 S19S_0010_Sed_Field_ICR U           10         0      200.
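A side note on the `summarise()` message in the output above: it only says that the result is still grouped by the first grouping variable. If you would rather get a plain, ungrouped tibble (and no message), passing `.groups = "drop"` should do it. A minimal sketch on a tiny made-up long table (the `long` data frame and its values are hypothetical, not taken from the question):

```r
library(dplyr)

# Hypothetical long-format data, standing in for ddLong
long <- data.frame(
  sample   = c("A", "A", "B", "B"),
  location = c("D", "M", "D", "D"),
  value    = c(1, 0, 2, 1),
  stringsAsFactors = FALSE
)

out <- long %>%
  group_by(sample, location) %>%
  summarise(sum.value = sum(value), .groups = "drop")  # drop all grouping afterwards

print(out)
print(is_grouped_df(out))  # FALSE: the result carries no grouping
```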
> ddLong %>%
    group_by(sample, p) %>%
    summarise(n = n(),
              sum.value = sum(value),
              mean.peak = mean(peak))
`summarise()` has grouped output by 'sample'. You can override using the `.groups` argument.
# A tibble: 5 × 5
# Groups:   sample [5]
  sample                  p         n sum.value mean.peak
  <chr>                   <chr> <int>     <dbl>     <dbl>
1 S19S_0004_Sed_Field_ICR p15      10         0      200.
2 S19S_0006_Sed_Field_ICR p2       30         5      200.
3 S19S_0008_Sed_Field_ICR p15      10         1      200.
4 S19S_0009_Sed_Field_ICR p2       20         3      200.
5 S19S_0010_Sed_Field_ICR p15      30         2      200.
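Note that the summaries above add up the counts over all ten peaks. If the goal is the per-peak merge described in the question (one ".all_" column per location, peaks still as rows), one option is to group by peak as well and pivot back to wide. The following is only a sketch under some assumptions: it uses a three-peak, five-column subset of the example data to stay short, it assumes every column name contains ".D_", ".M_" or ".U_", and the names `merged_name`, `merged` and `merged_base` are mine.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Subset of the question's data: 3 peaks, 2 locations
dd <- data.frame(
  S19S_0009_Sed_Field_ICR.M_p2  = c(1, 1, 0),
  S19S_0009_Sed_Field_ICR.U_p2  = c(0, 1, 0),
  S19S_0010_Sed_Field_ICR.D_p15 = c(0, 1, 0),
  S19S_0010_Sed_Field_ICR.M_p15 = c(0, 1, 0),
  S19S_0010_Sed_Field_ICR.U_p15 = c(0, 0, 0),
  row.names = c("200.0564158", "200.092796", "200.1291723")
)

merged <- dd %>%
  rownames_to_column(var = "peak") %>%
  pivot_longer(cols = -peak) %>%
  # replace the site letter with "all" so D/M/U columns share one name
  mutate(merged_name = sub("\\.[DMU]_", ".all_", name)) %>%
  group_by(peak, merged_name) %>%
  summarise(value = sum(value), .groups = "drop") %>%   # sum per peak and location
  pivot_wider(names_from = merged_name, values_from = value) %>%
  column_to_rownames(var = "peak")

print(merged)

# Base R alternative: sum the transposed columns by merged name with rowsum()
groups <- sub("\\.[DMU]_", ".all_", names(dd))
merged_base <- as.data.frame(t(rowsum(t(dd), group = groups)))
print(merged_base)
```

On this subset, peak 200.092796 ends up with 2 counts in "S19S_0010_Sed_Field_ICR.all_p15" (one each from D and M) and 2 counts in "S19S_0009_Sed_Field_ICR.all_p2". The peak is kept as a character key here so the row names survive the round trip unchanged.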