I have a dataset from mass spec measurement. So in this small subset there are rows or peptides which are repeated but with different intensity.
a <- dput(test_Data)
structure(list(UNIPROT = structure(c(2L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("A8DUK4", "P08032", "P15508"), class = "factor"),
Intensity = c(16926.19, 36738.94, 2203.22, 5338.85, 133.44,
27991.35, 29505.84, 201.4695, 47469.09, 24841.01, 4546.9,
22805.69, 18494.71, 28805.99, 68220.65, 90526.29, 63259.19,
44492.48, 65497.13, 40704.81, 334874.1, 38702.87, 300135)), class = "data.frame", row.names = c(NA,
-23L))
Data frame
UNIPROT Intensity
1 P08032 16926.1900
2 P08032 36738.9400
3 P08032 2203.2200
4 P08032 5338.8500
5 P08032 133.4400
6 P08032 27991.3500
7 P08032 29505.8400
8 P15508 201.4695
9 P15508 47469.0900
10 P15508 24841.0100
11 P15508 4546.9000
12 P15508 22805.6900
13 P15508 18494.7100
14 P15508 28805.9900
15 A8DUK4 68220.6500
16 A8DUK4 90526.2900
17 A8DUK4 63259.1900
18 A8DUK4 44492.4800
19 A8DUK4 65497.1300
20 A8DUK4 40704.8100
21 A8DUK4 334874.1000
22 A8DUK4 38702.8700
23 A8DUK4 300135.0000
So My objective
I have to keep only one value from the repeated rows but after taking out average.
In case of my first peptide I don't want to consider this row
5 P08032 133.4400
My idea is to take out only those rows which are above a certain threshold and if it passes the threshold then taking average and add or generate a new data-frame where only unique row will remain and their average value.
So each it possible to define different threshold for these individual unique rows.
Here in my small subset I have three unique row. So is that possible for me to put three different threshold and then get the average which .
Any suggestion or help would be really appreciated
UPDATE
Although what i read from papers that people consider maximum threshold . May be I can take if the intensity is above 5000 but again Im not sure if the rest of the peptide which has less than 5000 how do i consider that.
But right now I will take this cutoff of 5000.
CodePudding user response:
Sharing 3 methods to solve the mentioned problem.
Method I: Using aggregate
function
aggregate(test_Data[test_Data$Intensity >= 5000, 2], list(test_Data[test_Data$Intensity >= 5000, ]$UNIPROT), FUN = mean)
Output:
Group.1 x
1 A8DUK4 116268.06
2 P08032 23300.23
3 P15508 28483.30
Method II: Using dplyr
package
library(dplyr)
test_Data %>%
filter(Intensity >= 5000) %>%
group_by(UNIPROT) %>%
summarise(Mean_Intensity = mean(Intensity))
Output:
# A tibble: 3 x 2
UNIPROT Mean_Intensity
<fct> <dbl>
1 A8DUK4 116268.
2 P08032 23300.
3 P15508 28483.
Method III: Using data.table
package
library(data.table)
setDT(test_Data) # Converting to data.table object (necessary step)
test_Data[Intensity >= 5000,.(Mean_Intensity = mean(Intensity)), by = .(UNIPROT)]
Output:
UNIPROT Mean_Intensity
1: P08032 23300.23
2: P15508 28483.30
3: A8DUK4 116268.06