Filtering out values below a threshold among the rows and getting the average

I have a dataset from a mass spec measurement. In this small subset, some rows (peptides) are repeated, but with different intensities.

a <- dput(test_Data)
structure(list(UNIPROT = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L), .Label = c("A8DUK4", "P08032", "P15508"), class = "factor"), 
    Intensity = c(16926.19, 36738.94, 2203.22, 5338.85, 133.44, 
    27991.35, 29505.84, 201.4695, 47469.09, 24841.01, 4546.9, 
    22805.69, 18494.71, 28805.99, 68220.65, 90526.29, 63259.19, 
    44492.48, 65497.13, 40704.81, 334874.1, 38702.87, 300135)), class = "data.frame", row.names = c(NA, 
-23L))

Data frame

    UNIPROT   Intensity
1   P08032  16926.1900
2   P08032  36738.9400
3   P08032   2203.2200
4   P08032   5338.8500
5   P08032    133.4400
6   P08032  27991.3500
7   P08032  29505.8400
8   P15508    201.4695
9   P15508  47469.0900
10  P15508  24841.0100
11  P15508   4546.9000
12  P15508  22805.6900
13  P15508  18494.7100
14  P15508  28805.9900
15  A8DUK4  68220.6500
16  A8DUK4  90526.2900
17  A8DUK4  63259.1900
18  A8DUK4  44492.4800
19  A8DUK4  65497.1300
20  A8DUK4  40704.8100
21  A8DUK4 334874.1000
22  A8DUK4  38702.8700
23  A8DUK4 300135.0000

My objective

I want to keep only one row for each repeated entry, holding the average of its values.

For my first peptide, for example, I don't want this row to count toward the average:

5   P08032    133.4400

My idea is to keep only the rows above a certain threshold, average the rows that pass, and generate a new data frame in which each unique entry appears once with its average value.

Is it also possible to define a different threshold for each unique entry?

In my small subset I have three unique entries. Can I set three different thresholds and then get the average for each?

Any suggestion or help would be really appreciated.

UPDATE

From what I have read in papers, people use a maximum threshold. Maybe I can keep intensities above 5000, but then I'm not sure how to treat the remaining peptides with intensities below 5000.

For now, I will use this cutoff of 5000.

Answer:

Here are three methods to solve this problem, all using the cutoff of 5000.

Method I: Using the aggregate function

aggregate(test_Data[test_Data$Intensity >= 5000, 2], list(test_Data[test_Data$Intensity >= 5000, ]$UNIPROT), FUN = mean)

Output:

  Group.1         x
1  A8DUK4 116268.06
2  P08032  23300.23
3  P15508  28483.30
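
One caveat with filtering before aggregating: a protein whose intensities all fall below the cutoff drops out of the result entirely. None of the three proteins here is affected at 5000, but the update in the question raises this case. As a minimal base R sketch, assuming you would rather keep such proteins with NA as their mean, something like the following could work. The helper name mean_above is hypothetical, and the second call uses an illustrative cutoff of 50000 only to trigger the case.

```r
# Rebuild the example data from the question
test_Data <- data.frame(
  UNIPROT = factor(rep(c("P08032", "P15508", "A8DUK4"), times = c(7, 7, 9))),
  Intensity = c(16926.19, 36738.94, 2203.22, 5338.85, 133.44, 27991.35, 29505.84,
                201.4695, 47469.09, 24841.01, 4546.9, 22805.69, 18494.71, 28805.99,
                68220.65, 90526.29, 63259.19, 44492.48, 65497.13, 40704.81,
                334874.1, 38702.87, 300135)
)

# Hypothetical helper: per-protein mean of intensities at or above `cutoff`.
# Proteins with no qualifying intensity are kept, with NA as their mean.
mean_above <- function(df, cutoff) {
  means <- tapply(df$Intensity, df$UNIPROT, function(x) {
    kept <- x[x >= cutoff]
    if (length(kept) == 0) NA_real_ else mean(kept)
  })
  data.frame(UNIPROT = names(means),
             Mean_Intensity = as.vector(means),
             stringsAsFactors = FALSE)
}

res_5000  <- mean_above(test_Data, 5000)   # same means as the methods in this answer
res_50000 <- mean_above(test_Data, 50000)  # illustrative cutoff: two proteins become NA

print(res_5000)
print(res_50000)
```

Whether NA, zero, or dropping the protein is the right treatment depends on the downstream analysis, so this is only one possible convention.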

Method II: Using the dplyr package

library(dplyr)
test_Data %>% 
  filter(Intensity >= 5000) %>%
  group_by(UNIPROT) %>%
  summarise(Mean_Intensity = mean(Intensity))

Output:

# A tibble: 3 x 2
  UNIPROT Mean_Intensity
  <fct>            <dbl>
1 A8DUK4         116268.
2 P08032          23300.
3 P15508          28483.

Method III: Using the data.table package

library(data.table)
setDT(test_Data) # Convert to a data.table by reference (required for the syntax below)

test_Data[Intensity >= 5000,.(Mean_Intensity = mean(Intensity)), by = .(UNIPROT)]

Output:

   UNIPROT Mean_Intensity
1:  P08032       23300.23
2:  P15508       28483.30
3:  A8DUK4      116268.06
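
The question also asks whether each protein can have its own threshold. A minimal base R sketch of one way to do this is to store the cutoffs in a named vector and look up each row's cutoff by its UNIPROT ID. The cutoff values below are hypothetical and chosen only for illustration; they do not come from the question.

```r
# Rebuild the example data from the question
test_Data <- data.frame(
  UNIPROT = factor(rep(c("P08032", "P15508", "A8DUK4"), times = c(7, 7, 9))),
  Intensity = c(16926.19, 36738.94, 2203.22, 5338.85, 133.44, 27991.35, 29505.84,
                201.4695, 47469.09, 24841.01, 4546.9, 22805.69, 18494.71, 28805.99,
                68220.65, 90526.29, 63259.19, 44492.48, 65497.13, 40704.81,
                334874.1, 38702.87, 300135)
)

# Hypothetical per-protein cutoffs (illustrative values only)
cutoffs <- c(P08032 = 5000, P15508 = 10000, A8DUK4 = 50000)

# Look up the cutoff that applies to each row by its protein ID
row_cutoff <- unname(cutoffs[as.character(test_Data$UNIPROT)])

# Keep rows at or above their own protein's cutoff, then average per protein
kept <- test_Data[test_Data$Intensity >= row_cutoff, ]
result <- aggregate(Intensity ~ UNIPROT, data = kept, FUN = mean)
print(result)
```

Every protein in the data needs an entry in cutoffs; a missing ID would give an NA cutoff for its rows, so it is worth checking for that before filtering real data.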