I have biological count data in an R data frame with 200,000 entries. I would like to write a function that identifies the peaks in the count data. By peaks, I mean the 50 highest counts. I expect there to be multiple peaks in this dataset, as the median value is 0. When I run:
> summary(df$V3)
My output looks like this:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
    0.00     0.00     0.00     1.82     1.00 94746.00
I want to write a function that lists the peaks and then looks at the counts on either side of each peak (positions +1 and -1) to produce a ratio. Can anyone help with this?
My dataframe looks like this and is labelled df:
V1 V2 V3
gene 1 6
gene 2 0
gene 3 0
gene 4 10
....
My expected output would be a data frame identifying the peaks, and at what position (V2) within this dataset so I can examine the numbers on either side of the peaks to produce a ratio for analysis.
CodePudding user response:
This is a crude way of doing it. It gives you the values on either side of each peak, from which you can compute a ratio.
Here I treated any value higher than the mean (1.82 in your summary) as a peak.
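library(tidyverse)
# toy data in the same layout as the question
"V1 V2 V3
gene 1 6
gene 2 0
gene 3 0
gene 4 10
gene 5 1" %>%
  read_table() -> df
# mean taken from the summary() output; named so it does not mask base::mean()
threshold <- 1.82
# positions (V2) of every value above the threshold
df %>%
  filter(V3 > threshold) %>%
  pull(V2) -> ids
# lag() is the previous row (position -1), lead() is the next row (position +1)
df %>%
  mutate(minus_peaks = lag(V3),
         plus_peaks = lead(V3)) %>%
  filter(V2 %in% ids)
# A tibble: 2 × 5
  V1       V2    V3 minus_peaks plus_peaks
  <chr> <dbl> <dbl>       <dbl>      <dbl>
1 gene      1     6          NA          0
2 gene      4    10           0          1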
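To go from the neighbouring values to an actual ratio, something like the following base R sketch might help. It is only an illustration on the toy data: the column names before, after and ratio are made up here, the threshold is computed from the toy data rather than taken from the full dataset, and it assumes the rows are already sorted by V2 with no gaps in the positions.

```r
# Toy data in the same layout as the example above
df <- data.frame(V1 = "gene", V2 = 1:5, V3 = c(6, 0, 0, 10, 1))

# Neighbouring counts: previous row (position -1) and next row (position +1)
n <- nrow(df)
df$before <- c(NA, df$V3[-n])
df$after  <- c(df$V3[-1], NA)

# Threshold computed from the data instead of a hard-coded constant
threshold <- mean(df$V3)

# Keep rows above the threshold and compute the ratio of the two neighbours
peaks <- df[df$V3 > threshold, ]
peaks$ratio <- peaks$before / peaks$after
peaks
```

Note that with a median of 0 many neighbours will be 0, so expect Inf, NaN or NA in the ratio column; you may want to decide up front how to treat those cases (for example by adding a pseudocount).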
CodePudding user response:
This should give you the positions (V2) of the peaks (vec). Using the map_dfr function from the purrr package, you can then get the relevant ratio, i.e.
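library(purrr)
library(dplyr)
# sample data
df <- data.frame(V1 = "gene",
                 V2 = 1:10000,
                 V3 = rbeta(10000, 1, 5))
# positions (V2) of the 50 largest counts
vec <- df %>%
  arrange(-V3) %>%
  slice(1:50) %>%
  pull(V2)
# for each peak, keep the rows at position -1 and +1 and take their ratio
# (a peak at the first or last position has only one neighbour, so its ratio is NA)
map_dfr(vec, ~df %>%
          filter(V2 == .x - 1 | V2 == .x + 1) %>%
          mutate(ratio = V3[1] / V3[2])
)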
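Since the question asks for a function, here is one possible way to wrap the idea up in base R. This is a sketch rather than a definitive solution: top_n_peaks, before, after and ratio are names invented for this example, the seed is arbitrary, and it assumes V2 holds consecutive positions so that the adjacent row really is position -1 or +1.

```r
set.seed(1)  # arbitrary seed so the simulated data is reproducible
df <- data.frame(V1 = "gene", V2 = 1:10000, V3 = rbeta(10000, 1, 5))

# Hypothetical helper: return the n largest counts with their neighbours
top_n_peaks <- function(df, n = 50) {
  df <- df[order(df$V2), ]            # make sure rows are in positional order
  m <- nrow(df)
  df$before <- c(NA, df$V3[-m])       # count at position -1
  df$after  <- c(df$V3[-1], NA)       # count at position +1
  df$ratio  <- df$before / df$after   # ratio of the two neighbours
  # sort by count, largest first, and keep the first n rows
  head(df[order(df$V3, decreasing = TRUE), ], n)
}

peaks <- top_n_peaks(df)
head(peaks)
```

Computing the neighbours once for the whole data frame avoids filtering the full table separately for every peak, which should matter more with 200,000 rows than with this sample.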