I have a large dataframe (400,000 rows). I am trying to create a new column based on a condition. Grouping by Gene
and Group
before evaluating the condition is necessary, but the new column should be generated independently for each row (if condition is met).
The condition: two or more Gene-Group must have a Counts
value >=10. For instance, Gene1-GroupB does not pass this condition, while both Gene2-GroupB and Gene3-GroupB pass. I haven't been able to figure out how to code this condition.
If the Gene-Group passes the condition, I want to create a new column with the value of (NascentCountsCPM / TotalCountsCPM)
for each row. If the condition is not met, I want to return NA values. I believe a combination of dplyr mutate()
and case_when()
can be used, but I simply cannot wrap my head around how to code the described condition.
Thanks for any help!
#Starting DataFrame
df>
| Gene | Group | Sample | Counts | TotalCountsCPM | NascentCountsCPM |
|-------|-------|----------|--------|----------------|-----------------|
| Gene1 | B | SampleB1 | 7 | 1.36 | 0 |
| Gene2 | B | SampleB1 | 269 | 52.29 | 2.92 |
| Gene3 | B | SampleB1 | 25 | 4.86 | 0.19 |
| Gene1 | B | SampleB2 | 2 | 0.49 | 0 |
| Gene2 | B | SampleB2 | 212 | 52.45 | 0.99 |
| Gene3 | B | SampleB2 | 16 | 3.96 | 0 |
| Gene1 | B | SampleB3 | 3 | 0.64 | 0 |
| Gene2 | B | SampleB3 | 219 | 46.58 | 1.7 |
| Gene3 | B | SampleB3 | 41 | 8.72 | 0.21 |
df_new <- df %>%
group_by(Gene, Group) %>%
mutate(CountsRatio = case_when((CONDITION) ~ (NascentCountsCPM / TotalCountsCPM),
TRUE ~ NA_real_)
>df_new
| Gene | Group | Sample | Counts | TotalCountsCPM | NascentCountsCPM | CountsRatio |
|-------|-------|----------|--------|----------------|------------------|-------------|
| Gene1 | B | SampleB1 | 7 | 1.36 | 0 | NA |
| Gene2 | B | SampleB1 | 269 | 52.29 | 2.92 | 0.056 |
| Gene3 | B | SampleB1 | 25 | 4.86 | 0.19 | 0.4 |
| Gene1 | B | SampleB2 | 2 | 0.49 | 0 | NA |
| Gene2 | B | SampleB2 | 212 | 52.45 | 0.99 | 0.019 |
| Gene3 | B | SampleB2 | 16 | 3.96 | 0 | 0 |
| Gene1 | B | SampleB3 | 3 | 0.64 | 0 | NA |
| Gene2 | B | SampleB3 | 219 | 46.58 | 1.7 | 0.056 |
| Gene3 | B | SampleB3 | 41 | 8.72 | 0.21 | 0.024 |
df <- data.frame(
stringsAsFactors = FALSE,
Gene = c("Gene1","Gene2","Gene3",
"Gene1","Gene2","Gene3","Gene1","Gene2","Gene3"),
Group = c("B", "B", "B", "B", "B", "B", "B", "B", "B"),
Sample = c("SampleB1","SampleB1",
"SampleB1","SampleB2","SampleB2","SampleB2","SampleB3",
"SampleB3","SampleB3"),
Counts = c(7L, 269L, 25L, 2L, 212L, 16L, 3L, 219L, 41L),
TotalCountsCPM = c(1.36, 52.29, 4.86, 0.49, 52.45, 3.96, 0.64, 46.58, 8.72),
NascentCountsCPM = c(0, 2.92, 0.19, 0, 0.99, 0, 0, 1.7, 0.21)
)
CodePudding user response:
df %>%
add_count(Gene, Group, wt = Counts > 10) %>%
mutate(CountsRatio = if_else(n >= 2, NascentCountsCPM/TotalCountsCPM, NA_real_))