I have an R dataframe that looks like this:
chr bp instances_1 instances_2 instances_sum
1 143926410 0 1 1
1 144075771 1 0 1
1 187762696 0 2 2
1 187783844 2 0 2
2 121596288 0 1 1
2 122042325 3 0 3
2 259939985 1 0 1
2 259991389 0 1 1
What I would like to do is group by 'chr', determine if two rows are within 1e7 base-pairs ('bp') from one another, and if they are, retain the average (and round the average) and sum across all other columns that met the condition. So, the final product would look like:
chr bp instances_1 instances_2 instances_sum
1 144001091 1 1 2
1 187773270 2 2 4
2 121819307 3 1 4
2 259965687 1 1 2
I tried the to manipulate the following code (using tidyverse) that I used for a similar kind of task that did it over multiple columns:
df_Pruned <- df |>
group_by(chr_snp1, chr_snp2) |>
mutate(grp = (abs(loc_snp1 - lag(loc_snp1, default = first(loc_snp1))) < 1e7) &
(abs(loc_snp2 - lag(loc_snp2, default = first(loc_snp2))) < 1e7)) |>
group_by(grp, .add=TRUE) |>
filter(pval == min(pval)) |>
ungroup()|>
select(-grp)
into this by trying to do the same over one grouping variable ('chr') and by trying to average and sum at the same time:
df_Pruned <- df |>
group_by(chr) |>
mutate(grp = (abs(bp - lag(bp, default = first(bp))) < 1e7)) |>
group_by(grp, .add=TRUE) |>
filter(bp == mean(bp) & instances_sum == sum(instances_sum)) |>
ungroup()|>
select(-grp)
But I can't get it to work. I think I'm close but could use some help.
CodePudding user response:
Using cumsum
with the lag
condition produces your expected output:
df |>
mutate(grp = cumsum(abs(bp - lag(bp, default = first(bp))) > 1e7)) |>
group_by(chr, grp) |>
summarise(bp = mean(bp),
across(starts_with("instance"), sum),
.groups = "drop")
# A tibble: 4 × 6
chr grp bp instances_1 instances_2 instances_sum
<int> <int> <dbl> <int> <int> <int>
1 1 0 144001090. 1 1 2
2 1 1 187773270 2 2 4
3 2 2 121819306. 3 1 4
4 2 3 259965687 1 1 2