This post is related to this one: Cumulatively add number between rows in order to create new columns in R
but with different rules.
I have a dataframe such as:
Sp1 start end
A 100 1077
B 2316 4088
B 26647 28746
B 50000 60000
C 450 789
and I would like to add two new columns, new_start and new_end, where I cumulatively add the previous start and end coordinates.
However, the first row of each Sp1 value is not treated the same way as the following rows with the same Sp1.
The first row of the dataframe should always keep its own values:
Sp1 start end new_start new_end
A 100 1077 100 1077
Then for the first B:
Sp1 start end new_start new_end
A 100 1077 100 1077
B 2316 4088 1078 2850
where 1078 = 1077 + 1
where 2850 = 1078 + (4088-2316)
The other B rows are handled differently, since they are not the first B:
Sp1 start end new_start new_end
A 100 1077 100 1077
B 2316 4088 1078 2850
B 26647 28746 25409 27508
- where 2850 + (26647-4088) = 25409
- where 25409 + (28746-26647) = 27508
For the third B it is the same as for the second B:
Sp1 start end new_start new_end
A 100 1077 100 1077
B 2316 4088 1078 2850
B 26647 28746 25409 27508
B 50000 60000 48762 58762
- where 27508 + (50000-28746) = 48762
- where 48762 + (60000-50000) = 58762
C is the first C, so it follows the first-row rule:
Sp1 start end new_start new_end
A 100 1077 100 1077
B 2316 4088 1078 2850
B 26647 28746 25409 27508
B 50000 60000 48762 58762
C 450 789 58763 59101
- where 58763 = 58762 + 1
- where 59101 = 58762 + (789-450)
Then at the end, I should get the expected result:
Sp1 start end new_start new_end
A 100 1077 100 1077
B 2316 4088 1078 2850
B 26647 28746 25409 27508
B 50000 60000 48762 58762
C 450 789 58763 59101
If someone has an idea, it would be amazing!
Here is the dataframe in dput format, if it can help:
structure(list(Sp1 = c("A", "B", "B", "B", "C"), start = c(100L,
2316L, 26647L, 50000L, 450L), end = c(1077, 4088, 28746, 60000,
789)), class = "data.frame", row.names = c(NA, -5L))
CodePudding user response:
Since you can't compute all the values at once (row i needs the result of the previous iteration), just use a loop. The row number within each Sp1 can be computed in one step using dplyr:
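library(dplyr)

# Row number within each Sp1 group (1 = first row of that Sp1)
df <- df %>% group_by(Sp1) %>% mutate(sp_row = row_number()) %>% ungroup()
df$new_start <- df$new_end <- NA
# The first row keeps its original coordinates
df$new_start[1] <- df$start[1]
df$new_end[1] <- df$end[1]
for (i in 2:nrow(df)) {
  if (df$sp_row[i] == 1) {
    # First row of a new Sp1: start right after the previous new_end
    df$new_start[i] <- df$new_end[i-1] + 1
    df$new_end[i] <- df$new_start[i] + df$end[i] - df$start[i]
  } else {
    # Later rows of the same Sp1
    df$new_start[i] <- df$start[i] - df$new_end[i-1]
    df$new_end[i] <- df$new_start[i] + df$end[i] - df$start[i]
  }
}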
# A tibble: 5 x 6
Sp1 start end new_start new_end sp_row
<chr> <int> <dbl> <dbl> <dbl> <int>
1 A 100 1077 100 1077 1
2 B 2316 4088 1078 2850 1
3 B 26647 28746 23797 25896 2
4 B 50000 60000 24104 34104 3
5 C 450 789 34105 34444 1
There is at least one mistake in your example, by the way: 50000-25896 = 29053 is wrong (it is 24104).
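Note that the loop above computes start[i] - new_end[i-1] for the non-first rows, whereas the worked example in the question uses new_end[i-1] + (start[i] - end[i-1]). If the latter is what is wanted, a minimal base R sketch could look like the following. It assumes that new_end is always new_start + (end - start); the helper vector is_first is a name introduced here for illustration and replaces the dplyr sp_row column.

```r
# Data from the question (same values as the dput output).
df <- data.frame(
  Sp1   = c("A", "B", "B", "B", "C"),
  start = c(100, 2316, 26647, 50000, 450),
  end   = c(1077, 4088, 28746, 60000, 789)
)

# TRUE on the first occurrence of each Sp1 value (hypothetical helper).
is_first <- !duplicated(df$Sp1)

df$new_start <- NA_real_
df$new_end   <- NA_real_

# The first row keeps its original coordinates.
df$new_start[1] <- df$start[1]
df$new_end[1]   <- df$end[1]

# seq_len(n)[-1] is empty when there is a single row, unlike 2:n.
for (i in seq_len(nrow(df))[-1]) {
  if (is_first[i]) {
    # First row of a new Sp1: start right after the previous new_end.
    df$new_start[i] <- df$new_end[i - 1] + 1
  } else {
    # Same Sp1 as before: keep the original gap to the previous end.
    df$new_start[i] <- df$new_end[i - 1] + (df$start[i] - df$end[i - 1])
  }
  # Assumption: the interval length (end - start) is preserved.
  df$new_end[i] <- df$new_start[i] + (df$end[i] - df$start[i])
}

print(df)
```

With the question's data this reproduces the expected new_start column and the expected new_end values for A and B. For C it gives new_end = 59102 rather than the 59101 shown in the question, because the question's last formula adds (789-450) to 58762 instead of to 58763, which is not consistent with how the first B row was computed.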