Cumulatively add number between rows in order to create new columns BUT with different way for the fir

Time:08-11

This post is related to this one: Cumulatively add number between rows in order to create new columns in R

But with different rules.

I have a dataframe such as

Sp1 start end  
A   100   1077 
B   2316  4088
B   26647 28746
B   50000 60000
C   450    789

and I would like to add two new columns in which I cumulatively add the previous start and end coordinates.

However, the first row of each Sp1 value is not treated the same way as the following rows of that Sp1 in the dataframe.

The first row of the dataframe should always keep its own coordinates:

Sp1 start end     new_start  new_end 
A   100   1077    100        1077

Then for the first B :

Sp1 start end     new_start  new_end 
A   100   1077    100        1077
B   2316  4088    1078       2850
  • where 1078 = 1077 + 1
  • where 2850 = 1078 + (4088-2316)

For the next B it is different, since it is not the first B, so I compute it differently:

Sp1 start end     new_start  new_end 
A   100   1077    100        1077
B   2316  4088    1078       2850
B   26647 28746   25409      27508
  • where 2850 + (26647-4088) = 25409
  • where 25409 + (28746-26647) = 27508

For the third B it is the same as for the second B:

Sp1 start end     new_start  new_end 
A   100   1077    100        1077
B   2316  4088    1078       2850
B   26647 28746   25409      27508
B   50000 60000   48762      58762
  • where 27508 + (50000-28746) = 48762
  • where 48762 + (60000-50000) = 58762

The C row is the first C, so I do as for the first B:

Sp1 start end     new_start  new_end 
A   100   1077    100        1077
B   2316  4088    1078       2850
B   26647 28746   25409      27508
B   50000 60000   48762      58762
C   450   789     58763      59101
  • where 58763 = 58762 + 1
  • where 59101 = 58762 + (789-450)

Then at the end, I should get the expected result:

Sp1 start end     new_start  new_end 
A   100   1077    100        1077
B   2316  4088    1078       2850
B   26647 28746   25409      27508
B   50000 60000   48762      58762
C   450   789     58763      59101

If someone has an idea, it would be amazing !

Here is the dataframe in dput format if it can help :

structure(list(Sp1 = c("A", "B", "B", "B", "C"), start = c(100L, 
2316L, 26647L, 50000L, 450L), end = c(1077, 4088, 28746, 60000, 
789)), class = "data.frame", row.names = c(NA, -5L))

CodePudding user response:

Since you can't compute all columns at once (you need to wait for previous iterations to be able to compute the result for line i), just use a loop. The row number per Sp1 can be done at once using dplyr:

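library(dplyr)

# Number the rows within each Sp1 group (1 = first occurrence)
df <- df %>% group_by(Sp1) %>% mutate(sp_row = row_number()) %>% ungroup()
df$new_start <- df$new_end <- NA
# The first row keeps its original coordinates
df$new_start[1] <- df$start[1]
df$new_end[1] <- df$end[1]
for( i in 2:nrow(df)) {
  if(df$sp_row[i]==1) {
    # First row of a new Sp1: start right after the previous new_end
    df$new_start[i] <- df$new_end[i-1] + 1
    df$new_end[i] <- df$new_start[i] + df$end[i]-df$start[i]
  }
  if(df$sp_row[i]!=1) {
    # Later rows of the same Sp1
    df$new_start[i] <- df$start[i]-df$new_end[i-1]
    df$new_end[i] <- df$new_start[i] + df$end[i]-df$start[i]
  }
}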
# A tibble: 5 x 6
  Sp1   start   end new_start new_end sp_row
  <chr> <int> <dbl>     <dbl>   <dbl>  <int>
1 A       100  1077       100    1077      1
2 B      2316  4088      1078    2850      1
3 B     26647 28746     23797   25896      2
4 B     50000 60000     24104   34104      3
5 C       450   789     34105   34444      1

There is at least one mistake in your example btw: 50000-25896 = 29053 is wrong.
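
For comparison, here is a minimal sketch that follows the rules as they are written in the question, where a repeated Sp1 row gets new_start = previous new_end + (start - previous end). It is not the loop from the answer above, whose table differs from the question's from the third row on. The function name add_cumulative_coords is made up for illustration, and the sketch assumes that rows with the same Sp1 are contiguous. Because each new_end is the previous new_end plus a gap plus the row's width, the recurrence can be written with cumsum() and needs no loop. With the sample data it reproduces the question's table except for the last new_end, which comes out as 59102 (58763 + 339) where the question lists 59101 (58762 + 339).

```r
# Sample data from the question (typed as plain numerics for simplicity)
df <- data.frame(
  Sp1   = c("A", "B", "B", "B", "C"),
  start = c(100, 2316, 26647, 50000, 450),
  end   = c(1077, 4088, 28746, 60000, 789)
)

# Hypothetical helper: applies the question's rules without a loop.
# Assumption: rows sharing an Sp1 value are adjacent in the data frame.
add_cumulative_coords <- function(df) {
  width    <- df$end - df$start          # length of each interval
  first    <- !duplicated(df$Sp1)        # TRUE on the first row of each Sp1
  prev_end <- c(NA, head(df$end, -1))    # original end of the previous row

  # Distance from the previous new_end to this row's new_start:
  # 1 for the first row of an Sp1, otherwise the original gap between rows
  gap    <- ifelse(first, 1, df$start - prev_end)
  gap[1] <- df$start[1]                  # very first row keeps its own start

  df$new_end   <- cumsum(gap + width)
  df$new_start <- df$new_end - width
  df[, c("Sp1", "start", "end", "new_start", "new_end")]
}

res <- add_cumulative_coords(df)
print(res)
```

If the last row should instead match 59101 exactly, the rule for the first row of a new Sp1 would need to be stated differently from the rule used for the first B, so that case is left open here.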
