R function to find the sum of c given the column values

I want to create a function that I can simulate n times. My ultimate goal is to find the sum of c for each of the n simulations. I am a beginner in R, so I am just starting to practice with for loops and if/else statements.

This is what I hope to achieve for now: if a > b, c would be 2, and if a < b, c would be -2. If a = b, c would be determined by the a and b values of the NEXT row. This is what I have so far, but I keep getting errors. I would like to know whether what I have for a = b is the right approach. Any help is appreciated.

a<-c(5,6,7,8,9,10,1,4,6,7)
b<-c(4,6,8,5,3,4,5,2,1,3)
c<-c(0,0,0,0,0,0,0,0,0,0)

df<-data.frame(a,b,c)

if(df$a > df$b){
  df$c<- c(2)}
  
  else if(df$a < df$b){
    df$c<- c(-2)}
  
  else if(df$a == df$b){ # a=b
        
    if(df$a[ 1,] > df$b[ 1,]) {
          df$c<- c(2)}
    else(df$a[ 1,] < df$b[ 1,]){
      df$c<- c(-2) }
  }
  else
print("error")

}

sum(df$c)

CodePudding user response：

Edit: two users already pointed out what to do if there are consecutive rows where a == b. This is a good opportunity to dive into the tidyverse (as others have already suggested):

library(dplyr)
library(tidyr)

df <- data.frame(
  a = c(5,6,7,8,9,10,1,4,6,7),
  b = c(4,6,8,5,3,4,5,2,1,3)
)

df %>%
  mutate(c = ifelse(a == b, NA, 2 * sign(a-b))) %>% ## (1)
  fill(c, .direction = 'up') ## (2)

(1) set c to NA when a == b
(2) 'fill' (replace) NAs with the next available value down the rows

When starting with R, it helps to know that vectorizing (the x[n] thing, i.e. operating on whole vectors at once) usually makes your code more concise and, in certain situations, much faster than using loops. In your case:

df$c <- 2 * sign(df$a - df$b) ## see ?sign
z <- df$c == 0 ## see (1)
df$c[z] <- dplyr::lead(df$c, 1)[z] ## see (2)

(1) equal numbers have sign zero, so z is a logical vector marking the positions (rows) where a == b (there, z is TRUE)
(2) change c only at the positions where z is TRUE. lead and lag (from dplyr) take a vector and return it shifted by a given number of positions. Because the shift here is a single position, this version resolves a tie only when the next row is not itself a tie.
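
To make the shift in step (2) concrete, here is a small base R sketch. It assumes that c(x[-1], NA) is an acceptable stand-in for dplyr::lead(x, 1), and the vector x is made-up data with two consecutive ties, so it also shows where a single shift stops being enough.

```r
# Made-up scores: 0 marks a tie (a == b); rows 2 and 3 are consecutive ties.
x <- c(2, 0, 0, -2)

# Base R stand-in for dplyr::lead(x, 1): every element takes the value below it,
# and the last position becomes NA because nothing follows it.
shifted <- c(x[-1], NA)

# Overwrite only the tie positions with the shifted values.
z <- x == 0
resolved <- x
resolved[z] <- shifted[z]

resolved  # 2 0 -2 -2: row 3 is resolved, row 2 only picked up another tie
```

Row 2 stays at 0 because the row below it was also a tie, which is the case the fill()-based version handles.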

CodePudding user response：

The problem

if and else in R are meant for control flow and are not vectorized. In plain English, this means that if() expects a condition that evaluates to a single TRUE or FALSE. When you do df$a > df$b, you get a logical vector with one element per row of your data frame. In R versions before 4.2.0, if() then uses only the first element and gives a warning, which leads to wrong answers; from R 4.2.0 on, it stops with an error.
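
A minimal sketch of the difference, using three made-up rows. The tryCatch() wrapper is only there so that the snippet runs on any R version and records whether if() warned or stopped.

```r
a <- c(5, 6, 7)
b <- c(4, 6, 8)

cond <- a > b            # logical vector: TRUE FALSE FALSE
n_cond <- length(cond)   # 3, one value per row

# if() wants a single TRUE/FALSE, so a length-3 condition is a problem.
# Depending on the R version this is a warning or an error.
outcome <- tryCatch(
  {
    if (cond) "ran without complaint"
  },
  warning = function(w) "warning",
  error   = function(e) "error"
)

# ifelse() evaluates the condition element by element instead.
# Note that the tie in row 2 falls into the "no" branch here.
vec <- ifelse(cond, 2, -2)   # 2 -2 -2
```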

A better solution

I think you are looking for ifelse(), which is vectorized. And since you have nested if-else statements, you are probably better off with dplyr::case_when().

Here is an example which also fixes cases where a == b for multiple rows:

# Note that I've added two consecutive rows where a == b
a <- c(5,6,6,7,8,9,10,1,4,6,7)
b <- c(4,6,6,8,5,3,4,5,2,1,3)

df <- data.frame(a, b)


library(dplyr)

df %>% 
  mutate(
    c = case_when(
      a > b ~  2,
      a < b ~ -2,
      # If neither a > b nor a < b is TRUE, they must be equal,
      # so we set all other cases to NA...
      TRUE ~ NA_real_
    )
  ) %>% 
  # ... and then we use fill() to replace NAs with the first
  # non-NA value after it
  tidyr::fill(c, .direction = "up")

#>     a b  c
#> 1   5 4  2
#> 2   6 6 -2
#> 3   6 6 -2
#> 4   7 8 -2
#> 5   8 5  2
#> 6   9 3  2
#> 7  10 4  2
#> 8   1 5 -2
#> 9   4 2  2
#> 10  6 1  2
#> 11  7 3  2

Created on 2022-03-30 by the reprex package (v2.0.1)

How this works:

ifelse() works like the if() and else() in your code, but it accepts multiple values.
case_when() acts like nested ifelse() statements: it first checks whether a > b and sets those rows to 2, then checks the remaining rows for a < b and sets those to -2, and so on.
In rows where a is neither less than nor greater than b, they must be equal. We set these rows to NA.
Afterwards, we use tidyr::fill() to replace each missing value with the first non-missing value after it. This handles cases where there are multiple consecutive rows of a == b.
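
For readers who want to see the fill step without tidyr, here is a rough base R sketch that uses the kind of for loop the question was practicing. The names fill_up and sum_c are made up for this illustration and are not part of any package. Dropping a trailing tie with na.rm = TRUE is an assumption, since the question does not say what should happen when the last row has no next row.

```r
# Hypothetical helper: replace each NA with the next non-NA value below it,
# mimicking tidyr::fill(.direction = "up") for a single vector.
fill_up <- function(x) {
  # Walk from the second-to-last element back to the first, so a run of
  # NAs is filled from the bottom of the run upwards.
  for (i in rev(seq_along(x))[-1]) {
    if (is.na(x[i])) x[i] <- x[i + 1]
  }
  x
}

# Hypothetical wrapper: score two vectors and return the sum of c.
sum_c <- function(a, b) {
  c_raw <- ifelse(a == b, NA, 2 * sign(a - b))
  # Assumption: a tie in the last row has no next row, stays NA, and is skipped.
  sum(fill_up(c_raw), na.rm = TRUE)
}

a <- c(5, 6, 6, 7, 8, 9, 10, 1, 4, 6, 7)
b <- c(4, 6, 6, 8, 5, 3, 4, 5, 2, 1, 3)
total <- sum_c(a, b)   # 6 for this data
```

Wrapping the steps in a function like this is one way to get to the original goal, since it can then be called once per simulated pair of vectors.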

CodePudding user response：

Here is a tidyverse solution. It also works when several consecutive rows have a == b (I have added row 3 to the data to demonstrate).

It relies on cumsum() to group the data, such that rows with a == b are in the same group as the next row where a != b. Then it sets c to the last value in the group.

library(tidyverse) 

a<-c(5,6,5,7,8,9,10,1,4,6,7)
b<-c(4,6,5,8,5,3,4,5,2,1,3)

df <-data.frame(a,b)

df |> 
  mutate(c = ifelse(a>b, 2, -2), # Determines c for `a != b` cases
         grp = rev(cumsum(rev(a != b)))) |> # create group variable, use rev() since we want backward cumsum
  group_by(grp) |> 
  mutate(c = last(c)) |> 
  ungroup() |> 
  select(-grp)
#> # A tibble: 11 × 3
#>        a     b     c
#>    <dbl> <dbl> <dbl>
#>  1     5     4     2
#>  2     6     6    -2
#>  3     5     5    -2
#>  4     7     8    -2
#>  5     8     5     2
#>  6     9     3     2
#>  7    10     4     2
#>  8     1     5    -2
#>  9     4     2     2
#> 10     6     1     2
#> 11     7     3     2

Created on 2022-03-30 by the reprex package (v2.0.1)
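
To see what the backward cumsum() grouping produces, here is a small base R sketch on the first five rows of the data above. Using ave() as a stand-in for group_by() plus last() is an assumption made so the snippet runs without the tidyverse.

```r
a <- c(5, 6, 5, 7, 8)
b <- c(4, 6, 5, 8, 5)

differs <- a != b                   # TRUE FALSE FALSE TRUE TRUE

# Counting the non-tie rows from the bottom up gives each tie the same
# group id as the next non-tie row below it.
grp <- rev(cumsum(rev(differs)))    # 3 2 2 2 1

c_raw <- ifelse(a > b, 2, -2)       # provisional values; tie rows are overwritten next

# Within each group, replace every value with the group's last value.
c_final <- ave(c_raw, grp, FUN = function(v) v[length(v)])

c_final  # 2 -2 -2 -2 2
```

Rows 2 to 4 share group 2, so both ties take the value of row 4.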