Reference to a column created by dplyr mutate() function

Good afternoon,

I'm fairly new to this and am currently working in Spark using the sparklyr and dplyr libraries. I've run into a problem: after performing a transformation with mutate() (for example, adding a column), I cannot reference the newly created column, which is vital for my later calculations. In other words, my original df does not contain the new column; it exists only in the result of the transformation.

Here is an example:

#Creating a df
block1_value <- c(1000, 1500, 2000, 3000, 3500, 4000, 5000)
block2_value <- c(1, 2, 3, 4, 5, 6, 7)
block3_value <- c("a", "b", "c", "d", "e", "f", "g")

df <- data.frame(block1_value, block2_value, block3_value)

#Using mutate() to add new calculated column
df %>%
  mutate(Result = block1_value + block2_value)

#Referencing this newly created column produces an error
df %>%
  mutate(Result2 = ifelse(Result > 3000, "Yes", "No"))

How can I fix this using dplyr syntax? (I can only use the dplyr library, since all the work is performed in Spark.)

Thanks a lot!!

CodePudding user response：

mutate() doesn't modify df in place. It returns a modified copy of the data frame, and df itself stays unchanged unless you assign the result. The following code works because the %>% operator forwards the result of the first mutate() (i.e. the modified df) to the second mutate().

df %>%
  mutate(Result = block1_value + block2_value) %>%
  mutate(Result2 = ifelse(Result > 3000, "Yes", "No"))


#>  block1_value block2_value block3_value Result Result2
#>1         1000            1            a   1001      No
#>2         1500            2            b   1502      No
#>3         2000            3            c   2003      No
#>4         3000            4            d   3004     Yes
#>5         3500            5            e   3505     Yes
#>6         4000            6            f   4006     Yes
#>7         5000            7            g   5007     Yes
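
To make the "modified copy" point concrete, here is a small sketch on a shortened local data frame. It only illustrates the behaviour described above on an ordinary data.frame; df_with_result is a made-up name for this example, and nothing here has been run against a Spark table.

```r
library(dplyr)

# Shortened version of the data frame from the question
df <- data.frame(
  block1_value = c(1000, 1500, 2000),
  block2_value = c(1, 2, 3)
)

# The result of this pipeline is printed but never stored anywhere
df %>% mutate(Result = block1_value + block2_value)

# df itself still has only its original columns
names(df)

# Storing the result keeps the new column
# (df_with_result is a hypothetical name used only in this sketch)
df_with_result <- df %>% mutate(Result = block1_value + block2_value)
names(df_with_result)
```

Since df never receives the Result column, a later df %>% mutate(...) that refers to Result has nothing to look up, which is where the error in the question comes from.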

CodePudding user response：

You have not assigned the result of mutate() back to the data frame.

This works:

df <- df %>% mutate(Result = block1_value + block2_value)

df <- df %>% mutate(Result2 = ifelse(Result > 3000, "Yes", "No"))

But chaining both calls in one pipeline is cleaner and more concise:

df <- df %>%
  mutate(Result = block1_value + block2_value) %>%
  mutate(Result2 = ifelse(Result > 3000, "Yes", "No"))
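
As a possible further simplification, dplyr lets a later expression inside the same mutate() call refer to a column created earlier in that call. The sketch below shows this on a local data frame with the values from the question. I would expect the same syntax to carry over to a Spark table through sparklyr, but that is an assumption and has not been checked here.

```r
library(dplyr)

df <- data.frame(
  block1_value = c(1000, 1500, 2000, 3000, 3500, 4000, 5000),
  block2_value = c(1, 2, 3, 4, 5, 6, 7)
)

# Result is created first, then used by Result2 within the same mutate() call
df <- df %>%
  mutate(
    Result  = block1_value + block2_value,
    Result2 = ifelse(Result > 3000, "Yes", "No")
  )

df
```

Whether to use one mutate() or two chained ones is mostly a matter of taste. The important part is still the assignment back to df.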