Home > Back-end >  How to group_by when there is null values R?
How to group_by when there is null values R?

Time:08-04

Lets say I have the following data frame:

df <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
  result = c(NA, 33, 13, 44, 23, 44, 52, 11, NA, NA),
  flag = c("", "", "Y", "", "Y", "", "", "", "", ""),
  col_1 = c("a", "b", "c", "d" , "e" , "f" , "g" , "h", "i", "j")
)

Outcome:

   id  result  flag   col_1
1   1     NA           a
2   1     33           b
3   1     13    Y      c
4   1     44           d
5   2     23    Y      e
6   2     44           f
7   2     52           g
8   2     11           h
9   3     NA           i
10  3     NA           j

Each unique "id" has a base value. The base value for each unique "id" is equal to the "value" when "flag" is equal to "Y".

I want to create a new column named "base_vale" that stores the base value for each unique "id" and then calculate a new column named percentage change. The final outcome looks like as follow:

   id   result   flag col_1  base_value  percentage_change
3   1     NA          a         13        NA
4   1     33          b         13        153.846154
1   1     13    Y     c         13        0.000000
2   1     44          d         13        238.461538
7   2     23    Y     e         23        0.000000
8   2     44          f         23        91.304348
5   2     52          g         23        126.086957
6   2     11          h         23        -52.173913
9   3     NA          i         NA        NA
10  3     NA          j         NA        NA

The code I am using now is as follow:

df_right <- df[df$flag == "Y", ]
df <- merge(x = df,
            y = df_right,
            by = "id",
            all.x = TRUE)
df <- df %>% rename(
  flag = flag.x,
  result = result.x,
  base_value = result.y,
  col_1 = col_1.x
)
df <- subset(df, select = -c(flag.y, col_1.y))
df$percentage_change <-
  (df$result - df$base_value) / df$base_value * 100
df <- df[order(df$col_1), ]

Another potential solution could be the following code if I didn't have NA values in result. So the following code doesn't word:

df %>% group_by(id) %>%
  mutate(base_value = result[which(flag == "Y")] ,
         percentage_change = (result - base_value)/base_value * 100) %>%
  ungroup()

I am not happy with my solution and am looking for a cleaner and nicer code to get the job done. Anny help is much appreciated.

CodePudding user response:

We can use ifelse to check if there are any Ys in the group, and set NA otherwise:

df %>% 
  group_by(id) %>%
  mutate(
    base_value = ifelse(any(flag == "Y"), result[flag == "Y"], NA),
    percentage_change = (result - base_value)/base_value * 100
  ) %>%
  ungroup()
# # A tibble: 10 × 6
#       id result flag  col_1 base_value percentage_change
#    <int>  <int> <chr> <chr>      <int>             <dbl>
#  1     1     NA ""    a             13              NA  
#  2     1     33 ""    b             13             154. 
#  3     1     13 "Y"   c             13               0  
#  4     1     44 ""    d             13             238. 
#  5     2     23 "Y"   e             23               0  
#  6     2     44 ""    f             23              91.3
#  7     2     52 ""    g             23             126. 
#  8     2     11 ""    h             23             -52.2
#  9     3     NA ""    i             NA              NA  
# 10     3     NA ""    j             NA              NA  

Also please note that empty strings "" are different from missing values NA are different from NULL values (don't exist, length 0, can't really be in data frames).


Using this data:

df = read.table(text = "   id  result  flag   col_1
1   1     NA    ''       a
2   1     33    ''       b
3   1     13    Y      c
4   1     44    ''       d
5   2     23    Y      e
6   2     44    ''       f
7   2     52    ''       g
8   2     11    ''       h
9   3     NA    ''       i
10  3     NA    ''       j", header = T)

CodePudding user response:

Slightly different approach in defining the base value. The formula from @Gregor Thomas:

library(dplyr)

df  %>%  
  group_by(id) %>% 
  arrange(flag, .by_group = TRUE) %>% 
  mutate(base_value = last(result),
         percent_change = (result - base_value)/base_value * 100)

     id result flag  col_1 base_value percent_change
   <dbl>  <dbl> <chr> <chr>      <dbl>          <dbl>
 1     1     NA ""    a             13           NA  
 2     1     33 ""    b             13          154. 
 3     1     44 ""    d             13          238. 
 4     1     13 "Y"   c             13            0  
 5     2     44 ""    f             23           91.3
 6     2     52 ""    g             23          126. 
 7     2     11 ""    h             23          -52.2
 8     2     23 "Y"   e             23            0  
 9     3     NA ""    i             NA           NA  
10     3     NA ""    j             NA           NA 

CodePudding user response:

It may be also done by extraction on the first [1] element i.e. if there are no 'Y' cases, it automatically becomes NA

library(dplyr)
df %>% 
   group_by(id) %>%
   mutate(base_value = result[flag == 'Y'][1],
      percentage_change = 100 *(result - base_value)/base_value) %>%
   ungroup

-output

# A tibble: 10 × 6
      id result flag  col_1 base_value percentage_change
   <dbl>  <dbl> <chr> <chr>      <dbl>             <dbl>
 1     1     NA ""    a             13              NA  
 2     1     33 ""    b             13             154. 
 3     1     13 "Y"   c             13               0  
 4     1     44 ""    d             13             238. 
 5     2     23 "Y"   e             23               0  
 6     2     44 ""    f             23              91.3
 7     2     52 ""    g             23             126. 
 8     2     11 ""    h             23             -52.2
 9     3     NA ""    i             NA              NA  
10     3     NA ""    j             NA              NA  
  • Related