Lets say I have the following data frame:
df <- data.frame(
id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3),
result = c(NA, 33, 13, 44, 23, 44, 52, 11, NA, NA),
flag = c("", "", "Y", "", "Y", "", "", "", "", ""),
col_1 = c("a", "b", "c", "d" , "e" , "f" , "g" , "h", "i", "j")
)
Outcome:
id result flag col_1
1 1 NA a
2 1 33 b
3 1 13 Y c
4 1 44 d
5 2 23 Y e
6 2 44 f
7 2 52 g
8 2 11 h
9 3 NA i
10 3 NA j
Each unique "id" has a base value. The base value for each unique "id" is equal to the "value" when "flag" is equal to "Y".
I want to create a new column named "base_vale" that stores the base value for each unique "id" and then calculate a new column named percentage change. The final outcome looks like as follow:
id result flag col_1 base_value percentage_change
3 1 NA a 13 NA
4 1 33 b 13 153.846154
1 1 13 Y c 13 0.000000
2 1 44 d 13 238.461538
7 2 23 Y e 23 0.000000
8 2 44 f 23 91.304348
5 2 52 g 23 126.086957
6 2 11 h 23 -52.173913
9 3 NA i NA NA
10 3 NA j NA NA
The code I am using now is as follow:
df_right <- df[df$flag == "Y", ]
df <- merge(x = df,
y = df_right,
by = "id",
all.x = TRUE)
df <- df %>% rename(
flag = flag.x,
result = result.x,
base_value = result.y,
col_1 = col_1.x
)
df <- subset(df, select = -c(flag.y, col_1.y))
df$percentage_change <-
(df$result - df$base_value) / df$base_value * 100
df <- df[order(df$col_1), ]
Another potential solution could be the following code if I didn't have NA values in result. So the following code doesn't word:
df %>% group_by(id) %>%
mutate(base_value = result[which(flag == "Y")] ,
percentage_change = (result - base_value)/base_value * 100) %>%
ungroup()
I am not happy with my solution and am looking for a cleaner and nicer code to get the job done. Anny help is much appreciated.
CodePudding user response:
We can use ifelse
to check if there are any Y
s in the group, and set NA
otherwise:
df %>%
group_by(id) %>%
mutate(
base_value = ifelse(any(flag == "Y"), result[flag == "Y"], NA),
percentage_change = (result - base_value)/base_value * 100
) %>%
ungroup()
# # A tibble: 10 × 6
# id result flag col_1 base_value percentage_change
# <int> <int> <chr> <chr> <int> <dbl>
# 1 1 NA "" a 13 NA
# 2 1 33 "" b 13 154.
# 3 1 13 "Y" c 13 0
# 4 1 44 "" d 13 238.
# 5 2 23 "Y" e 23 0
# 6 2 44 "" f 23 91.3
# 7 2 52 "" g 23 126.
# 8 2 11 "" h 23 -52.2
# 9 3 NA "" i NA NA
# 10 3 NA "" j NA NA
Also please note that empty strings ""
are different from missing values NA
are different from NULL
values (don't exist, length 0, can't really be in data frames).
Using this data:
df = read.table(text = " id result flag col_1
1 1 NA '' a
2 1 33 '' b
3 1 13 Y c
4 1 44 '' d
5 2 23 Y e
6 2 44 '' f
7 2 52 '' g
8 2 11 '' h
9 3 NA '' i
10 3 NA '' j", header = T)
CodePudding user response:
Slightly different approach in defining the base value. The formula from @Gregor Thomas:
library(dplyr)
df %>%
group_by(id) %>%
arrange(flag, .by_group = TRUE) %>%
mutate(base_value = last(result),
percent_change = (result - base_value)/base_value * 100)
id result flag col_1 base_value percent_change
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 NA "" a 13 NA
2 1 33 "" b 13 154.
3 1 44 "" d 13 238.
4 1 13 "Y" c 13 0
5 2 44 "" f 23 91.3
6 2 52 "" g 23 126.
7 2 11 "" h 23 -52.2
8 2 23 "Y" e 23 0
9 3 NA "" i NA NA
10 3 NA "" j NA NA
CodePudding user response:
It may be also done by extraction on the first [1]
element i.e. if there are no 'Y' cases, it automatically becomes NA
library(dplyr)
df %>%
group_by(id) %>%
mutate(base_value = result[flag == 'Y'][1],
percentage_change = 100 *(result - base_value)/base_value) %>%
ungroup
-output
# A tibble: 10 × 6
id result flag col_1 base_value percentage_change
<dbl> <dbl> <chr> <chr> <dbl> <dbl>
1 1 NA "" a 13 NA
2 1 33 "" b 13 154.
3 1 13 "Y" c 13 0
4 1 44 "" d 13 238.
5 2 23 "Y" e 23 0
6 2 44 "" f 23 91.3
7 2 52 "" g 23 126.
8 2 11 "" h 23 -52.2
9 3 NA "" i NA NA
10 3 NA "" j NA NA