I have the following df:
df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
The df looks like this:
comp_name | year | indicator | value |
---|---|---|---|
A | 2016 | total_revenue | 100 |
A | 2016 | overseas_revenue | NA |
A | 2016 | domestic_revenue | NA |
A | 2017 | total_revenue | 100 |
A | 2017 | overseas_revenue | 20 |
A | 2017 | domestic_revenue | 80 |
B | 2016 | total_revenue | 90 |
B | 2016 | overseas_revenue | NA |
B | 2016 | domestic_revenue | 60 |
B | 2017 | total_revenue | 90 |
B | 2017 | overseas_revenue | NA |
B | 2017 | domestic_revenue | NA |
I want to group by comp_name and year and apply the following rule to each group: if value for overseas_revenue AND domestic_revenue are NA, set value for domestic_revenue equal to the value for total_revenue, else do nothing.
The resulting df should look like this:
comp_name | year | indicator | value |
---|---|---|---|
A | 2016 | total_revenue | 100 |
A | 2016 | overseas_revenue | NA |
A | 2016 | domestic_revenue | 100 |
A | 2017 | total_revenue | 100 |
A | 2017 | overseas_revenue | 20 |
A | 2017 | domestic_revenue | 80 |
B | 2016 | total_revenue | 90 |
B | 2016 | overseas_revenue | NA |
B | 2016 | domestic_revenue | 60 |
B | 2017 | total_revenue | 90 |
B | 2017 | overseas_revenue | NA |
B | 2017 | domestic_revenue | 90 |
My actual dataset has 500k rows with 12 different indicators and I haven't been able to find an approach that works. Any help would be greatly appreciated-thanks!
CodePudding user response:
You can do it with two pivots:
library(dplyr)
library(tidyr)
df <- data.frame(comp_name = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
year = c("2016", "2016", "2016", "2017","2017", "2017", "2016","2016", "2016", "2017", "2017", "2017"),
indicator = c("total_revenue", "overseas_revenue", "domestic_revenue", "total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue","total_revenue", "overseas_revenue", "domestic_revenue"),
value = c(100, NA, NA, 100, 20, 80, 90, NA, 60, 90, NA, NA))
df %>%
pivot_wider(names_from="indicator",
values_from = "value") %>%
mutate(domestic_revenue = case_when(
is.na(overseas_revenue) & is.na(domestic_revenue) ~ total_revenue,
TRUE ~ domestic_revenue)) %>%
pivot_longer(-c(comp_name, year),
names_to = "indicator",
values_to = "value")
#> # A tibble: 12 × 4
#> comp_name year indicator value
#> <chr> <chr> <chr> <dbl>
#> 1 A 2016 total_revenue 100
#> 2 A 2016 overseas_revenue NA
#> 3 A 2016 domestic_revenue 100
#> 4 A 2017 total_revenue 100
#> 5 A 2017 overseas_revenue 20
#> 6 A 2017 domestic_revenue 80
#> 7 B 2016 total_revenue 90
#> 8 B 2016 overseas_revenue NA
#> 9 B 2016 domestic_revenue 60
#> 10 B 2017 total_revenue 90
#> 11 B 2017 overseas_revenue NA
#> 12 B 2017 domestic_revenue 90
Created on 2022-04-28 by the reprex package (v2.0.1)
CodePudding user response:
require(tidyverse)
df %>%
spread(indicator, value) %>%
mutate(domestic_revenue = case_when(
is.na(domestic_revenue) & is.na(overseas_revenue) ~ total_revenue,
TRUE ~ domestic_revenue
)) %>%
gather(c(-comp_name, -year), key = indicator, value = value) %>%
arrange(comp_name, year)
# A tibble: 12 x 4
comp_name year indicator value
<chr> <chr> <chr> <dbl>
1 A 2016 domestic_revenue 100
2 A 2016 overseas_revenue NA
3 A 2016 total_revenue 100
4 A 2017 domestic_revenue 80
5 A 2017 overseas_revenue 20
6 A 2017 total_revenue 100
7 B 2016 domestic_revenue 60
8 B 2016 overseas_revenue NA
9 B 2016 total_revenue 90
10 B 2017 domestic_revenue 90
11 B 2017 overseas_revenue NA
12 B 2017 total_revenue 90
CodePudding user response:
I think best to create a very simple function that handles the adjustment you want, and apply that function by group. It will be a lot faster than pivoting.
f <- function(i,v) {
if(all(is.na(v[grepl("^(o|d)",i)]))) v[i=="domestic_revenue"]=v[i=="total_revenue"]
return(v)
}
Using data.table (will be fast)
setDT(df)[,value:=f(indicator,value), by=.(comp_name, year)]
Using dplyr (will be slower, but still faster than pivoting)
df %>%
group_by(comp_name,year) %>%
mutate(value=f(indicator,value))
Output:
comp_name year indicator value
<char> <char> <char> <num>
1: A 2016 total_revenue 100
2: A 2016 overseas_revenue NA
3: A 2016 domestic_revenue 100
4: A 2017 total_revenue 100
5: A 2017 overseas_revenue 20
6: A 2017 domestic_revenue 80
7: B 2016 total_revenue 90
8: B 2016 overseas_revenue NA
9: B 2016 domestic_revenue 60
10: B 2017 total_revenue 90
11: B 2017 overseas_revenue NA
12: B 2017 domestic_revenue 90