I have a dataset like:
year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
df = data.frame(year, id, product, market)
I want to create 3 variables indicating:
- FPFM = takes value 1 if it is the first time with this product in this given market
- FP = takes value 1 if it is the first time with this product
- FM = takes value 1 if it is the first time in this market:
Therefore, the new data will look like:
year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
FPFM = c(1, 1, 1, 0, 1, 1, 1)
FP = c(1, 1, 1, 0, 1, 0, 1)
FM = c(1, 1, 1, 0, 1, 1, 0)
df_desired = data.frame(year, id, product, market, FPFM, FP, FM)
I have tried the following df_new code without success:
df_new <- df %>%
arrange(id, year) %>%
group_by(id, product, market) %>%
mutate(FPFM = row_number(year) == 1) %>%
as.data.frame() %>%
group_by(id, product) %>%
mutate(FP = row_number(year) == 1) %>%
as.data.frame() %>%
group_by(id, market) %>%
mutate(FM = row_number(year) == 1) %>%
as.data.frame()
It only gives a value for really the first observation. I want to have the value for the FIRST YEAR that product,market or combination of the two is observed.
Row 3 should be "TRUE; TRUE; TRUE" instead of "FALSE; FASLE; FALSE" as it belongs to the same year.
The other solution that I think about is to summarise df by unique values 3 times and then right join with the original df. However, this will take lot of time and space as I have lots of data.
Do you have a most efficient and integrated solution?
CodePudding user response:
Change row_number(year) == 1
to year == year[1]
:
df_new <- df %>%
arrange(id, year) %>%
group_by(id, product, market) %>%
mutate(FPFM = year == year[1]) %>%
group_by(id, product) %>%
mutate(FP = year == year[1]) %>%
group_by(id, market) %>%
mutate(FM = year == year[1])
Also, the repetition as.data.frame
seems unnecessary. You can keep the last one if you really want a data.frame instead of a tibble, but in my opinion tibbles are a better option. Check this section of "Advanced R" for some reasons why.
Result:
> df_new
year id product market FPFM FP FM
1 2000 X apple CHN TRUE TRUE TRUE
2 2000 X orange USA TRUE TRUE TRUE
3 2000 X orange USA TRUE TRUE TRUE
4 2002 X orange USA FALSE FALSE FALSE
5 2000 Z cake SPA TRUE TRUE TRUE
6 2002 Z cake CHL TRUE FALSE TRUE
7 2007 Z bacon CHL TRUE TRUE FALSE
CodePudding user response:
I would just make a little helper function to make the code a little cleaner. Note we can change logical to binary with math
library(tidyverse)
which.firsts <- function(.data, ...){
.data %>%
arrange(id, year) %>%
group_by(...) %>%
mutate(.val = ` `(year == first(year))) %>%
pull(.val)
}
df %>%
mutate(FPFM = which.firsts(., id, product, market),
FP = which.firsts(., id, product),
FM = which.firsts(., id, market))
#> year id product market FPFM FP FM
#> 1 2000 X apple CHN 1 1 1
#> 2 2000 X orange USA 1 1 1
#> 3 2000 X orange USA 1 1 1
#> 4 2002 X orange USA 0 0 0
#> 5 2000 Z cake SPA 1 1 1
#> 6 2002 Z cake CHL 1 0 1
#> 7 2007 Z bacon CHL 1 1 0