Home > OS >  Signal all years where a given patter is observed in R
Signal all years where a given patter is observed in R

Time:11-27

I have a dataset like:

year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
df = data.frame(year, id, product, market)

I want to create 3 variables indicating:

  1. FPFM = takes value 1 if it is the first time with this product in this given market
  2. FP = takes value 1 if it is the first time with this product
  3. FM = takes value 1 if it is the first time in this market:

Therefore, the new data will look like:

year = c("2000", "2000", "2000", "2002", "2000", "2002", "2007")
id = c("X", "X", "X", "X", "Z", "Z", "Z")
product = c("apple", "orange", "orange", "orange", "cake", "cake", "bacon")
market = c("CHN", "USA", "USA", "USA", "SPA", "CHL", "CHL")
FPFM = c(1, 1, 1, 0, 1, 1, 1)
FP = c(1, 1, 1, 0, 1, 0, 1)
FM = c(1, 1, 1, 0, 1, 1, 0)
df_desired = data.frame(year, id, product, market, FPFM, FP, FM)

I have tried the following df_new code without success:

df_new <- df %>%
  arrange(id, year) %>% 
  group_by(id, product, market) %>% 
  mutate(FPFM = row_number(year) == 1) %>% 
  as.data.frame() %>% 
  group_by(id, product) %>% 
  mutate(FP = row_number(year) == 1) %>% 
  as.data.frame() %>% 
  group_by(id, market) %>% 
  mutate(FM = row_number(year) == 1) %>% 
  as.data.frame() 

It only gives a value for really the first observation. I want to have the value for the FIRST YEAR that product,market or combination of the two is observed.

Row 3 should be "TRUE; TRUE; TRUE" instead of "FALSE; FASLE; FALSE" as it belongs to the same year.

The other solution that I think about is to summarise df by unique values 3 times and then right join with the original df. However, this will take lot of time and space as I have lots of data.

Do you have a most efficient and integrated solution?

CodePudding user response:

Change row_number(year) == 1 to year == year[1]:

df_new <- df %>%
  arrange(id, year) %>% 
  group_by(id, product, market) %>% 
  mutate(FPFM = year == year[1]) %>% 
  group_by(id, product) %>% 
  mutate(FP = year == year[1]) %>% 
  group_by(id, market) %>% 
  mutate(FM = year == year[1])

Also, the repetition as.data.frame seems unnecessary. You can keep the last one if you really want a data.frame instead of a tibble, but in my opinion tibbles are a better option. Check this section of "Advanced R" for some reasons why.

Result:

> df_new
  year id product market  FPFM    FP    FM
1 2000  X   apple    CHN  TRUE  TRUE  TRUE
2 2000  X  orange    USA  TRUE  TRUE  TRUE
3 2000  X  orange    USA  TRUE  TRUE  TRUE
4 2002  X  orange    USA FALSE FALSE FALSE
5 2000  Z    cake    SPA  TRUE  TRUE  TRUE
6 2002  Z    cake    CHL  TRUE FALSE  TRUE
7 2007  Z   bacon    CHL  TRUE  TRUE FALSE

CodePudding user response:

I would just make a little helper function to make the code a little cleaner. Note we can change logical to binary with math

library(tidyverse)

which.firsts <- function(.data, ...){
  .data %>%
    arrange(id, year) %>% 
    group_by(...) %>%
    mutate(.val = ` `(year == first(year))) %>%
    pull(.val)
}

df %>%
  mutate(FPFM = which.firsts(., id, product, market),
         FP = which.firsts(., id, product),
         FM  = which.firsts(., id, market))
#>   year id product market FPFM FP FM
#> 1 2000  X   apple    CHN    1  1  1
#> 2 2000  X  orange    USA    1  1  1
#> 3 2000  X  orange    USA    1  1  1
#> 4 2002  X  orange    USA    0  0  0
#> 5 2000  Z    cake    SPA    1  1  1
#> 6 2002  Z    cake    CHL    1  0  1
#> 7 2007  Z   bacon    CHL    1  1  0
  • Related