Home > Back-end >  R: Adding an Indicator to Mark Newly Added Rows
R: Adding an Indicator to Mark Newly Added Rows

Time:01-23

I am working with the R programming language.

I have this dataset:

name = c("john", "john", "john", "sarah", "sarah", "peter", "peter", "peter", "peter")
year = c(2010, 2011, 2014, 2010, 2015, 2011, 2012, 2013, 2015)
age = c(21, 22, 25, 55, 60, 61, 62, 63, 65)
gender = c("male", "male", "male", "female", "female", "male", "male", "male", "male" )
country_of_birth = c("australia", "australia", "australia", "uk", "uk", "mexico", "mexico", "mexico", "mexico")
source = "ORIGINAL"

my_data = data.frame(name, year, age, gender, country_of_birth, source)

> my_data
   name year age gender country_of_birth   source
1  john 2010  21   male        australia ORIGINAL
2  john 2011  22   male        australia ORIGINAL
3  john 2014  25   male        australia ORIGINAL
4 sarah 2010  55 female               uk ORIGINAL
5 sarah 2015  60 female               uk ORIGINAL
6 peter 2011  61   male           mexico ORIGINAL
7 peter 2012  62   male           mexico ORIGINAL
8 peter 2013  63   male           mexico ORIGINAL
9 peter 2015  65   male           mexico ORIGINAL

In a previous question(Filling Spaces in Missing Rows) - for each person, I learned how to add rows where the intermediate years are missing, and populate these rows with the "logically" corresponding information (e.g. adjust the "age" variable while keeping "gender" and "country_of_birth" the same):

library(dplyr)
library(tidyr)

final = my_data %>% 
  group_by(name) %>% 
  complete(year = first(year): last(year)) %>% 
  mutate(age = ifelse(is.na(age), first(age) row_number()-1,age)) %>% 
  fill(c(gender, country_of_birth), .direction = "down")

   name   year   age gender country_of_birth source  
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>   
 1 john   2010    21 male   australia        ORIGINAL
 2 john   2011    22 male   australia        ORIGINAL
 3 john   2012    23 male   australia        NA      
 4 john   2013    24 male   australia        NA    

Now, I would like to add a "source" variable that shows whether a row in this final dataset has been added or has existed all along

I had tried to add this logic to the DPLYR code, but I was not sure how to do this (I tried different ways but nothing worked). I had an idea that perhaps I could just do this with an ifelse() statement?

my_data$source <- ifelse(is.na(my_data$source), "MISSING", my_data$source)

   name   year   age gender country_of_birth source  
   <chr> <dbl> <dbl> <chr>  <chr>            <chr>   
 1 john   2010    21 male   australia        ORIGINAL
 2 john   2011    22 male   australia        ORIGINAL
 3 john   2012    23 male   australia        MISSING 

Can someone please tell me if this logic is correct and how I might have been able to implement this logic within the original DPLYR code?

Thanks!

CodePudding user response:

Alternatively you could try the below

data.frame(name, year, age, gender, country_of_birth, source) %>% group_by(name) %>% 
complete(year=full_seq(year, 1)) %>% 
   mutate(source=ifelse(is.na(source), 'MISSING', source)) %>%
  fill(c(age, gender, country_of_birth), .direction = 'down')

Created on 2023-01-23 with reprex v2.0.2

  •  Tags:  
  • r
  • Related