Home > Enterprise >  How can I clean a panel dataset in R such that only the entries with one or more than two consecutiv
How can I clean a panel dataset in R such that only the entries with one or more than two consecutiv

Time:12-25

For a little research project, I have to work with a large panel dataset that contains the Gini coefficient for multiple countries and multiple years. However, data is not always collected annually and it occurs quite often that for a country there is only a couple of data points available.

My aim is to select only those countries for which there are two or more consecutive years available.

For example: A: 2008 10 2012 10.7 2015 12

Should be deleted from the dataset, however, I want to keep entities such as:

B: 2001 5 2002 5.6 2003 7 2009 8.6

I tried the following code:

df_new <- ddply(df,"country",function(x) {
  cons_idx=which(diff(x$year)==1)
  cons_idx=sort(unique(c(cons_idx,cons_idx 1)))
  x[cons_idx,]
})

Based on a similar question posted on stakoverflow 1

While this chunk of code does discard A (correctly) it also discards 2009 from B. Hence, it returns: B: 2001 5 2002 5.6 2003 7

Yet, I do want 2009 included, it's just the entities without any two consecutive years that should be discarded.

My intuition is that I have to work with an if-else statement of some sort, but I'm really stuck. Any help would be greatly appreciated!

CodePudding user response:

Here is a tidyverse approach. For each country, check for any 1-year differences. This will keep all data for the country, if this evaluation is true.

df <- data.frame(
  country = c("A", "A", "A", "B", "B", "B", "B"),
  year = c(2008, 2012, 2015, 2001, 2002, 2003, 2009),
  value = c(10, 10.7, 12, 5, 5.6, 7, 8.6)
)

library(tidyverse)

df %>%
  group_by(country) %>%
  filter(any(diff(year) == 1))

Output

  country  year value
  <chr>   <dbl> <dbl>
1 B        2001   5  
2 B        2002   5.6
3 B        2003   7  
4 B        2009   8.6

CodePudding user response:

Using ave you could group year by country and test if any year differences are equal == to 1. Since ave unfortunately coerces the result to numeric, we insist as.logical, and subset the data frame within the brackets.

df[with(df, as.logical(ave(year, country, FUN=\(x) any(c(0, diff(x)) == 1)))), ]
#   country year value
# 4       B 2001   5.0
# 5       B 2002   5.6
# 6       B 2003   7.0
# 7       B 2009   8.6

Data:

df <- structure(list(country = c("A", "A", "A", "B", "B", "B", "B"), 
    year = c(2008, 2012, 2015, 2001, 2002, 2003, 2009), value = c(10, 
    10.7, 12, 5, 5.6, 7, 8.6)), class = "data.frame", row.names = c(NA, 
-7L))
  • Related