For a little research project, I have to work with a large panel dataset that contains the Gini coefficient for multiple countries and multiple years. However, data is not always collected annually and it occurs quite often that for a country there is only a couple of data points available.
My aim is to select only those countries for which there are two or more consecutive years available.
For example: A: 2008 10 2012 10.7 2015 12
Should be deleted from the dataset, however, I want to keep entities such as:
B: 2001 5 2002 5.6 2003 7 2009 8.6
I tried the following code:
df_new <- ddply(df,"country",function(x) {
cons_idx=which(diff(x$year)==1)
cons_idx=sort(unique(c(cons_idx,cons_idx 1)))
x[cons_idx,]
})
Based on a similar question posted on stakoverflow 1
While this chunk of code does discard A (correctly) it also discards 2009 from B. Hence, it returns: B: 2001 5 2002 5.6 2003 7
Yet, I do want 2009 included, it's just the entities without any two consecutive years that should be discarded.
My intuition is that I have to work with an if-else statement of some sort, but I'm really stuck. Any help would be greatly appreciated!
CodePudding user response:
Here is a tidyverse
approach. For each country, check for any 1-year differences. This will keep all data for the country, if this evaluation is true.
df <- data.frame(
country = c("A", "A", "A", "B", "B", "B", "B"),
year = c(2008, 2012, 2015, 2001, 2002, 2003, 2009),
value = c(10, 10.7, 12, 5, 5.6, 7, 8.6)
)
library(tidyverse)
df %>%
group_by(country) %>%
filter(any(diff(year) == 1))
Output
country year value
<chr> <dbl> <dbl>
1 B 2001 5
2 B 2002 5.6
3 B 2003 7
4 B 2009 8.6
CodePudding user response:
Using ave
you could group year by country and test if any
year diff
erences are equal ==
to 1
. Since ave
unfortunately coerces the result to numeric, we insist as.logical
, and subset the data frame within the brackets.
df[with(df, as.logical(ave(year, country, FUN=\(x) any(c(0, diff(x)) == 1)))), ]
# country year value
# 4 B 2001 5.0
# 5 B 2002 5.6
# 6 B 2003 7.0
# 7 B 2009 8.6
Data:
df <- structure(list(country = c("A", "A", "A", "B", "B", "B", "B"),
year = c(2008, 2012, 2015, 2001, 2002, 2003, 2009), value = c(10,
10.7, 12, 5, 5.6, 7, 8.6)), class = "data.frame", row.names = c(NA,
-7L))