Selecting later date observation in panel data in R-CodePudding

I have the following panel data in R:

ID_column<- c("A","A","A","A","B","B","B","B")
Date_column<-c(20040131, 20041231,20051231,20061231, 20051231, 20061231, 20071231, 20081231)
Price_column<-c(12,13,17,19,35,38,39,41)
Data<- data.frame(ID_column, Date_column, Price_column)

#The data looks like this:
 ID_column Date_column Price_column
1:         A    20040131           12
2:         A    20041231           13
3:         A    20051231           17
4:         A    20061231           19
5:         B    20051231           35
6:         B    20061231           38
7:         B    20071231           39
8:         B    20081231           41

My next aim would be to convert the Date column which is currently in a numeric YYYYMMDD format into YYYY by simply taking the first four digits of each entry in the data column as follows:

Data$Date_column<- substr(Data$Date_column,1,4)

#The data then looks like:
 ID_column Date_column Price_column
1         A        2004           12
2         A        2004           13
3         A        2005           17
4         A        2006           19
5         B        2005           35
6         B        2006           38
7         B        2007           39
8         B        2008           41

My ultimate goal would be to employ the plm package for panel data regression, but when applying the package and using pdata.frame to set the ID and Time variables as indices, I get error messages of duplicate ID/Time pairs (In this case rows 1 and 2 which would both be given the tag: A,2004). To solve this issue, I would like to delete row 1 in the original data, and only keep the newer observation from the year 2004. This would the provide me with unique ID/Time pairs across the whole data.

Therefore I was hoping for someone to help me out with a loop or a package suggestion with which I can only keep the row with the newer/later observation within a year, if this occurs, also for application to larger data sets.. I believe this involves a couple commands of conditional formatting which I am having difficulties putting together currently. I believe a loop that evaluates whether the first four digits of consecutive date observations are identical and then deletes the one with the "smaller" date/takes the "larger" date would do it, but my experience with loops is very limited.

Kind regards and thank you!

CodePudding user response：

I'd recommend to keep the Date_column as a reference to pick the later observation and mutate a new column for only the year,since you want the latest observation each year.

Data$year<- substr(Data$Date_column,1,4)
> Data$Date_column<- lubridate::ymd(Data$Date_column)
> 
> Data %>% arrange(desc(Date_column)) %>% 
    distinct(ID_column,year,.keep_all = TRUE) %>% 
    arrange(Date_column)

  ID_column Date_column Price_column year
1         A  2004-12-31           13 2004
2         A  2005-12-31           17 2005
3         B  2005-12-31           35 2005
4         A  2006-12-31           19 2006
5         B  2006-12-31           38 2006
6         B  2007-12-31           39 2007

since we arranged in the actual date in descending order, you guarantee that dropped rows for the unique combination of ID and year is the oldest. you can change the arrangement for the opposite; to get the oldest occuerence