Keeping companies with at least 3 years of data in R-CodePudding

I am working with a panel data with over 9K companies. Each company appears a few times as I have data for the period of 2010 - 2020.

dput() subset of data

structure(list(ISIN = c("AEA000201011", "AEA000201011", "AEA000201011", 
"AEA000201011", "AEA000201011", "AEA000201011", "AEA000201011", 
"AEA000201011", "AEA000201011", "AEA000201011", "AEA000201011", 
"AEA000801018", "AEA000801018", "AEA000801018", "AEA000801018", 
"AEA000801018", "AEA000801018", "AEA000801018", "AEA000801018", 
"AEA000801018", "AEA000801018", "AEA000801018", "AEA001501013", 
"AEA001501013", "AEA001501013"), year = c(2010L, 2011L, 2012L, 
2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2019L, 2020L, 2010L, 
2011L, 2012L, 2013L, 2014L, 2015L, 2016L, 2017L, 2018L, 2019L, 
2020L, 2010L, 2011L, 2012L), full_company_name = c("Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Commercial Bank PJSC", "Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Commercial Bank PJSC", "Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Commercial Bank PJSC", "Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Commercial Bank PJSC", "Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Commercial Bank PJSC", "Abu Dhabi Commercial Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Abu Dhabi Islamic Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Abu Dhabi Islamic Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Abu Dhabi Islamic Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Abu Dhabi Islamic Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Abu Dhabi Islamic Bank PJSC", 
"Abu Dhabi Islamic Bank PJSC", "Arabtec Holding PJSC", "Arabtec Holding PJSC", 
"Arabtec Holding PJSC"), refin_esg = c(NA, NA, NA, NA, 48.74, 
54.32, 63.81, 67.64, 71.74, 68.17, 63.8, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, 30.89, NA, NA, NA, NA), refin_e = c(NA, NA, NA, 
NA, 23.21, 53.02, 60.18, 18.64, 50.25, 52.21, 46.69, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, 20.52, NA, NA, NA, NA), esg_msci_hist = c(NA, 
NA, NA, NA, NA, 70, 84, 73, 74, 63, 71, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA), env_msci_hist = c(NA, NA, NA, 
NA, NA, 29, 17, 20, 19, 14, 26, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA), ESG_spg = c(NA, NA, NA, NA, NA, 35, 
36, 37, 26, 20, 15, NA, NA, NA, NA, NA, NA, NA, NA, NA, 8, 6, 
NA, NA, NA), E_spg = c(NA, NA, NA, NA, NA, 18, 26, 26, 15, 12, 
2, NA, NA, NA, NA, NA, NA, NA, NA, NA, 6, 0, NA, NA, NA)), row.names = c(NA, 
-25L), class = c("tbl_df", "tbl", "data.frame"))


    ISIN         year full_…¹ refin…² refin_e esg_m…³ env_m…⁴
   <chr>       <int> <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 AEA0002010…  2010 Abu Dh…    NA      NA        NA      NA
 2 AEA0002010…  2011 Abu Dh…    NA      NA        NA      NA
 3 AEA0002010…  2012 Abu Dh…    NA      NA        NA      NA
 4 AEA0002010…  2013 Abu Dh…    NA      NA        NA      NA
 5 AEA0002010…  2014 Abu Dh…    48.7    23.2      NA      NA
 6 AEA0002010…  2015 Abu Dh…    54.3    53.0      70      29
 7 AEA0002010…  2016 Abu Dh…    63.8    60.2      84      17
 8 AEA0002010…  2017 Abu Dh…    67.6    18.6      73      20
 9 AEA0002010…  2018 Abu Dh…    71.7    50.2      74      19
10 AEA0002010…  2019 Abu Dh…    68.2    52.2      63      14

Exepected output (example)

ISIN         year full_…¹ refin…² refin_e esg_m…³ env_m…⁴
   <chr>       <int> <chr>     <dbl>   <dbl>   <dbl>   <dbl>
 1 AEA0002010…  2015 Abu Dh…    54.3    53.0      70      29
 2 AEA0002010…  2016 Abu Dh…    63.8    60.2      84      17
 3 AEA0002010…  2017 Abu Dh…    67.6    18.6      73      20
 4 AEA0002010…  2018 Abu Dh…    71.7    50.2      74      19
 5 AEA0002010…  2019 Abu Dh…    68.2    52.2      63      14

If this company would be left with less than 3 years of data when removing NAs for the six variables "refin_e", "refin_esg", "ESG_spg", "E_spg", "env_msci_hist", "esg_msci_hist", then I want to drop the company.

I want to keep only companies that have at least 3 non-consecutive years of non-NAs for the following variables: "refin_e", "refin_esg", "ESG_spg", "E_spg", "env_msci_hist", "esg_msci_hist"

I alredy filtered for min 3 consecutive years using the following code but I need to have them for not necessairly consecutive years.

panel_data1bis <- panel_data1 %>%
  group_by(ISIN) %>%
  filter(any(with(rle(if_all(c("refin_e", "refin_esg", "ESG_spg", "E_spg",
                               "env_msci_hist", "esg_msci_hist"), complete.cases)),
                  values & lengths >2))) %>%
  ungroup

Thank you if someone can help.

CodePudding user response：

You can use complete.cases(). This should be much quicker than using rowwise() on larger data. I used is.numeric as the criteria, but you could select the columns by hand if you wish.

This will be any 3 years. If you want them in order, there is a slight bit more logic.

library(dplyr)

df %>% 
  group_by(ISIN) %>% 
  filter(if_all(is.numeric, complete.cases),
         n() > 2)

CodePudding user response：

I would first check rowwise, whether all columns are filled and then sum up the new column.

df <- df %>% 
  group_by(ISIN) %>% 
  rowwise() %>% 
  mutate(keep =  all(!is.na(c_across(c("refin_e", "refin_esg", "ESG_spg", "E_spg", "env_msci_hist", "esg_msci_hist")))))

df <- df %>% 
  group_by(ISIN) %>% 
  mutate(sum_keep = sum(keep)) %>%
  filter(sum_keep > 2)