How do the plm() function in R and panelOLS() in Python handle missing values?

Time:03-04

I am building a model using plm() from the plm package.

One of my x variables contains NAs because I computed it as a t-1 lag.

My R code looks like this:

panel_df <- pdata.frame(df, index = c("AUTHOR_ID", "Year"), drop.index = TRUE, row.names = TRUE)

plmFit1 <- plm(y ~ x1 + x2 + x3_t_1, data = panel_df, effect = 'twoways')

The closest thing I found in the online documentation is:

The data are not necessarily made consecutive (regular time series with distance 1), because balancedness does not imply consecutiveness. For making the data consecutive, use make.pconsecutive() (and, optionally, set argument balanced = TRUE to make consecutive and balanced, see also Examples for a comparison of the two functions. Note: Rows of (p)data.frames (elements for pseries) with NA values in individual or time index are not examined but silently dropped before the data are made balanced. In this case, it cannot be inferred which individual or time period is meant by the missing value(s) (see also Examples). Especially, this means: NA values in the first/last position of the original time periods for an individual are dropped, which are usually meant to depict the beginning and ending of the time series for that individual. Thus, one might want to check if there are any NA values in the index variables before applying make.pbalanced, and especially check for NA values in the first and last position for each individual in original data and, if so, maybe set those to some meaningful begin/end value for the time series.

I also did not find anything for panelOLS.

How do they handle missing values by default? I ask because I still receive output with coefficients.

Appreciate any tips!

Answer:

I cannot comment on Python's panelOLS, but I would assume it behaves similarly.

plm follows standard lm behaviour: observations (rows) with an NA in any variable used in the model are dropped prior to estimation. The documentation you cite is not related to this behaviour.

Compare your data before estimation (df, panel_df) with the data after estimation (found in the model object in $model).

You can also look at ?na.omit and read the behaviour described there for na.omit (the other approaches described there are not supported by plm).
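To make "standard lm behaviour" concrete, here is a minimal sketch using base R only. The data frame toy and its values are made up for illustration; the point is just that lm() with the default na.action (na.omit) silently fits on the complete rows.

```r
# Toy data (hypothetical values): row 2 has a missing regressor
toy <- data.frame(
  y = c(1.2, 1.9, 3.1, 4.2, 4.8),
  x = c(1, NA, 3, 4, 5)
)

# Default na.action is na.omit: incomplete rows are dropped before fitting
fit <- lm(y ~ x, data = toy)

n_input <- nrow(toy)        # rows handed to lm()
n_used  <- nobs(fit)        # rows actually used in estimation
omitted <- na.action(fit)   # records which rows were dropped

print(c(input = n_input, used = n_used))
print(omitted)
```

The object returned by na.action() lists the dropped rows, which is a quick way to see what was removed without comparing data frames by hand.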

Here is an example:

library(plm)
data(Grunfeld) 
pdf <- pdata.frame(Grunfeld)

head(pdf)
#>        firm year   inv  value capital
#> 1-1935    1 1935 317.6 3078.5     2.8
#> 1-1936    1 1936 391.8 4661.7    52.6
#> 1-1937    1 1937 410.6 5387.1   156.9
#> 1-1938    1 1938 257.7 2792.2   209.2
#> 1-1939    1 1939 330.8 4313.2   203.4
#> 1-1940    1 1940 461.2 4643.9   207.2

pdf[3, "inv"] <- NA # set one value to NA in 3rd row (1-1937)
head(pdf)
#>        firm year   inv  value capital
#> 1-1935    1 1935 317.6 3078.5     2.8
#> 1-1936    1 1936 391.8 4661.7    52.6
#> 1-1937    1 1937    NA 5387.1   156.9
#> 1-1938    1 1938 257.7 2792.2   209.2
#> 1-1939    1 1939 330.8 4313.2   203.4
#> 1-1940    1 1940 461.2 4643.9   207.2
nrow(pdf) # 200
#> [1] 200

mod <- plm(inv ~ value + capital, data = pdf, model = "within")

head(mod$model) # no entry for 1-1937
#>          inv  value capital
#> 1-1935 317.6 3078.5     2.8
#> 1-1936 391.8 4661.7    52.6
#> 1-1938 257.7 2792.2   209.2
#> 1-1939 330.8 4313.2   203.4
#> 1-1940 461.2 4643.9   207.2
#> 1-1941 512.0 4551.2   255.2

nrow(mod$model) # 199 rows
#> [1] 199
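
The same check can be applied to the situation in the question, where the NAs come from a lagged regressor. This is a sketch under assumptions: it reuses the Grunfeld data rather than the asker's data, and the column name value_lag1 is made up for illustration. It relies on plm's lag() method for pseries, which lags within each individual, so the first period of every firm becomes NA.

```r
library(plm)

data("Grunfeld", package = "plm")
pdf <- pdata.frame(Grunfeld, index = c("firm", "year"))

# Panel-aware lag: the first year of each firm has no predecessor -> NA
# (value_lag1 is a hypothetical column name used only in this sketch)
pdf$value_lag1 <- lag(pdf$value, 1)
n_na <- sum(is.na(pdf$value_lag1))

mod_lag <- plm(inv ~ value_lag1 + capital, data = pdf, model = "within")

# Rows that are complete on the variables used in the formula
vars <- c("inv", "value_lag1", "capital")
n_complete <- sum(complete.cases(as.data.frame(pdf)[, vars]))

# Rows that actually entered the estimation
n_used <- nobs(mod_lag)

# Row names present in the input data but absent from the estimation sample
dropped <- setdiff(rownames(pdf), rownames(mod_lag$model))

print(c(na_in_lag = n_na, complete = n_complete, used = n_used))
print(dropped)
```

If the number of complete rows equals the number of observations used, the only thing that happened to the data is the row-wise dropping described above. Note that dropping the first period of every individual makes the estimation sample shorter but does not impute or otherwise alter the remaining values.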