Home > Software design >  Fixing missing values in my data (a lot of missing values so can't impute)
Fixing missing values in my data (a lot of missing values so can't impute)

Time:11-12

I have a dataframe and in one of the columns i have quite a lot of missing data, i tried to impute these values but as there is so much that is missing it doesn't do a very good job. The column in question here has a value given for roughly every 5 years, normally i would just delete it but i want to see if i can salvage something from it. What im looking to go is to carry forward the value provided, filling in the NA until a new value appears. So for example in my data provided im looking for an output like this;

df
Country_Name year gdp_per_capita freshwaster_production_pc
Albania 1997       717.3800              4.543622e-07 #use this value
Albania 1998       813.7894              4.543622e-07
Albania 1999      1033.2425              4.543622e-07
Albania 2000      1126.6833              4.543622e-07
Albania 2001      1281.6598              4.543622e-07
Albania 2002      1425.1242              5.451047e-07 #new value so now we use this
Albania 2003      1846.1201              5.451047e-07
Albania 2004      2373.5813              5.451047e-07

Im also open to ideas on better ways to deal with this so feel free to suggest anything thats better. I thought about averaging the values and using them but i think this is a better way of showing the changes in time

sample data;

head(df, 30)
   Country_Name year gdp_per_capita freshwaster_production_pc
1                  Albania 1997       717.3800              4.543622e-07
2                  Albania 1998       813.7894                        NA
3                  Albania 1999      1033.2425                        NA
4                  Albania 2000      1126.6833                        NA
5                  Albania 2001      1281.6598                        NA
6                  Albania 2002      1425.1242              5.451047e-07
7                  Albania 2003      1846.1201                        NA
8                  Albania 2004      2373.5813                        NA
9                  Albania 2005      2673.7866                        NA
10                 Albania 2006      2972.7429                        NA
11                 Albania 2007      3595.0381              4.201121e-07
12                 Albania 2008      4370.5399                        NA
13                 Albania 2009      4114.1349                        NA
14                 Albania 2010      4094.3484                        NA
15                 Albania 2011      4437.1426                        NA
16                 Albania 2012      4247.6300              3.876498e-07
17                 Albania 2013      4413.0620                        NA
18                 Albania 2014      4578.6332                        NA
19                 Albania 2015      3952.8025                        NA
20                 Albania 2016      4124.0554                        NA
21                 Albania 2017      4531.0194              3.796820e-07
22                 Albania 2018      5287.6637              3.342199e-07
23                 Albania 2019      5396.2159                        NA
24                 Albania 2020      5332.1605                        NA
25                 Albania 2021      6494.3857                        NA
26                 Algeria 1997      1619.7977              1.773179e-07
27                 Algeria 1998      1596.0039                        NA
28                 Algeria 1999      1588.3489                        NA
29                 Algeria 2000      1765.0271                        NA
30                 Algeria 2001      1740.6067                        NA
31                 Algeria 2002      1781.8289              1.897217e-07
32                 Algeria 2003      2103.3813                        NA
33                 Algeria 2004      2610.1854                        NA
34                 Algeria 2005      3113.0949                        NA
35                 Algeria 2006      3478.7100                        NA

CodePudding user response:

library(zoo)
df$freshwaster_production_pc=na.locf(df$freshwaster_production_pc)

CodePudding user response:

You can also consider using linear interpolation with year as the independent variable. Linear interpolation will "draw a line" between values separated by year and use the equation of the line to assign values based on the year of the missing observation.

A convenient implementation for filling missing values in this matter is zoo::na.approx.

  •  Tags:  
  • r
  • Related