How can apply a loess function and get predictions by groups using dplyr in r?-CodePudding

I have this example data set:

data.1 <-read.csv(text = "
country,year,response
Austria,2010,34378
Austria,2011,38123
Austria,2012,37126
Austria,2013,42027
Austria,2014,43832
Austria,2015,56895
Austria,2016,49791
Austria,2017,64467
Austria,2018,67620
Austria,2019,69210
Croatia,2010,56456
Croatia,2011,58896
Croatia,2012,54109
Croatia,2013,47156
Croatia,2014,47104
Croatia,2015,88867
Croatia,2016,78614
Croatia,2017,85133
Croatia,2018,77090
Croatia,2019,78330
France,2010,50939
France,2011,41571
France,2012,37367
France,2013,42999
France,2014,75789
France,2015,122529
France,2016,136518
France,2017,141829
France,2018,153850
France,2019,163800
")

I want to adjust a loess function by country and also obtain the predicted values for each year in the data frame I am providing. The loess smoothing looks like this:

ggplot(data.1, aes(x=year, y=response, color=country))  
  geom_point(size = 3, alpha=0.3)   
  #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1)  
  geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

Plot:

This is the code I tried to get the prediction:

data.1.with.pred <- data.1 %>% 
  group_by(country) %>% 
  arrange(country, year) %>% 
  mutate(pred.response = stats::predict(stats::loess(response ~ year, span = .75, data=.),
                         data.frame(year = seq(min(year), max(year), 1))))

I am getting the predictions in the data frame but the grouping by country is not working.

This is the plot:

ggplot(data.1.with.pred, aes(x=year, y=pred.response, color=country))  
  geom_point(aes(x=year, y=response), size = 3, alpha=0.3)   
  #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1)  
  geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

The problem I have is that the grouping by country is failing. I got this answer from here:

Is this what you were trying to do?

CodePudding user response：

Use the loess function to make a model of the subset of your data like this:

#use a loess model on a subset of the data (France)
    model <- loess(formula = response ~ year,data = subset(data.1,country == "France"))

#plot
    ggplot()  
      geom_point(data = data.1,
                 mapping = aes(x=year, y=response, color=country),size = 3, alpha=0.3)   
      geom_line(aes(model$x,model$fitted))  
      geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

Fitted values are in model$fitted

CodePudding user response：

The problem here is the group_by is not playing well with the mutate/predict function.

In this solution, I split the dataframe, calculated each prediction, then combined and plotted:

#split by country
sdata <-split(data.1, data.1$country)
#calculate the predicted values for each country
data.1.with.pred <- lapply(sdata, function(df){
   df$pred.response  <-stats::predict(stats::loess(response ~ year, span = .75, data=df))
   df
})

#merge back into 1 dataframe
data.1.with.pred <-dplyr::bind_rows(data.1.with.pred )

#data.1.with.pred[order(data.1.with.pred$year),]

ggplot(data.1.with.pred, aes(x=year, y=pred.response, color=country))  
   geom_point(aes(x=year, y=response), size = 3, alpha=0.3)   
   #geom_line(aes(x=year, y=area_harvested_ha/1000), size=0.5, alpha= 1)  
   geom_smooth(method = 'loess', span=0.75, na.rm = T, se=F, size = 2)

CodePudding user response：

Similar to Henry Holm's answer:

library(purrr)

model <- data.1 %>% 
  split(f = .$country) %>% 
  map(~stats::loess(response ~ year, span = .75, data=.x))

creates a model for each country. Now you can access the fitted values via

model$Austria$fitted
#>  [1] 35195.78 36149.17 37988.25 40221.17 47372.73 51220.11 55611.14 61368.08 66159.05 70242.01
model$Croatia$fitted
#>  [1] 59333.25 53963.12 49872.81 45156.89 57061.66 76289.39 86357.84 84047.18 81245.77 76487.97
model$France$fitted
#>  [1]  53011.15  37627.29  35403.63  45360.31  78379.48 117055.05 137189.73 146822.95 155585.16 162336.60