Removing NAs when mutating a new variable by_group (dplyr)

Time:05-23

I am working on a uni project with EU-SILC data. I want to create a new variable where all households are assigned to their corresponding housing cost group to create a stacked density plot with the income distribution in relation to housing cost.

I encountered two problems:

  1. I cannot create the variable hcost_group because my housing cost variable, which is the basis for assigning the households to the groups, has 47 NAs (out of nearly 70,000 observations). I tried many different things to remove the NAs when creating the new variable, but I keep getting an error message.
  2. As I don't want to remove the households without a housing cost from the data altogether, the hcost_group variable will be shorter than my income variable. How can I exclude, just for the plot, the income of the households for which I don't have a housing cost?

Thanks a lot in advance!

Here is my code (including error messages) for creating the variable and the plot:

data <- data %>% filter(!is.na(hcost)) %>% group_by(country) %>% 
    mutate(hcost_group = quantcut(hcost, q=c(0.1, 0.2, 0.3, 0.4)))
Error: Problem with `mutate()` column `hcost_group`.
i `hcost_group = quantcut(hcost, q = c(0.1, 0.2, 0.3, 0.4))`.
x missing value where TRUE/FALSE needed
i The error occurred in group 6: country = "UK".
Run `rlang::last_error()` to see where the error occurred.

> 
> ggplot(data=data, aes(x=decile, group=hcost_group, fill=hcost_group)) +
    geom_density(adjust=1.5, position="fill") +
    facet_wrap(~country) +
    xlab("Einkommensdezil") +
    ylab("Anteil der Gruppen nach Wohnkostenbelastung") +
    scale_fill_discrete(name = "Wohnkostenbelastung (Anteil der Wohnkosten am EK)",
                        labels = 
                          c("0-10%", "10-20%","20-30%",
                            "30-40%", "40-100%"))
Error in FUN(X[[i]], ...) : object 'hcost_group' not found

I also tried "na.rm = TRUE", "na.omit()" and "complete.cases()".

CodePudding user response:

For the first problem, I believe the issue is not the NAs (though it is hard to say without seeing the data). It seems that your quantcut call is not getting the correct q parameter: q expects an integer...

For the second problem, make a separate data frame with the filtered data and use that for the plot.

Would it not also be possible to do your mutate before the group_by?
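
A minimal sketch of the "separate data frame" idea, assuming the columns country, decile and hcost from the question. The toy values and the name plot_data are made up for illustration. Because the legend labels in the question (0-10%, 10-20%, ...) look like fixed cost shares rather than quantiles, this sketch swaps in base R's cut() with fixed breaks instead of quantcut(). That only makes sense if hcost is a share between 0 and 1, which the question does not confirm.

```r
library(dplyr)

# Toy stand-in for the EU-SILC data (hypothetical values)
data <- tibble(
  country = c("AT", "AT", "AT", "UK", "UK", "UK"),
  decile  = c(1, 5, 9, 2, 6, 10),
  hcost   = c(0.05, 0.25, NA, 0.15, 0.35, 0.80)
)

# Separate data frame used only for the plot; `data` keeps every household
plot_data <- data %>%
  filter(!is.na(hcost)) %>%
  mutate(
    # Assumption: hcost is the share of housing cost in income (0 to 1)
    hcost_group = cut(
      hcost,
      breaks = c(0, 0.1, 0.2, 0.3, 0.4, 1),
      labels = c("0-10%", "10-20%", "20-30%", "30-40%", "40-100%"),
      include.lowest = TRUE
    )
  )
```

The plot can then use ggplot(data = plot_data, ...), while everything else keeps working on the full data.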

CodePudding user response:

Does this have something to do with the random before the mutate() call?

data <- data %>%
    drop_na(hcost) %>%
    group_by(country) %>%
    mutate(
        hcost_group = quantcut(hcost, q = c(.1, .2, .3, .4))
    )

I would also ensure that hcost is stored as a numeric vector.
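
A rough way to check this, and to look at the group named in the error message, might be the following. The toy data and the name checks are hypothetical. The idea that coinciding quantile breaks within one country could be behind the error is only an assumption worth testing, not something the question confirms.

```r
library(dplyr)

# Toy data (hypothetical): hcost was read in as text, "UK" has many identical values
data <- tibble(
  country = rep(c("AT", "UK"), each = 6),
  hcost   = c("0.05", "0.12", "0.21", "0.33", "0.47", "0.60",
              "0", "0", "0", "0", "0.30", NA)
)

# Make sure hcost is numeric before cutting it into groups
data <- data %>% mutate(hcost = as.numeric(hcost))

# Per-country diagnostics: missing values, distinct values, and whether
# the requested quantile breaks fall on the same value
checks <- data %>%
  group_by(country) %>%
  summarise(
    n_missing         = sum(is.na(hcost)),
    n_distinct_values = n_distinct(hcost, na.rm = TRUE),
    tied_breaks       = anyDuplicated(
      quantile(hcost, c(0.1, 0.2, 0.3, 0.4), na.rm = TRUE)
    ) > 0
  )
```

If tied_breaks is TRUE for the failing country, the quantile breaks there are not unique, and that group would be the first place to look.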
