Filtering data and saving it using base R/dplyr

Time:10-29

I have a dataset that I am trying to filter to get a subset of the data based on categories.

df_clean = filter(df_clean, City  %in% c("Chicago","312CHICAGO", "CCHICAGO", "CHCHICAGO",
                              "CHCICAGO","chicago", "Chicago", "CHicago", "CHICAGO",
                              "CHICAGOC","CHICAGOCHICAGO", "CHICAGOI",
                              "CHICAGOO", "312CHICAGO"  ))

City is a categorical variable with many different levels (cities). I'd like to filter it to show only Chicago (and the associated misspellings found in the dataset). The filter appears not to be working: when I check the levels after filtering, I get back the same levels as before applying the filter. No clue what I am doing wrong.

I've also tried filtering another column/categorical variable, Risk, and this is also not working. Risk has the following levels:

Risk 1 (High), Risk 2 (Medium), Risk 3 (Low), ALL, Null

I had to resort to using droplevels(df_clean$Risk), which worked, but I am not sure why.

df_clean = df_clean[df_clean$Risk %in% c("Risk 1 (High)", "Risk 2 (Medium)", "Risk 3 (Low)"), ]

Clearly I am confused when it comes to filtering. What am I doing wrong?

User response:

When a column is defined as a factor, it is really a vector of integer indices pointing into a dictionary of strings (the levels). Even after you filter some values out of the column, the data frame keeps those values (and their indices) in the column's dictionary in case they are added back in the future.
So in order to remove unused levels, one needs to use the droplevels() function, which removes the unused levels and reassigns the indices of the remaining ones.

Maybe this code will demonstrate:

demo <- data.frame(id=c(1, 2, 3), animal=c("dog", "cat", "pig"), stringsAsFactors = TRUE)
str(demo)
# levels are sorted alphabetically: 1=cat, 2=dog, 3=pig
as.integer(demo$animal)      # 2 1 3

# remove the rows for one level
reduced <- demo[demo$animal != "cat",]
reduced
as.integer(reduced$animal)   # 2 3
levels(reduced$animal)       # "cat" is still listed
# still 1=cat, 2=dog, 3=pig

# drop the unused level
reduced$animal <- droplevels(reduced$animal)
as.integer(reduced$animal)   # 1 2
# now 1=dog, 2=pig
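
The same idea applied to a data frame shaped like the one in the question might look like the sketch below. The rows and the names chicago_spellings, chicago_rows and chicago_tidy are made up for illustration; only the City and Risk column names come from the question. It suggests that the row filter itself works, and that levels() is simply not the right check until droplevels() has been called.

```r
# Hypothetical stand-in for the asker's data; the values are invented.
df_clean <- data.frame(
  City = c("CHICAGO", "Chicago", "EVANSTON", "CHCICAGO", "SKOKIE"),
  Risk = c("Risk 1 (High)", "ALL", "Risk 2 (Medium)",
           "Risk 3 (Low)", "Risk 1 (High)"),
  stringsAsFactors = TRUE
)

# Shortened, illustrative list of accepted spellings.
chicago_spellings <- c("CHICAGO", "Chicago", "CHCICAGO")

# The row filter works: only the matching rows remain.
chicago_rows <- df_clean[df_clean$City %in% chicago_spellings, ]
nrow(chicago_rows)                       # 3 rows
unique(as.character(chicago_rows$City))  # only Chicago spellings

# levels() still reports every original level, used or not.
levels(chicago_rows$City)

# droplevels() on the whole data frame drops the unused levels
# in every factor column at once (here both City and Risk).
chicago_tidy <- droplevels(chicago_rows)
levels(chicago_tidy$City)
levels(chicago_tidy$Risk)
```

Since droplevels() also accepts a data frame, the same call can follow a dplyr filter() instead of the base R subset.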