I have a dataset which is filled with values for certain rows but then missing for others. I want to fill the missing data based on the values from other columns.
df=
OFFENSE GROUP DESCRIPTION UCR
0 3126 NaN ASSAULT NaN
1 3831 NaN PROPERTY DAMAGE NaN
2 724 NaN AUTO THEFT NaN
3 301 NaN ROBBERY NaN
4 619 NaN LARCENY ALL OTHERS NaN
245865 3115 Aggravated Assault ASSAULT Part One
245866 619 Larceny LARCENY ALL OTHERS Part One
245867 2629 Harassment HARASSMENT Part Two
245868 2629 Harassment HARASSMENT Part Two
245869 3208 Property Lost PROPERTY - MISSING Part Three
For example, LARCENY ALL OTHERS should have GROUP as "Larceny" as well as the UCR "Part One". How do I fill these values with their correct values based on the filled columns?
I'm personally only interested in the ones with UCR "Part One", so I have tried filtering out the rows with UCR == 'Part One' and then made a list of the unique values in the DESCRIPTION column. I was then hoping to use a function to fill the other data based on if the values are in that list. But I cant think of what function to do.
dftest <- df[df$UCR == 'Part One',]
offenseslist <- unique(dftest$offence_name)
df %>% mutate(UCR =
case_when(df3$offence_name = offenseslist ~ "Part one")
CodePudding user response:
You can replace the string 'NaN' with NA using NA_if()
, then sort (arrange
) the data by the desired columns so that NA values per GROUP and UCR come last and finally fill
NA with the values one row above.
Example data df:
df <- structure(list(ID = c(0L, 1L, 2L, 3L, 4L, 245865L, 245866L, 245867L,
245868L, 245869L), OFFENSE = c(3126L, 3831L, 724L, 301L, 619L,
3115L, 619L, 2629L, 2629L, 3208L), GROUP = c("NaN", "NaN", "NaN",
"NaN", "NaN", "Aggravated Assault", "Larceny", "Harassment",
"Harassment", "Property Lost"), DESCRIPTION = c("ASSAULT", "PROPERTY DAMAGE",
"AUTO THEFT", "ROBBERY", "LARCENY ALL OTHERS", "ASSAULT", "LARCENY ALL OTHERS",
"HARASSMENT", "HARASSMENT", "PROPERTY - MISSING"), UCR = c("NaN",
"NaN", "NaN", "NaN", "NaN", "Part One", "Part One", "Part Two",
"Part Two", NA)), class = "data.frame", row.names = c(NA, 10L
))
code:
library(tidyr)
library(dplyr)
df %>%
na_if('NaN') %>%
arrange(DESCRIPTION, GROUP, UCR) %>%
fill(GROUP, UCR, .direction = 'down')
Note that fill
only targets NA, hence the initial replacement of 'NaN' with NA.