Using Mutate and Case_When only to fill NA row-CodePudding

I've been goggling for hours and I'm not sure where to find the answer for something simple like this, so I hope this is not a duplicate question.

I have a large data frame (936848 x 12) with one column is a coded name from which I can derive the value of other column, in this case the year of manufactured based on the first character of column Code.

Small sample of the data frame:

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, "2014", "2018"))

I just want to mutate the column Year based on column Code only if the value is missing. I did not want to overwrite existing value in column Year.

Since this also involve identifying the first alphabet in the string in Code, I use case_when and startsWith:

df <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
                          ))

This will gives out this result:

   Code Year
1 AX123 2013

My problem is the way I write this filter out all non NA row in the dataframe. I want to keep the dataframe as it is, only to fill the NA row.

I'm thinking of nesting this to ifelse function, to mutate only if the column is NA, but I'm confusing myself on how to write it.

df <- df %>%
  mutate(ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_
  )), "")

which will obviously give this error

Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default

I have a lot of similar tasks where I need to use ifelse, grepl, substring etc to detect the character in the code column and fill missing NA in other column. But because a lot of the row that are already fill with values are due to the exception from rules that does not follow the coded name convention, I did not want to overwrite them.

CodePudding user response：

A base R alternative:

# option 1: readable version
ix <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)])
df$Year[is.na(df$Year)] <- ix   2012

# option 2: direct version
df$Year[is.na(df$Year)] <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)])   2012

which gives the following result:

> df
   Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018

CodePudding user response：

You almost got it. ifelse requires 3 arguments:

test (in your case: is.na())
yes (in your case: replace with Year, as per starting character)
no (in your case: copy Year)

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

Output:

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018

Example for no matching letter, as asked for in the comments:

df <- data.frame(Code = c("AX123", "CL199", "GH679", "XX485"), 
                 Year = c(NA, "2014", "2018", NA))

df %>%
  mutate(Year1 = ifelse(is.na(Year),
                case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                ), Year))

Output

   Code Year Year1
1 AX123 <NA>  2013
2 CL199 2014  2014
3 GH679 2018  2018
4 XX485 <NA>  <NA>

CodePudding user response：

If you only want to manipulate a part of a dataframe, you can index parts of it in the left part of any "<-" assignment.

you can define these parts by brackets [] behind the dataframe:

df[rows,columns]

more on indexing: https://stats.oarc.ucla.edu/r/modules/subsetting-data/

In your case this could be:

df[is.na(df$Year),] <- df %>%
  filter(is.na(Year)) %>%
  mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
                          startsWith(Code, "C") ~ 2014,
                          startsWith(Code, "D") ~ 2015,
                          startsWith(Code, "E") ~ 2016,
                          startsWith(Code, "F") ~ 2017,
                          startsWith(Code, "G") ~ 2018,
                          startsWith(Code, "H") ~ 2019,
                          startsWith(Code, "J") ~ 2020,
                          TRUE ~ NA_real_))

CodePudding user response：

Here is a different approach, using a lookup-table and an update join. Should perform pretty fast.

df <- data.frame(Code = c("AX123", "CL199", "GH679"), 
                 Year = c(NA, 2014, 2018)) 

library(data.table)
# Create lookup table with regexes and years
lookup <- data.table(id = LETTERS[c(1,3:8,10)], newYear = 2013:2020)
# Make df a data.table  
setDT(df)
# Get the first letter of Code-column, to join on
df[, temp := substr(Code, 1, 1)]
# perform by-reference update join
df[is.na(Year), Year := lookup[df[is.na(Year), ], newYear, on = .(id = temp)]][]
# remove temp
df[, temp := NULL]
# Code Year
# 1: AX123 2013
# 2: CL199 2014
# 3: GH679 2018

CodePudding user response：

Here is an alternative approach:

Create a named vector replacement
create a pattern to match
Use an ifelse statement with str_detect and match


replacement <- 2013:2020
names(replacement) <- LETTERS[c(1, 3:9)]
pattern <- paste(names(replacement), collapse = '|')

library(dplyr)
library(stringr)

df %>% 
  mutate(helper = substring(Code, 1, 1),
         Year = ifelse(is.na(Year) & str_detect(helper, pattern), 
                       replacement[match(helper, names(replacement))], Year)) %>% 
  select(-helper)

  Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018