I've been goggling for hours and I'm not sure where to find the answer for something simple like this, so I hope this is not a duplicate question.
I have a large data frame (936848 x 12) with one column is a coded name from which I can derive the value of other column, in this case the year of manufactured based on the first character of column Code.
Small sample of the data frame:
df <- data.frame(Code = c("AX123", "CL199", "GH679"),
Year = c(NA, "2014", "2018"))
I just want to mutate the column Year based on column Code only if the value is missing. I did not want to overwrite existing value in column Year.
Since this also involve identifying the first alphabet in the string in Code, I use case_when
and startsWith
:
df <- df %>%
filter(is.na(Year)) %>%
mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
))
This will gives out this result:
Code Year
1 AX123 2013
My problem is the way I write this filter out all non NA row in the dataframe. I want to keep the dataframe as it is, only to fill the NA row.
I'm thinking of nesting this to ifelse function, to mutate only if the column is NA, but I'm confusing myself on how to write it.
df <- df %>%
mutate(ifelse(is.na(Year),
case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_
)), "")
which will obviously give this error
Error: Problem with `mutate()` input `..1`.
i `..1 = ifelse(...)`.
x argument "no" is missing, with no default
I have a lot of similar tasks where I need to use ifelse
, grepl
, substring
etc to detect the character in the code column and fill missing NA in other column. But because a lot of the row that are already fill with values are due to the exception from rules that does not follow the coded name convention, I did not want to overwrite them.
CodePudding user response:
A base R alternative:
# option 1: readable version
ix <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)])
df$Year[is.na(df$Year)] <- ix 2012
# option 2: direct version
df$Year[is.na(df$Year)] <- match(substr(df$Code[is.na(df$Year)],1,1), LETTERS[c(1,3:8,10)]) 2012
which gives the following result:
> df Code Year 1 AX123 2013 2 CL199 2014 3 GH679 2018
CodePudding user response:
You almost got it. ifelse
requires 3 arguments:
- test (in your case:
is.na()
) - yes (in your case: replace with Year, as per starting character)
- no (in your case: copy
Year
)
df %>%
mutate(Year1 = ifelse(is.na(Year),
case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
), Year))
Output:
Code Year Year1
1 AX123 <NA> 2013
2 CL199 2014 2014
3 GH679 2018 2018
Example for no matching letter, as asked for in the comments:
df <- data.frame(Code = c("AX123", "CL199", "GH679", "XX485"),
Year = c(NA, "2014", "2018", NA))
df %>%
mutate(Year1 = ifelse(is.na(Year),
case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
), Year))
Output
Code Year Year1
1 AX123 <NA> 2013
2 CL199 2014 2014
3 GH679 2018 2018
4 XX485 <NA> <NA>
CodePudding user response:
If you only want to manipulate a part of a dataframe, you can index parts of it in the left part of any "<-" assignment.
you can define these parts by brackets []
behind the dataframe:
df[rows,columns]
more on indexing: https://stats.oarc.ucla.edu/r/modules/subsetting-data/
In your case this could be:
df[is.na(df$Year),] <- df %>%
filter(is.na(Year)) %>%
mutate(Year = case_when(startsWith(Code, "A") ~ 2013,
startsWith(Code, "C") ~ 2014,
startsWith(Code, "D") ~ 2015,
startsWith(Code, "E") ~ 2016,
startsWith(Code, "F") ~ 2017,
startsWith(Code, "G") ~ 2018,
startsWith(Code, "H") ~ 2019,
startsWith(Code, "J") ~ 2020,
TRUE ~ NA_real_))
CodePudding user response:
Here is a different approach, using a lookup-table and an update join. Should perform pretty fast.
df <- data.frame(Code = c("AX123", "CL199", "GH679"),
Year = c(NA, 2014, 2018))
library(data.table)
# Create lookup table with regexes and years
lookup <- data.table(id = LETTERS[c(1,3:8,10)], newYear = 2013:2020)
# Make df a data.table
setDT(df)
# Get the first letter of Code-column, to join on
df[, temp := substr(Code, 1, 1)]
# perform by-reference update join
df[is.na(Year), Year := lookup[df[is.na(Year), ], newYear, on = .(id = temp)]][]
# remove temp
df[, temp := NULL]
# Code Year
# 1: AX123 2013
# 2: CL199 2014
# 3: GH679 2018
CodePudding user response:
Here is an alternative approach:
- Create a named vector
replacement
- create a
pattern
to match - Use an
ifelse
statement withstr_detect
andmatch
replacement <- 2013:2020
names(replacement) <- LETTERS[c(1, 3:9)]
pattern <- paste(names(replacement), collapse = '|')
library(dplyr)
library(stringr)
df %>%
mutate(helper = substring(Code, 1, 1),
Year = ifelse(is.na(Year) & str_detect(helper, pattern),
replacement[match(helper, names(replacement))], Year)) %>%
select(-helper)
Code Year
1 AX123 2013
2 CL199 2014
3 GH679 2018