I have the task of updating incomplete string fields (UK postcode data which includes the first few characters only for each postcode) by matching to complete UK postcode data from another dataset. Initial code to read the two datasets in to R is:
#Call necessary packages
library(tidyverse)
library (readxl)
#Read in the incomplete postcode data
HomePostcode <- read_excel("C:/Users/JWP/IncompletePostcodeData.xlsx",
sheet = "IncompletePostcodeData", col_types = c("skip",
"text", "skip"))
#Read in full UK postcode data
AllUKPostcodes <- read_excel("C:/Users/JWP/IncompletePostcodeData.xlsx",
sheet = "AllUKPostcodesData", col_types = c("skip",
"text", "skip"))
HomePostcode data (example):
Postcode |
---|
PE2 |
LE7 |
AllUKPostcodes data (example):
Postcode |
---|
PE2 4LR |
PE2 8BZ |
LE7 2DL |
LE7 2LY |
What I need to do is have R update the incomplete postcodes from 'HomePostcode' with any row where there is an observed match from the first few characters between the two datasets - the first match would be sufficient but it could be any random match if that is easier.
So in this instance example output could be:
Postcode |
---|
PE2 4LR |
LE7 2DL |
Thank you in advance for any assistance.
CodePudding user response:
A dplyr
solution would be to add a key to your AllUKPostcodes
dataframe consisting of the characters before the first space using gsub
and arbitrarily picking a row from that result.
You can then join your HomePostcode
dataframe to it to get the correct output.
HomePostcode <- data.frame(Postcode = c("PE2", "LE7"))
AllUKPostcodes <- data.frame(Postcode = c("PE2 4LR",
"PE2 8BZ",
"LE7 2DL",
"LE7 2LY"))
library(dplyr)
AllUKPostcodes %>%
group_by(key = gsub("\\s. ", "", Postcode)) %>%
filter(row_number() == 1) %>%
inner_join(HomePostcode, by = c("key" = "Postcode")) %>%
ungroup %>%
select(Postcode)
#> # A tibble: 2 × 1
#> Postcode
#> <fct>
#> 1 PE2 4LR
#> 2 LE7 2DL
CodePudding user response:
As address/postcode data can be large enough that you notice how long operations take to run, I would use data.table
for this.
library(data.table)
setDT(HomePostcode)
setDT(AllUKPostcodes)
setnames(HomePostcode, "Postcode", "Outcode")
AllUKPostcodes[, Outcode := gsub("\\s. ", "", Postcode)]
HomePostcode[AllUKPostcodes,
mult = "first", on = "Outcode", nomatch=0L
]
# Outcode Postcode
# <char> <char>
# 1: PE2 PE2 4LR
# 2: PE2 PE2 8BZ
# 3: LE7 LE7 2DL
# 4: LE7 LE7 2LY
data.table
syntax can be quite terse. For clarity, the actual function being called here is data.table::foverlaps. mult = "first"
means return the first match if there are multiple (you could change it to "last"
or "all"
). You may also want to change the nomatch
parameter:
When a row (with interval say, [a,b]) in x has no match in y, nomatch=NA (default) means NA is returned for y's non-by.y columns for that row of x. nomatch=NULL (or 0 for backward compatibility) means no rows will be returned for that row of x
Note that this approach joins on outcodes, using Outcode := gsub("\\s. ", "", Postcode)]
. The assumption is that the incode and outcode are separated by a space (and that this is the first space that appears in the postcode).
If this is not the case, you will need a different method. For example, take the following codes:
codes <- c("PE24LR", "SW1A 0AA", "PE28BZ", "LE7 2DL", "LE72LY", "SW1A2PQ", "SW1VE 4RZ")
In this case, I think the best thing to do would be to assume the incode will always be three characters (which it should be), and write a function based on VvdL's substr()
approach:
get_last_n_chars <- function(x, n) {
x_length <- nchar(x)
substr_start <- x_length - (n-1)
substr(x, substr_start, x_length)
}
get_last_n_chars(codes, 3)
# [1] "4LR" "0AA" "8BZ" "2DL" "2LY" "2PQ" "4RZ"