Update incomplete strings in dataset1 using complete strings in dataset2 where match of first few ch-CodePudding

I have the task of updating incomplete string fields (UK postcode data which includes the first few characters only for each postcode) by matching to complete UK postcode data from another dataset. Initial code to read the two datasets in to R is:

#Call necessary packages
library(tidyverse)
library (readxl)
    
#Read in the incomplete postcode data 
HomePostcode <- read_excel("C:/Users/JWP/IncompletePostcodeData.xlsx", 
                               sheet = "IncompletePostcodeData", col_types = c("skip", 
                                                                        "text", "skip"))
    
#Read in full UK postcode data
AllUKPostcodes <- read_excel("C:/Users/JWP/IncompletePostcodeData.xlsx", 
                               sheet = "AllUKPostcodesData", col_types = c("skip", 
                                                                        "text", "skip"))

HomePostcode data (example):

Postcode
PE2
LE7

AllUKPostcodes data (example):

Postcode
PE2 4LR
PE2 8BZ
LE7 2DL
LE7 2LY

What I need to do is have R update the incomplete postcodes from 'HomePostcode' with any row where there is an observed match from the first few characters between the two datasets - the first match would be sufficient but it could be any random match if that is easier.

So in this instance example output could be:

Postcode
PE2 4LR
LE7 2DL

Thank you in advance for any assistance.

CodePudding user response：

A dplyr solution would be to add a key to your AllUKPostcodes dataframe consisting of the characters before the first space using gsub and arbitrarily picking a row from that result.

You can then join your HomePostcode dataframe to it to get the correct output.

HomePostcode <- data.frame(Postcode = c("PE2", "LE7"))
AllUKPostcodes <- data.frame(Postcode = c("PE2 4LR",
                                          "PE2 8BZ",
                                          "LE7 2DL",
                                          "LE7 2LY"))
library(dplyr)
AllUKPostcodes %>%
  group_by(key = gsub("\\s. ", "", Postcode)) %>%
  filter(row_number() == 1) %>%
  inner_join(HomePostcode, by = c("key" = "Postcode")) %>%
  ungroup %>%
  select(Postcode)
#> # A tibble: 2 × 1
#>   Postcode
#>   <fct>   
#> 1 PE2 4LR 
#> 2 LE7 2DL

CodePudding user response：

As address/postcode data can be large enough that you notice how long operations take to run, I would use data.table for this.

library(data.table)

setDT(HomePostcode)
setDT(AllUKPostcodes)

setnames(HomePostcode, "Postcode", "Outcode")
AllUKPostcodes[, Outcode := gsub("\\s. ", "", Postcode)]

HomePostcode[AllUKPostcodes, 
    mult = "first", on = "Outcode", nomatch=0L
]

#    Outcode Postcode
#     <char>   <char>
# 1:     PE2  PE2 4LR
# 2:     PE2  PE2 8BZ
# 3:     LE7  LE7 2DL
# 4:     LE7  LE7 2LY

data.table syntax can be quite terse. For clarity, the actual function being called here is data.table::foverlaps. mult = "first" means return the first match if there are multiple (you could change it to "last" or "all"). You may also want to change the nomatch parameter:

When a row (with interval say, [a,b]) in x has no match in y, nomatch=NA (default) means NA is returned for y's non-by.y columns for that row of x. nomatch=NULL (or 0 for backward compatibility) means no rows will be returned for that row of x

Note that this approach joins on outcodes, using Outcode := gsub("\\s. ", "", Postcode)]. The assumption is that the incode and outcode are separated by a space (and that this is the first space that appears in the postcode).

If this is not the case, you will need a different method. For example, take the following codes:

codes  <- c("PE24LR", "SW1A 0AA", "PE28BZ", "LE7 2DL", "LE72LY", "SW1A2PQ", "SW1VE 4RZ")

In this case, I think the best thing to do would be to assume the incode will always be three characters (which it should be), and write a function based on VvdL's substr() approach:

get_last_n_chars  <- function(x, n) {
    x_length  <- nchar(x)
    substr_start  <- x_length - (n-1)
    substr(x, substr_start, x_length)    
}

get_last_n_chars(codes, 3)

# [1] "4LR" "0AA" "8BZ" "2DL" "2LY" "2PQ" "4RZ"