Mapping tibble columns specified by substrings-CodePudding

I have a tibble with a list of stocks, each has a sector ids, each sector is a string with 8 characters (it is a level 4 GICS sector https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard):

tabl <- tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"))

I also have a tibble that map a SectorId to a ClusterId:

map_tabl <- tibble(ClusterId=c("C1","C1", "C2","C3"), SectorId=c("3010", "3020", "201030", "551020"))

Note that in the cluster mapping we have a mix of sectors defined on the 4 different levels (see https://en.wikipedia.org/wiki/Global_Industry_Classification_Standard), i.e. Sector "3010" contains sector "30101010". The first 2 characters correspond to Level 1, the first 4 to Level 2, the first 6 to Level 3, and the 8 characters to Level 4. So for example in this case "30101010" belongs to the higher level sector "3010", which is in ClusterId="C1". Note that "30101010" is not specified at all in map_tabl, so probably I should use a function that look at substrings, like grepl.

The resulting tibble should be:

tibble(Stock=c("A","B","C","D"), SectorId=c("30101010", "30101010", "20103015", "55102010"), ClusterId=c("C1", "C1", "C2", "C3")

CodePudding user response：

I think we can use a regex (fuzzy) join for this:

library(dplyr)
library(fuzzyjoin) # regex_left_join
map_tabl %>%
  mutate(SectorId = paste0("^", SectorId)) %>%
  regex_left_join(tabl, ., by = "SectorId")
# # A tibble: 4 x 4
#   Stock SectorId.x ClusterId SectorId.y
#   <chr> <chr>      <chr>     <chr>     
# 1 A     30101010   C1        ^3010     
# 2 B     30101010   C1        ^3010     
# 3 C     20103015   C2        ^201030   
# 4 D     55102010   C3        ^551020

fuzzyjoin always keeps both versions of the join variables around, it's easy enough to mutate(SectorId = SectorId.x, SectorId.x = NULL, SectorId.y = NULL) or similar (choosing select(-SectorId.x), etc, also works).

The precondition of SectorId to add the ^ is so that the matches only occur at the beginning of the string.

This does not attempt to limit the number of matches, so if there are multiple rows in map_tabl that might match an entry (e.g., SectorId=c("3010", "301010")), then you will need to define a clear way to choose which of these to retain. For this, I assume either Stock is a unique ID of sorts, or if not then you can add one yourself to make sure you end the operation with the same rows (no dupes) as before the join.

CodePudding user response：

You can match on the first n characters of tabl$SectorId according to map_tabl$SectorId.

m <- match(substr(tabl$SectorId, 1, nchar(map_tabl$SectorId)),
           map_tabl$SectorId)
tabl$ClusterId <- map_tabl$ClusterId[m]

output

# A tibble: 4 × 3
  Stock SectorId ClusterId
  <chr> <chr>    <chr>    
1 A     30101010 C1       
2 B     30101010 C1       
3 C     20103015 C2       
4 D     55102010 C3