Extract a string from text based on joined tables using vapply in R-CodePudding

I have a problem solving this real scenario. I have two data tables:

dfEntitiesCountries: a table of 3-letter entities vs their Routing Countries
temp: a big table of shipments each with the following details: a. AWB: identifier of the shipment b. OriginCountry: full name of a shipment origin country c. Route: the entities thru which a shipment has moved separated by a hyphen

dfEntitiesCountries

temp table

I need to figure out the Gateway of each shipment. Gateway is the entity in the Route with Routing Country that comes right after the OriginCountry of the shipment. This means that I need to exclude the first entities in route that correspond to the OriginCountry of the shipment, then take the entity in route whose RountingCountry is different from the OriginCountry.

I wrote the below code. The problem is that it works well if I use it as a standalone function (GetGW). But when I use it in the vapply to apply it on all shipments in temp table I face an error saying:

“Error in [.data.table(dfEntitiesCountries[RouteEntities, on = .(Entity = V1), : i evaluates to a logical vector length 3 but there are 2 rows. Recycling of logical i is no longer allowed as it hides more bugs than is worth the rare convenience. Explicitly use rep(...,length=.N) if you really need to recycle.”

GetGW <- function(RouteEntities,ShptOrgCnt){
  setDT(as.list(RouteEntities))
  GW <-dfEntitiesCountries[RouteEntities,on=.(Entity=V1),nomatch=0][RoutingCountry!=ShptOrgCnt,.(Entity)][[1]][1]%>% as.character()
}

temp[,`:=`(GateWay=vapply(strsplit(Route,"-"),GetGW,"character",ShptOrgCnt=OriginCountry))]

Appreciate any thought to resolve this case using the same method or any other efficient method as I have around 2M shipments in the actual table.

CodePudding user response：

Not a full answer, but would highly recommend looking at tidyverse's stringr package which has a lot of tools that could help with this. Combine with dplyr::across() and/or dplyr::rowwise() to apply to your whole dataset.

CodePudding user response：

Step 1: identify the first abbreviation of each country:

dfEntitiesCountries[dfEntitiesCountries, Entity2 := first(i.Entity), on = .(RoutingCountry), by = .EACHI]
dfEntitiesCountries
#    Entity RoutingCountry Entity2
#    <char>         <char>  <char>
# 1:    AMM              J     AMM
# 2:    DXB            UAE     DXB
# 3:    RUH             SA     JED
# 4:    DOH              Q     DOH
# 5:    LON             UK     LON
# 6:    JED             SA     JED

Step 2: form a temp2 that calculates the gateway for each AWB:

temp2 <- copy(temp)[dfEntitiesCountries, Orig := i.Entity2, on = .(OriginCountry = RoutingCountry)
  ][, strsplit(Route, "-"), by = .(AWB, Orig)
  ][dfEntitiesCountries, Gateway := i.Entity2, on = .(V1 = Entity)
  ][Orig != Gateway,
  ][, .SD[1,], by = .(AWB)]

Step 3: bring this back into the original temp:

temp[temp2, Gateway2 := i.Gateway, on = .(AWB)]
temp
#      AWB OriginCountry       Route GateWay Gateway2
#    <num>        <char>      <char>  <char>   <char>
# 1:  1111             Q     DOH-LON     LON      LON
# 2:  2222            SA RUH-JED-DXB     DXB      DXB
# 3:  3333           UAE DXB-AMM-RUH     AMM      AMM

Data

dfEntitiesCountries <- setDT(structure(list(Entity = c("AMM", "DXB", "RUH", "DOH", "LON", "JED"), RoutingCountry = c("J", "UAE", "SA", "Q", "UK", "SA")), row.names = c(NA, -6L), class = c("data.table", "data.frame")))
temp <- setDT(structure(list(AWB = c(1111, 2222, 3333), OriginCountry = c("Q", "SA", "UAE"), Route = c("DOH-LON", "RUH-JED-DXB", "DXB-AMM-RUH"), GateWay = c("LON", "DXB", "AMM")), row.names = c(NA, -3L), class = c("data.table", "data.frame")))