Get codes for lowered forms in the same data.table: Faster implementation wanted-CodePudding

I have a data.table x with approx. 2M rows and 2 columns: entry and code. entry contains character values that might be repeated elsewhere in entry in lowered form. Also, all-lower entries exist, that do not have a non-lowered form. Each row in x has a unique code in column code. I've created a reproducible example with only 6 rows.

library(data.table)
x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd"),
                code = c(1, 2, 3, 4, 5, 6))
x
   entry code
1:   Aaa    1
2:   Bbb    2
3:   Ccc    3
4:   aaa    4
5:   bbb    5
6:   ddd    6

As you can see, Aaa and Bbb also come with their lowered "counterparts" aaa and bbb. Ccc does not have a lowered form and ddd does not have a non-lowered form.

I now want to assign a new column low_code which holds the code for the lowered counterpart if there is one. If not, the code should remain the same. This is what my goal would look like in this example:

   entry code low_code
1:   Aaa    1        4
2:   Bbb    2        5
3:   Ccc    3        3
4:   aaa    4        4
5:   bbb    5        5
6:   ddd    6        6

As you can see, the codes for aaa and bbb get assigned for Aaa and Bbb. For Ccc, there is no lowered version, so the code remains the same.

What I have done so far is this (and this works):

x$low_code <- sapply(x$entry, USE.NAMES = F, FUN = function (e) {
  low_e <- tolower(e)
  ret <- x[x$entry == low_e,][["code"]]
  if (length(ret) == 0) { # If no low entry is found... 
    ret <- x[x$entry == e,][["code"]] # ...return original code
  } 
  stopifnot(length(ret) %in% c(0, 1)) # Too many return values (should never happen)
  ret
})

However, it is extremely slow for my original data.table with more than 2 million rows. I suspect this is due to the indexing for every single entry (e in the code above).

I would be very surprised if there isn't a considerably faster option (maybe leveraging data.table syntax) but I am out of ideas. Any help is much appreciated - if you have the time, please also explain your solution with a few words so I can learn from it. As always, I hope that I didn't overlook any answers solving a similar problem. Thank you so much!

CodePudding user response：

What about the following?

rows = grep("^[A-Z]", x$entry)
x[(rows), lentry := tolower(entry)]
x[(rows), low_code := x[(-rows)][.SD, on = .(entry = lentry), x.code]]
rm(rows)
x[, lentry := NULL]
x[is.na(low_code), low_code := code]

#     entry  code low_code
#    <char> <num>    <num>
# 1:    Aaa     1        4
# 2:    Bbb     2        5
# 3:    Ccc     3        3
# 4:    aaa     4        4
# 5:    bbb     5        5
# 6:    ddd     6        6

CodePudding user response：

library(data.table)
x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd"), code = c(1, 2, 3, 4, 5, 6))
x[, low_entry := tolower(entry)
  ][entry == low_entry, low_entry := NA]
x[x, low_code := i.code, on = .(low_entry == entry)
  ][, low_entry := NULL
  ][, low_code := fcoalesce(low_code, code)]
x
#     entry  code low_code
#    <char> <num>    <num>
# 1:    Aaa     1        4
# 2:    Bbb     2        5
# 3:    Ccc     3        3
# 4:    aaa     4        4
# 5:    bbb     5        5
# 6:    ddd     6        6

This should work even if there are multiple lower entry values. For instance,

x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd", "aaa"), code = c(1, 2, 3, 4, 5, 6, 7))
x[, low_entry := tolower(entry)
  ][entry == low_entry, low_entry := NA]
x[x, low_code := i.code, on = .(low_entry == entry)
  ][, low_entry := NULL
  ][, low_code := fcoalesce(low_code, code)]
x
#     entry  code low_code
#    <char> <num>    <num>
# 1:    Aaa     1        7   # <-- reflects the _last_ `"aaa"` found
# 2:    Bbb     2        5
# 3:    Ccc     3        3
# 4:    aaa     4        4   # <-- own `code`
# 5:    bbb     5        5
# 6:    ddd     6        6
# 7:    aaa     7        7   # <-- own `code`

CodePudding user response：

merge full dataset with all lower-case subset, non-matches NAs get updated with existing code:

merge(x[, .(rn = 1:.N, entry_low = tolower(entry), entry, code) ],
      x[ tolower(entry) == entry, .(entry_low = entry, code_low = code)], 
      by = "entry_low", all.x = TRUE
      )[ is.na(code_low), code_low := code 
         ][ order(rn), .(entry, code, code_low)]
#    entry code code_low
# 1:   Aaa    1        4
# 2:   Bbb    2        5
# 3:   Ccc    3        3
# 4:   aaa    4        4
# 5:   bbb    5        5
# 6:   ddd    6        6

CodePudding user response：

If you want to keep the original data.table untouched this might help.

first generate the tolower grouping entries,
then select the corresponding code values and
last put code for NA values if no grouping was found.

x[, .(code, entry, l_entry = tolower(entry))
  ][, .(entry, l = last(code[l_entry == entry]), code), by = l_entry
  ][, .(entry, code, low_code = ifelse(is.na(l), code, l))]
   entry code low_code
1:   Aaa    1        4
2:   aaa    4        4
3:   Bbb    2        5
4:   bbb    5        5
5:   Ccc    3        3
6:   ddd    6        6

If more lower entries (e.g. aaa) with different values are present it takes the last one. Can be changed to first, too.