I have a data.table x
with approx. 2M rows and 2 columns: entry
and code
. entry
contains character values that might be repeated elsewhere in entry
in lowered form. Also, all-lower entries exist, that do not have a non-lowered form. Each row in x
has a unique code in column code
. I've created a reproducible example with only 6 rows.
library(data.table)
x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd"),
code = c(1, 2, 3, 4, 5, 6))
x
entry code
1: Aaa 1
2: Bbb 2
3: Ccc 3
4: aaa 4
5: bbb 5
6: ddd 6
As you can see, Aaa
and Bbb
also come with their lowered "counterparts" aaa
and bbb
. Ccc
does not have a lowered form and ddd
does not have a non-lowered form.
I now want to assign a new column low_code
which holds the code for the lowered counterpart if there is one. If not, the code should remain the same. This is what my goal would look like in this example:
entry code low_code
1: Aaa 1 4
2: Bbb 2 5
3: Ccc 3 3
4: aaa 4 4
5: bbb 5 5
6: ddd 6 6
As you can see, the codes for aaa
and bbb
get assigned for Aaa
and Bbb
. For Ccc
, there is no lowered version, so the code remains the same.
What I have done so far is this (and this works):
x$low_code <- sapply(x$entry, USE.NAMES = F, FUN = function (e) {
low_e <- tolower(e)
ret <- x[x$entry == low_e,][["code"]]
if (length(ret) == 0) { # If no low entry is found...
ret <- x[x$entry == e,][["code"]] # ...return original code
}
stopifnot(length(ret) %in% c(0, 1)) # Too many return values (should never happen)
ret
})
However, it is extremely slow for my original data.table with more than 2 million rows. I suspect this is due to the indexing for every single entry (e
in the code above).
I would be very surprised if there isn't a considerably faster option (maybe leveraging data.table syntax) but I am out of ideas. Any help is much appreciated - if you have the time, please also explain your solution with a few words so I can learn from it. As always, I hope that I didn't overlook any answers solving a similar problem. Thank you so much!
CodePudding user response:
What about the following?
rows = grep("^[A-Z]", x$entry)
x[(rows), lentry := tolower(entry)]
x[(rows), low_code := x[(-rows)][.SD, on = .(entry = lentry), x.code]]
rm(rows)
x[, lentry := NULL]
x[is.na(low_code), low_code := code]
# entry code low_code
# <char> <num> <num>
# 1: Aaa 1 4
# 2: Bbb 2 5
# 3: Ccc 3 3
# 4: aaa 4 4
# 5: bbb 5 5
# 6: ddd 6 6
CodePudding user response:
library(data.table)
x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd"), code = c(1, 2, 3, 4, 5, 6))
x[, low_entry := tolower(entry)
][entry == low_entry, low_entry := NA]
x[x, low_code := i.code, on = .(low_entry == entry)
][, low_entry := NULL
][, low_code := fcoalesce(low_code, code)]
x
# entry code low_code
# <char> <num> <num>
# 1: Aaa 1 4
# 2: Bbb 2 5
# 3: Ccc 3 3
# 4: aaa 4 4
# 5: bbb 5 5
# 6: ddd 6 6
This should work even if there are multiple lower entry
values. For instance,
x <- data.table(entry = c("Aaa", "Bbb", "Ccc", "aaa", "bbb", "ddd", "aaa"), code = c(1, 2, 3, 4, 5, 6, 7))
x[, low_entry := tolower(entry)
][entry == low_entry, low_entry := NA]
x[x, low_code := i.code, on = .(low_entry == entry)
][, low_entry := NULL
][, low_code := fcoalesce(low_code, code)]
x
# entry code low_code
# <char> <num> <num>
# 1: Aaa 1 7 # <-- reflects the _last_ `"aaa"` found
# 2: Bbb 2 5
# 3: Ccc 3 3
# 4: aaa 4 4 # <-- own `code`
# 5: bbb 5 5
# 6: ddd 6 6
# 7: aaa 7 7 # <-- own `code`
CodePudding user response:
merge full dataset with all lower-case subset, non-matches NAs get updated with existing code:
merge(x[, .(rn = 1:.N, entry_low = tolower(entry), entry, code) ],
x[ tolower(entry) == entry, .(entry_low = entry, code_low = code)],
by = "entry_low", all.x = TRUE
)[ is.na(code_low), code_low := code
][ order(rn), .(entry, code, code_low)]
# entry code code_low
# 1: Aaa 1 4
# 2: Bbb 2 5
# 3: Ccc 3 3
# 4: aaa 4 4
# 5: bbb 5 5
# 6: ddd 6 6
CodePudding user response:
If you want to keep the original data.table untouched this might help.
- first generate the
tolower
grouping entries, - then select the corresponding code values and
- last put code for
NA
values if no grouping was found.
x[, .(code, entry, l_entry = tolower(entry))
][, .(entry, l = last(code[l_entry == entry]), code), by = l_entry
][, .(entry, code, low_code = ifelse(is.na(l), code, l))]
entry code low_code
1: Aaa 1 4
2: aaa 4 4
3: Bbb 2 5
4: bbb 5 5
5: Ccc 3 3
6: ddd 6 6
If more lower entries (e.g. aaa) with different values are present it takes the last
one. Can be changed to first
, too.