import pandas as pd
from datatable import dt

data_df = pd.DataFrame({"AAA": [1, 2, 1, 3],
                        "BBB": [1, 1, 2, 2],
                        "CCC": [2, 1, 3, 1]})
lookup_df = pd.DataFrame({"key": [1, 2, 3], "value": ["Alpha", "Beta", "Charlie"]})
data_dt = dt.Frame(data_df)
lookup_dt = dt.Frame(lookup_df)
I have these two datatables: one containing the data and another containing the lookup values.
I am trying to produce a result datatable with additional columns holding the looked-up values.
For example:
AAA  BBB  CCC  AAA_Category  BBB_Category  CCC_Category
  1    1    2  Alpha         Alpha         Beta
I can do that by converting the lookup into a list of tuples and doing something like:
for name, cond, value in conditions:
    data_dt[cond, f"{name}_category"] = value
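where conditions is built from the lookup, roughly like this (one (column name, row filter, label) tuple per data column per key):
from datatable import f

# Rough sketch of the `conditions` list used in the loop above:
# one (column name, row filter, lookup label) tuple per data column per key.
conditions = [
    (name, f[name] == k, v)
    for name in data_dt.names
    for k, v in zip(lookup_df["key"], lookup_df["value"])
]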
Is there any way to achieve this in a more native, vectorized datatable way?
CodePudding user response:
You can set a key on lookup_dt and then join, each time renaming the target column to key and renaming it back afterwards; use dt.cbind to bind the resulting list of frames:
from datatable import dt, f, join

lookup_dt.key = "key"
dt.cbind(
    [data_dt[:, {"key": f[c]}][:, :, join(lookup_dt)][:, {c: f.key, c + "category": f.value}]
     for c in data_dt.names]
)
Output:
   |   AAA  AAAcategory    BBB  BBBcategory    CCC  CCCcategory
   | int64  str32        int64  str32        int64  str32
-- + -----  -----------  -----  -----------  -----  -----------
 0 |     1  Alpha            1  Alpha            2  Beta
 1 |     2  Beta             1  Alpha            1  Alpha
 2 |     1  Alpha            2  Beta             3  Charlie
 3 |     3  Charlie          2  Beta             1  Alpha
[4 rows x 6 columns]
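To unpack what happens for a single column, the chained expression can be written out step by step (a sketch, using AAA as the example column; step1/step2/step3 are just illustrative names):
from datatable import dt, f, join

# Step 1: select the target column under the name "key" so it matches the join key
step1 = data_dt[:, {"key": f.AAA}]
# Step 2: join against the keyed lookup frame, which brings in the "value" column
step2 = step1[:, :, join(lookup_dt)]
# Step 3: rename "key" back to the original column name and "value" to the category column
step3 = step2[:, {"AAA": f.key, "AAAcategory": f.value}]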
Comment: The lack of a reshape long / reshape wide capability is one of the key missing elements of py-datatable, in my opinion. A natural approach to your problem is to make data_dt long, do a single merge with lookup_dt, and then, if desired, reshape the result back to wide format. This is done below, but it is a workaround, given the lack of built-in reshape long / reshape wide:
from datatable import dt, f, join

# Set key
lookup_dt.key = "key"
# Reshape long and do a single join
long_dt = dt.rbind([data_dt[:, {"key": f[c], "cat": c}] for c in data_dt.names])[:, :, join(lookup_dt)]
# Reshape wide and rename to your desired column names
dt.cbind([long_dt[f.cat == c, {c: f.key, c + "category": f.value}] for c in data_dt.names])
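If you want a sanity check on the intermediate long frame (a sketch; the exact row order depends on the rbind of the per-column frames):
# long_dt stacks the three columns, so it should have 12 rows and the columns
# key (original cell value), cat (source column name) and value (the label
# joined in from lookup_dt).
print(long_dt.shape)   # expected: (12, 3)
print(long_dt.names)   # expected: ('key', 'cat', 'value')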
CodePudding user response:
You can avoid flipping from wide to long and back to wide for this particular scenario by taking advantage of the replace function before recombining. This should be more performant, although it does involve a couple of steps and ripping into some of your existing data:
from datatable import dt, f
# ripping here
# if you can access a dictionary for your lookup
# which is essentially what a lookup is, then do so
key = [1,2,3]
value = ["Alpha", "Beta", "Charlie"]
# the replace function is pretty strict
# you can only replace like type for type
# so string for string, int for int, ...
mapping = dict(zip(map(str, key), value))
# create another dataframe to munge
temp = data_dt.copy()
# convert to string, so that replace can work
temp = temp[:, dt.str32(f[:])] # you can also use as_type instead
# replace is an in place operation; no need for reassignment
temp.replace(mapping)
# change the column names:
temp.names = [f"{name}_category" for name in temp.names]
temp
   | AAA_category  BBB_category  CCC_category
   | str32         str32         str32
-- + ------------  ------------  ------------
 0 | Alpha         Alpha         Beta
 1 | Beta          Alpha         Alpha
 2 | Alpha         Beta          Charlie
 3 | Charlie       Beta          Alpha
[4 rows x 3 columns]
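As an aside, the mapping dictionary does not have to be hard-coded; it could be built from lookup_dt itself, for example with Frame.to_list(), which returns one list per column (a sketch):
# Build the same str -> str mapping directly from the lookup frame,
# rather than from hand-written key/value lists.
keys, values = lookup_dt.to_list()           # [[1, 2, 3], ['Alpha', 'Beta', 'Charlie']]
mapping = dict(zip(map(str, keys), values))  # {'1': 'Alpha', '2': 'Beta', '3': 'Charlie'}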
Now, you can combine data_dt and temp with dt.cbind:
dt.cbind([data_dt, temp])
   |   AAA    BBB    CCC  AAA_category  BBB_category  CCC_category
   | int64  int64  int64  str32         str32         str32
-- + -----  -----  -----  ------------  ------------  ------------
 0 |     1      1      2  Alpha         Alpha         Beta
 1 |     2      1      1  Beta          Alpha         Alpha
 2 |     1      2      3  Alpha         Beta          Charlie
 3 |     3      2      1  Charlie       Beta          Alpha
[4 rows x 6 columns]