Home > Software engineering >  datatable: process 2 frames
datatable: process 2 frames

Time:03-28

data_df = pd.DataFrame({"AAA": [1, 2, 1, 3],
                        "BBB": [1, 1, 2, 2],
                        "CCC": [2, 1, 3, 1]})

lookup_df = pd.DataFrame({"key": [1,2,3], "value" : ["Alpha", "Beta", "Charlie"]})

data_dt = dt.Frame(data_df)
lookup_dt = dt.Frame(lookup_df)

I have these 2 datatables; one containing the data and another containing the lookup values.

I am trying to produce a result datatable with additional columns with lookups done.

For example:

AAA BBB CCC AAA_Category BBB_Category CCC_Category
1   1   2   Alpha        Alpha        Beta

I can do that by converting the lookup into a list of tuples and doing something like

for name, cond, value in conditions:
   data_dt[cond, f"{name}_category"] = value

Is there anyway to achieve this is in a more native vector "datatable" way.

CodePudding user response:

You can set a key on lookup_dt, and then join, each time renaming the target column to key, and then renaming back; use dt.cbind to bind the resulting list of frames

lookup_dt.key = "key"

dt.cbind(
    [data_dt[:,{"key":f[c]}][:,:, join(lookup_dt)][:, {c:f.key,c "category":f.value}] for c in data_dt.names]
)

Output:

   |   AAA  AAAcategory    BBB  BBBcategory    CCC  CCCcategory
   | int64  str32        int64  str32        int64  str32      
--   -----  -----------  -----  -----------  -----  -----------
 0 |     1  Alpha            1  Alpha            2  Beta       
 1 |     2  Beta             1  Alpha            1  Alpha      
 2 |     1  Alpha            2  Beta             3  Charlie    
 3 |     3  Charlie          2  Beta             1  Alpha      
[4 rows x 6 columns]

Comment: The lack of a reshape long / reshape wide capability is one of the key missing elements of py-datatable, in my opinion. A natural approach to your problem is to make the data_dt long, and do a single merge with the lookup_dt, and then if desired, reshape the result back to wide format. This is done below, but is a workaround, given the lack of built-in reshape long / reshape wide:

# Set key
lookup_dt.key = "key"

# Reshape long and do a single join
long_dt = dt.rbind([data_dt[:,{"key":f[c], "cat":c}] for c in data_dt.names])[:,:,join(lookup_dt)]

# Reshape wide and rename to your desired column names
dt.cbind([long_dt[f.cat==c,{c:f.key,c "category":f.value}] for c in data_dt.names])

CodePudding user response:

You can avoid flipping from wide to long and back to wide for this particular scenario by taking advantage of the replace function, before recombining. This should be more performant, although it does involve a couple of steps, and me ripping into some of your existing data:

from datatable import dt, f
# ripping here
# if you can access a dictionary for your lookup
# which is a what a lookup is, then do so
key = [1,2,3]
value =  ["Alpha", "Beta", "Charlie"]
# the replace function is pretty strict
# you can only replace like type for type
# so string for string, int for int, ...
mapping = dict(zip(map(str, key), value))

# create another dataframe to munge
temp = data_dt.copy()
# convert to string, so that replace can work
temp = temp[:, dt.str32(f[:])] # you can also use as_type instead
# replace is an in place operation; no need for reassignment
temp.replace(mapping)
# change the column names :
temp.names = [f"{name}_category" for name in temp.names]
 
temp

   | AAA_category  BBB_category  CCC_category
   | str32         str32         str32
--   ------------  ------------  ------------
 0 | Alpha         Alpha         Beta
 1 | Beta          Alpha         Alpha
 2 | Alpha         Beta          Charlie
 3 | Charlie       Beta          Alpha
[4 rows x 3 columns]

Now, you can combine data_dt and temp with dt.cbind:

dt.cbind([data_dt, temp])

   |   AAA    BBB    CCC  AAA_category  BBB_category  CCC_category
   | int64  int64  int64  str32         str32         str32
--   -----  -----  -----  ------------  ------------  ------------
 0 |     1      1      2  Alpha         Alpha         Beta
 1 |     2      1      1  Beta          Alpha         Alpha
 2 |     1      2      3  Alpha         Beta          Charlie
 3 |     3      2      1  Charlie       Beta          Alpha
[4 rows x 6 columns]
  • Related