creating rows for several one hot encoded columns (all combinations) to be scored by model

Time:02-20

I start off with my wants in this simplified example:

import pandas as pd

data = {'dg1_1':[1, 0],
        'dg1_2':[0, 1],
        'dg2_1':[0, 1],
        'dg2_2':[1, 0],
        'cont1':[13.0, 13.0]}
wants = pd.DataFrame(data)


I do not actually have this frame; it is what I want to generate. I have two one-hot encoded groups, dg1 and dg2. The example is simplified: dg1 and dg2 can contain different numbers of columns. From some observations (a sample) I can also get the group columns like this:

dg1_indeces = observations.columns[observations.columns.str.startswith("dg1")]
dg2_indeces = observations.columns[observations.columns.str.startswith("dg2")]

Given one observation (ab)using my wants to explain:

one_observation = wants.head(1)

I want to create all possible combinations for one_observation, so that only one column is turned on at a time within each one-hot encoded group. So I can do:

haves = pd.concat([one_observation]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves.loc[:, dg1_indeces] = 0
haves.loc[:, dg2_indeces] = 0
print(haves)

This gives me all the rows with the one-hot encoded groups set to zero. I now want to get to my wants (see the top) in the most efficient way, ideally without loops, and then score the data using an existing model. I hope this makes sense.

PS:

This is my naïve way of possibly achieving this:

row = 0
for dg1 in dg1_indeces:  
    for dg2 in dg2_indeces:
        haves.loc[row, dg1] = 1
        haves.loc[row, dg2] = 1
        row += 1

PPS:

observation = wants.head(1)
haves = observation.drop(dg1_indeces, axis = 1)
haves = haves.drop(dg2_indeces, axis = 1)
haves = pd.concat([haves]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves

idx = pd.MultiIndex.from_product([dg1_indeces, dg2_indeces]).map('|'.join)
combinations = pd.Series(idx).str.get_dummies('|')

haves = pd.concat([haves, combinations], axis=1)
haves = haves.reindex(columns=observation.columns)
haves

CodePudding user response:

You can build the combinations from scratch with pd.MultiIndex.from_product, or with a cross merge:

s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]
# if s1 and s2 are DataFrames instead: idx = s1.merge(s2, how='cross')
idx = pd.MultiIndex.from_product([s1,s2]).map('|'.join)
pd.Series(idx).str.get_dummies('|')
Out[115]: 
   dg1_1  dg1_2  dg2_1  dg2_2
0      1      0      1      0
1      1      0      0      1
2      0      1      1      0
3      0      1      0      1
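
To get from this table of dummies back to full rows, the combinations can be cross-joined with the non-encoded columns of the observation. The following is a minimal sketch, assuming pandas 1.2 or later (for how='cross'), column names that do not contain '|', and the question's column layout; observation, rest and scored_input are illustrative names, not part of the original answer:

```python
import pandas as pd

# One observation, as in the question (its dg* values get replaced below).
observation = pd.DataFrame({'dg1_1': [1], 'dg1_2': [0],
                            'dg2_1': [0], 'dg2_2': [1],
                            'cont1': [13.0]})

s1 = observation.columns[observation.columns.str.startswith('dg1')]
s2 = observation.columns[observation.columns.str.startswith('dg2')]

# Each (dg1, dg2) pair becomes one row with exactly one hot column per group.
idx = pd.MultiIndex.from_product([s1, s2]).map('|'.join)
combinations = pd.Series(idx).str.get_dummies('|')

# Keep only the non-encoded columns and repeat them for every combination,
# then restore the original column order expected by the model.
rest = observation.drop(columns=s1.union(s2))
scored_input = (rest.merge(combinations, how='cross')
                    .reindex(columns=observation.columns))
print(scored_input)
```

Because get_dummies returns its columns in sorted order, the final reindex is what puts them back in the order of the original frame.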

CodePudding user response:

Let's add a third attribute to the dg2 group and change the cont1 value of the second row to make things less confusing:

data = {'dg1_1':[1, 0],
        'dg1_2':[0, 1],
        'dg2_1':[0, 1],
        'dg2_2':[1, 0],
        'dg2_3':[0, 0],
        'cont1':[13.0, 14.0]}
wants = pd.DataFrame(data)

So now you have 2 groups, one with 2 attributes and one with 3 attributes. Only one attribute can be "hot" per group. If we lay out a 2 x 3 matrix and fill cell (i, j) with (2 ** i, 2 ** j):

   0       1       2
0  (1, 1)  (1, 2)  (1, 4)
1  (2, 1)  (2, 2)  (2, 4)

Then convert the matrix to binary:

   0           1         2
0  (01, 001)  (01, 010)  (01, 100)
1  (10, 001)  (10, 010)  (10, 100)

It essentially satisfies our requirement that only one attribute per group is "hot". If you unravel (i.e. flatten) it:

dg1  dg2
01   001
01   010
01   100
10   001
10   010
10   100

It becomes the list of permutations that you can cross join against every observation.


import numpy as np
import pandas as pd

# Get the columns we are interested in
cols = wants.columns[wants.columns.str.startswith("dg")].to_series()

# shape holds (n1, n2, n3, ...), where n_i is the number of attributes in group i
shape = cols.str.split("_", expand=True).groupby(0).size().to_numpy()
rows = []

# Make the matrix
for i in range(shape.prod()):
    string = ''
    for dim, index in enumerate(np.unravel_index(i, shape)):
        string += bin(2 ** index)[2:].zfill(shape[dim])
    rows.append([int(bit) for bit in string])

permutations = pd.DataFrame(rows, columns=cols)

# Result
wants[["cont1"]].merge(permutations, how="cross")
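
The same "one hot per group" table can also be assembled from identity matrices, since row k of an n x n identity matrix is exactly the one-hot pattern for attribute k. This is a sketch of a different technique from the bit-string loop above, assuming each group is identified by the text before the first underscore in the column name; one_hot_combinations is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def one_hot_combinations(columns):
    """Return every row that has exactly one hot column per prefix group."""
    # Group column names by the text before the first underscore
    # (assumption: that prefix identifies the one-hot group).
    groups = {}
    for col in columns:
        groups.setdefault(col.split('_')[0], []).append(col)

    sizes = [len(members) for members in groups.values()]
    # grid[d][r] is the position of the hot attribute of group d in row r.
    grid = np.indices(sizes).reshape(len(sizes), -1)
    # Indexing an identity matrix by position yields the one-hot rows.
    blocks = [np.eye(n, dtype=int)[hot] for n, hot in zip(sizes, grid)]
    names = [col for members in groups.values() for col in members]
    return pd.DataFrame(np.hstack(blocks), columns=names)

wants = pd.DataFrame({'dg1_1': [1, 0], 'dg1_2': [0, 1],
                      'dg2_1': [0, 1], 'dg2_2': [1, 0], 'dg2_3': [0, 0],
                      'cont1': [13.0, 14.0]})

dg_cols = [col for col in wants.columns if col.startswith('dg')]
permutations = one_hot_combinations(dg_cols)

# Cross join every observation's remaining columns with every combination.
result = wants[['cont1']].merge(permutations, how='cross')
print(result)
```

This avoids the Python-level loop over rows; the only loop left runs once per group.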