Creating rows for all combinations of several one-hot encoded columns, to be scored by a model

I start off with what I want (wants), using this simplified example:

data = {'dg1_1':[1, 0],
        'dg1_2':[0, 1],
        'dg2_1':[0, 1],
        'dg2_2':[1, 0],
        'cont1':[13.0, 13.0]}
wants = pd.DataFrame(data)

I do not really have this frame; it is what I want to generate. I have two one-hot encoded groups, dg1 and dg2. This is obviously simplified, and dg1 and dg2 can contain different numbers of columns. From some observations (a sample) I can also get their column names like this:

dg1_indeces = observations.columns[observations.columns.str.startswith("dg1")]
dg2_indeces = observations.columns[observations.columns.str.startswith("dg2")]

Given one observation, (ab)using my wants to explain:

one_observation = wants.head(1)

I want to create all possible combinations for one_observation, so that in each one-hot encoded group exactly one column is turned on at a time. So I can do:

haves = pd.concat([one_observation]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves.loc[:, dg1_indeces] = 0
haves.loc[:, dg2_indeces] = 0
print(haves)

This gives me all the rows with every one-hot encoded column set to zero. I now want to get from here to all the combinations (in the layout of wants at the top) in the most efficient way, I guess by avoiding loops, and then score the data using an existing model. Hope this makes sense?

PS:

This is my naïve way of possibly achieving this:

row = 0
for dg1 in dg1_indeces:  
    for dg2 in dg2_indeces:
        haves.loc[row, dg1] = 1
        haves.loc[row, dg2] = 1
        row += 1

PPS:

observation = wants.head(1)
haves = observation.drop(dg1_indeces, axis = 1)
haves = haves.drop(dg2_indeces, axis = 1)
haves = pd.concat([haves]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves

idx = pd.MultiIndex.from_product([dg1_indeces, dg2_indeces]).map('|'.join)
combinations = pd.Series(idx).str.get_dummies('|')

haves = pd.concat([haves, combinations], axis=1)
haves = haves.reindex(columns=observation.columns)
haves

CodePudding user response：

You can build it from the bottom up with pd.MultiIndex.from_product, or with a cross merge:

s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]
# if s1 and s2 are DataFrames: idx = s1.merge(s2, how='cross')
idx = pd.MultiIndex.from_product([s1,s2]).map('|'.join)
pd.Series(idx).str.get_dummies('|')
Out[115]: 
   dg1_1  dg1_2  dg2_1  dg2_2
0      1      0      1      0
1      1      0      0      1
2      0      1      1      0
3      0      1      0      1
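
To get from these combinations to a full frame ready for scoring, one possible sketch is to cross join them with the non-encoded columns of a single observation. This assumes pandas 1.2 or later (for how='cross'), that no column name contains '|', and that every dg column belongs to exactly one group. The names combos, rest and scoring are only illustrative and do not come from the question:

```python
import pandas as pd

# Example frame from the question.
df = pd.DataFrame({'dg1_1': [1, 0], 'dg1_2': [0, 1],
                   'dg2_1': [0, 1], 'dg2_2': [1, 0],
                   'cont1': [13.0, 13.0]})

s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]

# One row per (dg1, dg2) pair, one-hot encoded.
idx = pd.MultiIndex.from_product([s1, s2]).map('|'.join)
combos = pd.Series(idx).str.get_dummies('|')

# Keep only the non-encoded columns of a single observation ...
rest = df.head(1).drop(columns=s1.union(s2))

# ... and pair them with every combination, restoring the original column order.
scoring = rest.merge(combos, how='cross').reindex(columns=df.columns)
print(scoring)
```

The resulting frame has the same column order as the original, so it should be usable wherever the model expects that layout.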

CodePudding user response：

Let's add a third attribute to the dg2 group and change the cont1 value of the second row to make things less confusing:

data = {'dg1_1':[1, 0],
        'dg1_2':[0, 1],
        'dg2_1':[0, 1],
        'dg2_2':[1, 0],
        'dg2_3':[0, 0],
        'cont1':[13.0, 14.0]}
wants = pd.DataFrame(data)

So now you have 2 groups, one with 2 attributes and one with 3 attributes. Only one attribute can be "hot" per group. If we lay out a 2 x 3 matrix and fill cell (i, j) with the pair (2 ** i, 2 ** j):

   0       1       2
0  (1, 1)  (1, 2)  (1, 4)
1  (2, 1)  (2, 2)  (2, 4)

Then convert the matrix to binary:

   0           1         2
0  (01, 001)  (01, 010)  (01, 100)
1  (10, 001)  (10, 010)  (10, 100)

It essentially satisfies our requirement that only one attribute per group is "hot". If you unravel (i.e. flatten) it, it becomes the list of permutations that you can cross join against every observation.
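
As a small illustration of the flattening step, the sketch below walks the 2 x 3 matrix in row-major order with np.unravel_index and prints the binary pair for each cell. The variable name flat is only illustrative:

```python
import numpy as np

shape = (2, 3)  # two attributes in dg1, three in dg2

flat = []
for k in range(int(np.prod(shape))):
    # Flat position k -> (row, column) of the 2 x 3 matrix.
    i, j = np.unravel_index(k, shape)
    # 2 ** i and 2 ** j written in binary, padded to the group sizes.
    flat.append((format(2 ** int(i), '02b'), format(2 ** int(j), '03b')))

for pair in flat:
    print(pair)
```

Each printed pair has exactly one "1" per group, which is the one-hot property the rows need.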

import numpy as np
import pandas as pd

# Get the columns we are interested in
cols = wants.columns[wants.columns.str.startswith("dg")].to_series()

# shape is an array (n1, n2, n3, ...) where n_i is the number of attributes in group i
shape = cols.str.split("_", expand=True).groupby(0).size().to_numpy()
rows = []

# Make the matrix
for i in range(shape.prod()):
    string = ''
    for dim, index in enumerate(np.unravel_index(i, shape)):
        # One-hot bit pattern for this group, left-padded to the group size
        string += bin(2 ** index)[2:].zfill(shape[dim])
    rows.append(list(map(int, string)))

permutations = pd.DataFrame(rows, columns=cols)

# Result
wants[["cont1"]].merge(permutations, how="cross")
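
As an alternative sketch that avoids building bit strings, the same set of rows can be assembled from identity matrices with NumPy. This is not the method above, only a variation on it: it assumes the group name is the part of the column name before the first "_", and the names groups, sizes, hot and blocks are illustrative. The rows come out in a different order than with the bit-string approach (index 0 lights the first column of a group rather than the last), but the set of combinations is the same:

```python
import numpy as np
import pandas as pd

wants = pd.DataFrame({'dg1_1': [1, 0], 'dg1_2': [0, 1],
                      'dg2_1': [0, 1], 'dg2_2': [1, 0], 'dg2_3': [0, 0],
                      'cont1': [13.0, 14.0]})

# Collect the one-hot columns per group, keeping their original order.
groups = {}
for col in wants.columns:
    if col.startswith("dg"):
        groups.setdefault(col.split("_")[0], []).append(col)

sizes = [len(cols) for cols in groups.values()]

# hot[g] holds, for every combination, which attribute of group g is "hot".
hot = np.indices(sizes).reshape(len(sizes), -1)

# Picking rows of an identity matrix turns each index into a one-hot row.
blocks = [np.eye(n, dtype=int)[h] for n, h in zip(sizes, hot)]

permutations = pd.DataFrame(
    np.hstack(blocks),
    columns=[col for cols in groups.values() for col in cols],
)

# Pair every observation's continuous value with every combination.
result = wants[["cont1"]].merge(permutations, how="cross")
print(result)
```

With 2 observations and 2 x 3 combinations this yields 12 rows; how="cross" requires pandas 1.2 or later.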