I'll start with what I want, using this simplified example:
import pandas as pd
data = {'dg1_1':[1, 0],
'dg1_2':[0, 1],
'dg2_1':[0, 1],
'dg2_2':[1, 0],
'cont1':[13.0, 13.0]}
wants = pd.DataFrame(data)
I do not actually have this frame; it is what I want to generate. I have two one-hot encoded groups, dg1 and dg2. This is simplified, and dg1 and dg2 can contain different numbers of columns. From some observations (a sample) I can also get the column names of each group like this:
dg1_indeces = wants.columns[wants.columns.str.startswith("dg1")]
dg2_indeces = wants.columns[wants.columns.str.startswith("dg2")]
Given one observation (ab)using my wants to explain:
one_observation = wants.head(1)
I want to create all possible combinations from one_observation, so that in each one-hot encoded group exactly one column is turned on at a time. So I can do:
haves = pd.concat([one_observation]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves.loc[:, dg1_indeces] = 0
haves.loc[:, dg2_indeces] = 0
print(haves)
This gives me all rows with every one-hot encoded group set to zero. I now want to get from here to my wants (see the top) in the most efficient way, ideally without loops, so that I can then score the data with an existing model. I hope this makes sense.
PS:
This is my naive way of possibly achieving this:
row = 0
for dg1 in dg1_indeces:
    for dg2 in dg2_indeces:
        # Turn on exactly one column per group in the current row
        haves.loc[row, dg1] = 1
        haves.loc[row, dg2] = 1
        row += 1
PPS:
observation = wants.head(1)
haves = observation.drop(dg1_indeces, axis = 1)
haves = haves.drop(dg2_indeces, axis = 1)
haves = pd.concat([haves]*(len(dg1_indeces) * len(dg2_indeces)), ignore_index=True)
haves
idx = pd.MultiIndex.from_product([dg1_indeces, dg2_indeces]).map('|'.join)
combinations = pd.Series(idx).str.get_dummies('|')
haves = pd.concat([haves, combinations], axis=1)
haves = haves.reindex(columns=observation.columns)
haves
CodePudding user response:
You can build it from the bottom up with pd.MultiIndex.from_product, or with merge using how='cross':
s1 = df.columns[df.columns.str.startswith('dg1')]
s2 = df.columns[df.columns.str.startswith('dg2')]
# If s1 and s2 were DataFrames instead: idx = s1.merge(s2, how='cross')
idx = pd.MultiIndex.from_product([s1,s2]).map('|'.join)
pd.Series(idx).str.get_dummies('|')
Out[115]:
   dg1_1  dg1_2  dg2_1  dg2_2
0      1      0      1      0
1      1      0      0      1
2      0      1      1      0
3      0      1      0      1
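The dummy frame above only covers the encoded columns. As a minimal sketch of how it might be attached to the rest of one observation, the block below rebuilds the question's example frame (named `wants` here rather than `df`, which is an assumption about what `df` stands for) and cross-joins the remaining columns with the combinations. `how="cross"` requires pandas 1.2 or later.

```python
import pandas as pd

# Example frame from the question
wants = pd.DataFrame({
    "dg1_1": [1, 0],
    "dg1_2": [0, 1],
    "dg2_1": [0, 1],
    "dg2_2": [1, 0],
    "cont1": [13.0, 13.0],
})

s1 = wants.columns[wants.columns.str.startswith("dg1")]
s2 = wants.columns[wants.columns.str.startswith("dg2")]

# One row per (dg1, dg2) pair, with exactly one hot column in each group
idx = pd.MultiIndex.from_product([s1, s2]).map("|".join)
combinations = pd.Series(idx).str.get_dummies("|")

# Keep only the non-encoded columns of a single observation
rest = wants.head(1).drop(columns=s1.union(s2))

# Pair that single row with every combination, then restore the column order
result = rest.merge(combinations, how="cross").reindex(columns=wants.columns)
print(result)
```

The final reindex is only there to put the columns back in the order of the original frame.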
CodePudding user response:
Let's add a third attribute to the dg2 group and change the cont1 value of the second row to make things less confusing:
import numpy as np
import pandas as pd
data = {'dg1_1':[1, 0],
'dg1_2':[0, 1],
'dg2_1':[0, 1],
'dg2_2':[1, 0],
'dg2_3':[0, 0],
'cont1':[13.0, 14.0]}
wants = pd.DataFrame(data)
So now you have 2 groups, one with 2 attributes and one with 3 attributes. Only one attribute can be "hot" per group. If we lay out a 2 x 3 matrix and fill cell (i, j) with the pair (2 ** i, 2 ** j):
        0       1       2
0  (1, 1)  (1, 2)  (1, 4)
1  (2, 1)  (2, 2)  (2, 4)
Then convert the matrix to binary:
           0          1          2
0  (01, 001)  (01, 010)  (01, 100)
1  (10, 001)  (10, 010)  (10, 100)
It essentially satisfies our requirement that only one attribute per group is "hot". If you unravel (i.e. flatten) it:
dg1  dg2
 01  001
 01  010
 01  100
 10  001
 10  010
 10  100
It becomes the list of permutations that you can cross join against every observation.
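To make the encoding concrete, here is a small sketch that reproduces the flattened table for the 2 x 3 case. It relies only on `np.unravel_index` and string formatting; the names `shape` and `encoded` are illustrative.

```python
import numpy as np

shape = (2, 3)  # dg1 has 2 columns, dg2 has 3

encoded = []
for flat in range(int(np.prod(shape))):
    # Position of the flat counter inside the 2 x 3 grid
    indices = np.unravel_index(flat, shape)
    # 2 ** index has a single set bit; zfill pads it to the group's width
    bits = [format(2 ** int(i), "b").zfill(n) for i, n in zip(indices, shape)]
    encoded.append(bits)
    print(flat, bits)
```

Each bit string contains exactly one "1", which is why every row keeps a single hot attribute per group.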
# Get the columns we are interested in
cols = wants.columns[wants.columns.str.startswith("dg")].to_series()
# shape is an (n1, n2, n3, ...) array where n_i is the number of attributes per group
shape = cols.str.split("_", expand=True).groupby(0).size().to_numpy()
rows = []
# Make the matrix
for i in range(shape.prod()):
    string = ''
    # Append one bit string per group; each has exactly one bit set
    for dim, index in enumerate(np.unravel_index(i, shape)):
        string += bin(2 ** index)[2:].zfill(shape[dim])
    rows.append([int(bit) for bit in string])
permutations = pd.DataFrame(rows, columns=cols)
# Result
wants[["cont1"]].merge(permutations, how="cross")
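As a possible alternative that skips the bit strings, each group can be represented by an identity matrix and the groups cross-joined. This is a sketch rather than the method above, and it assumes the column names follow the `<group>_<level>` pattern from the example; the names `groups`, `blocks`, and `combos` are illustrative.

```python
from functools import reduce

import numpy as np
import pandas as pd

wants = pd.DataFrame({
    "dg1_1": [1, 0],
    "dg1_2": [0, 1],
    "dg2_1": [0, 1],
    "dg2_2": [1, 0],
    "dg2_3": [0, 0],
    "cont1": [13.0, 14.0],
})

dg_cols = wants.columns[wants.columns.str.startswith("dg")]

# Group the encoded columns by the prefix before the last underscore
# (assumes names follow the "<group>_<level>" pattern used above)
groups = {}
for col in dg_cols:
    groups.setdefault(col.rsplit("_", 1)[0], []).append(col)

# An identity matrix is exactly "one hot column per row" for a single group
blocks = [
    pd.DataFrame(np.eye(len(cols), dtype=int), columns=cols)
    for cols in groups.values()
]

# Cross-joining the blocks yields every combination across groups
combos = reduce(lambda left, right: left.merge(right, how="cross"), blocks)

# Pair every observation's remaining columns with every combination
result = wants.drop(columns=dg_cols).merge(combos, how="cross")
print(result)
```

With 2 observations and 2 x 3 combinations this gives 12 rows, and the same code should extend to more groups without changes.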