I have a pandas dataframe which looks like the following
team_id | skill_id | inventor_id |
---|---|---|
1 | A | Jack |
1 | B | Jack |
1 | A | Jill |
1 | B | Jill |
2 | A | Jack |
2 | B | Jack |
2 | A | Joe |
2 | B | Joe |
So inventors can repeat over teams. I want to turn this data frame into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =
Jack_A | Jack_B | Jill_A | Jill_B | Joe_A | Joe_B |
---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 |
So each row corresponds to one (team_id x skill_id) combination, and each entry of the matrix equals one if that (inventor_id x skill_id) pair appears on the team.
I tried creating an array of numpy zeros and mapping each (team_id x skill), (inventor_id x skill) combination to an A_ij entry via a double dictionary, but I doubt this is the most efficient method.
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations (so the dimension of the real df is (220,000, 3), not (8, 3) as in the example).
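For reference, a minimal construction of the example frame above (a sketch; the column names follow the table, and the real data would of course be loaded from elsewhere):
import pandas as pd

# Minimal reproduction of the example data shown above
df = pd.DataFrame({
    "team_id":     [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id":    ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})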
CodePudding user response:
IIUC, you can use crosstab:
print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack    Jill    Joe
# skill_id            A  B    A  B   A  B
# team_id skill_id
# 1       A           1  0    1  0   0  0
#         B           0  1    0  1   0  0
# 2       A           1  0    0  0   1  0
#         B           0  1    0  0   0  1
and if you just want the matrix, then uncomment .to_numpy() in the above code.
Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:
pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
        [df['team_id'].unique(), df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
        [df['inventor_id'].unique(), df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()
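If downstream code can work with a sparse matrix (the question asks for memory efficiency), you could also convert the result with scipy; a sketch, noting that the crosstab itself is still built as a dense frame first, so this only reduces memory afterwards:
from scipy import sparse

ct = pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
)
A_sparse = sparse.csr_matrix(ct.to_numpy())  # compressed sparse row copy of A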
CodePudding user response:
In addition to @Ben.T's great answer, I figured out another approach that keeps memory usage manageable.
# Set the identifiers for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]

# Using DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Turn each group's dict into a {inv_skill_id: 1} indicator dict
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}

from sklearn.feature_extraction import DictVectorizer

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
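One further note on memory: with sparse=False the output X is a dense numpy array. Leaving sparse=True (the DictVectorizer default) returns a scipy.sparse matrix instead, which should be considerably lighter for 220,000 observations; a sketch:
vectoriser = DictVectorizer(sparse=True)  # default; returns a scipy.sparse matrix
X_sparse = vectoriser.fit_transform(teams)
# Column labels such as 'JackA', 'JillB', ... (recent scikit-learn versions):
# vectoriser.get_feature_names_out()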