Home > Blockchain >  Build matrix of dummy indicators
Build matrix of dummy indicators

Time:02-16

I have a pandas dataframe which looks like the following

team_id skill_id inventor_id
1 A Jack
1 B Jack
1 A Jill
1 B Jill
2 A Jack
2 B Jack
2 A Joe
2 B Joe

So inventors can repeat over teams. I want to turn this data frame into a matrix A (I have included column names below for clarity, they wouldn't form part of the matrix) of dummy indicators, for those example A =

Jack_A Jack_B Jill_A Jill_B Joe_A Joe_B
1 0 1 0 0 0
0 1 0 1 0 0
1 0 0 0 1 0
0 1 0 0 0 1

So that each row corresponds to one (team_id x skill_id combination), and each entry of the matrix is equal to one for that (inventor_id x skill_id) observation.

I tried to create an array of numpy zeros and thought of a double dictionary to map from each (team_id x skill), (inventor_id x skill) combination to an A_ij entry. However I believe this cannot be the most efficient method.

I need the method to be memory efficient as I have 220,000 (inventor x team x skill) observations. (So the dimension of the real df is (220,000, 3), not (8, 3) as in the example.

CodePudding user response:

IIUC, you can use crosstab:

print(
    pd.crosstab(
        index=[df['team_id'],df['skill_id']],
         columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack    Jill    Joe   
# skill_id            A  B    A  B   A  B
# team_id skill_id                       
# 1       A           1  0    1  0   0  0
#         B           0  1    0  1   0  0
# 2       A           1  0    0  0   1  0
#         B           0  1    0  0   0  1

and if you just want the matrix, then uncomment .to_numpy() in the above code.

Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:

pd.crosstab(
    index=[df['team_id'],df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
            [df['team_id'].unique(),df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
            [df['inventor_id'].unique(),df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()

CodePudding user response:

In addition to @Ben.T 's great answer I figured out another which allows me to keep memory efficient.

# Set the identifier for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)

inventor_data["inv_skill_id"] = inventor_data["inventor_id"]      inventor_data["skill_id"]

inventor_data["team_skill_id"] = inventor_data["team_id"]   inventor_data["skill_id"]

# Using DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Change the dict entry from count to 1 
for team_id, team in enumerate(teams):

    teams[team_id] = {v: 1 for k, v in team.items()} 

from sklearn.feature_extraction import DictVectorizer

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
  • Related