I have a pandas dataframe which looks like the following
team_id | skill_id | inventor_id |
---|---|---|
1 | A | Jack |
1 | B | Jack |
1 | A | Jill |
1 | B | Jill |
2 | A | Jack |
2 | B | Jack |
2 | A | Joe |
2 | B | Joe |
So inventors can repeat over teams. I want to turn this data frame into a matrix A of dummy indicators (I have included column names below for clarity; they wouldn't form part of the matrix). For this example, A =
Jack_A | Jack_B | Jill_A | Jill_B | Joe_A | Joe_B |
---|---|---|---|---|---|
1 | 0 | 1 | 0 | 0 | 0 |
0 | 1 | 0 | 1 | 0 | 0 |
1 | 0 | 0 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 |
So each row corresponds to one (team_id x skill_id) combination, and each entry of the matrix equals one if that (inventor_id x skill_id) pair appears on the team.
I tried creating an array of numpy zeros and mapping each (team_id x skill), (inventor_id x skill) combination to an A_ij entry via a double dictionary, but I doubt this is the most efficient method.
I need the method to be memory efficient, as I have 220,000 (inventor x team x skill) observations (so the dimension of the real df is (220,000, 3), not (8, 3) as in the example).
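For reference, a minimal construction of the example frame above (a sketch; the column names follow the table, and the real data would of course be loaded from elsewhere):
import pandas as pd

# Minimal reproduction of the example data shown above
df = pd.DataFrame({
    "team_id":     [1, 1, 1, 1, 2, 2, 2, 2],
    "skill_id":    ["A", "B", "A", "B", "A", "B", "A", "B"],
    "inventor_id": ["Jack", "Jack", "Jill", "Jill", "Jack", "Jack", "Joe", "Joe"],
})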
CodePudding user response:
IIUC, you can use crosstab:
print(
    pd.crosstab(
        index=[df['team_id'], df['skill_id']],
        columns=[df['inventor_id'], df['skill_id']]
    )#.to_numpy()
)
# inventor_id      Jack    Jill    Joe
# skill_id            A  B    A  B   A  B
# team_id skill_id
# 1       A           1  0    1  0   0  0
#         B           0  1    0  1   0  0
# 2       A           1  0    0  0   1  0
#         B           0  1    0  0   0  1
and if you just want the matrix, then uncomment .to_numpy() in the above code.
Note: if you have some skills that are not shared between teams or inventors, you may need to reindex with all the possibilities, so do:
pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
).reindex(
    index=pd.MultiIndex.from_product(
        [df['team_id'].unique(), df['skill_id'].unique()]),
    columns=pd.MultiIndex.from_product(
        [df['inventor_id'].unique(), df['skill_id'].unique()]),
    fill_value=0
)#.to_numpy()
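If downstream code can work with a sparse matrix (the question asks for memory efficiency), you could also convert the result with scipy; a sketch, noting that the crosstab itself is still built as a dense frame first, so this only reduces memory afterwards:
from scipy import sparse

ct = pd.crosstab(
    index=[df['team_id'], df['skill_id']],
    columns=[df['inventor_id'], df['skill_id']]
)
A_sparse = sparse.csr_matrix(ct.to_numpy())  # compressed sparse row copy of A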
CodePudding user response:
In addition to @Ben.T's great answer, I figured out another approach that keeps memory usage manageable.
# Set the identifiers for each row
inventor_data["team_id"] = inventor_data["team_id"].astype(str)
inventor_data["inv_skill_id"] = inventor_data["inventor_id"] + inventor_data["skill_id"]
inventor_data["team_skill_id"] = inventor_data["team_id"] + inventor_data["skill_id"]

# Using DictVectorizer requires a dictionary input
teams = list(inventor_data.groupby('team_skill_id')['inv_skill_id'].agg(dict))

# Turn each group's dict into a {inv_skill_id: 1} indicator dict
for team_id, team in enumerate(teams):
    teams[team_id] = {v: 1 for k, v in team.items()}

from sklearn.feature_extraction import DictVectorizer

vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(teams)
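One further note on memory: with sparse=False the output X is a dense numpy array. Leaving sparse=True (the DictVectorizer default) returns a scipy.sparse matrix instead, which should be considerably lighter for 220,000 observations; a sketch:
vectoriser = DictVectorizer(sparse=True)  # default; returns a scipy.sparse matrix
X_sparse = vectoriser.fit_transform(teams)
# Column labels such as 'JackA', 'JillB', ... (recent scikit-learn versions):
# vectoriser.get_feature_names_out()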