I have dataframe (df table below): Every user can post in any category. I have to calculate HOW MANY DISTINCT USERS has a post in category A and at the same time has posts in categories, B, C and D. Table like:
User | Category |
---|---|
1 | A |
1 | B |
33 | B |
33 | C |
33 | D |
54 | A |
54 | B |
87 | A |
87 | B |
87 | C |
87 | D |
Total table is 950,000 rows It either gives Nan results in pivot (variant 1) or memory overflow (variant 2). Necessary result is
Category | A | B | C | D |
---|---|---|---|---|
A | 3 | 3 | 1 | 1 |
B | 3 | 4 | 2 | 1 |
C | 1 | 2 | 2 | 2 |
D | 1 | 1 | 2 | 2 |
Diagonal: A-B-C-D - total posted in each category, lower and upper triangles are the same and are intersections of how many users posted in, for instance, B and C.
I have tried
df.pivot_table(values=['user','category'],index=['category'], columns=['category'],aggfunc=np.count_nonzero)
a = df.values
d = {(i, j): np.mean(a[:, i] == a[:, j]) for i, j in combinations(range(a.shape[1]), 2)}
res, c, vals = np.zeros((a.shape[1], a.shape[1])),
list(map(list, zip(*d.keys()))), list(d.values())
res[c[0], c[1]] = vals
res_df = pd.DataFrame(res, columns=df.columns, index=df.columns)
Would yo please help with ideas. Thank you.
CodePudding user response:
One idea is use merge
with crosstab
like:
df = df.merge(df, on='User')
df = pd.crosstab(df['Category_x'], df['Category_y'])
print (df)
Category_y A B C D
Category_x
A 3 3 1 1
B 3 4 2 2
C 1 2 2 2
D 1 2 2 2