I have a Pandas dataframe df
with two columns, A
and B
such that all values in B
would be different for a value in A
, but two different values in A
can have duplicate values in B
.
A B
ABC1 XYZ XYZ 123
ABC2 XYZ XYZ 123
ABC1 XYZ XYZ 135
ABC3 XYZ PQR 123
ABC2 XYZ PQR 123
ABC3 XYZ XYZ 135
ABC4 XYZ XYZ 135
ABC2 XYZ PQR 987
ABC4 XYZ PQR 123
ABC5 PQR PQR 567
How can I create a heatmap from this dataframe that shows me the number of common B
values between any two A
values? I presume I'll have to create a pivot that might look something like this -
ABC1 ABC2 ABC3 ABC4 ABC5
ABC1 2 1 ...
ABC2 1 3
ABC3 0 1 ...
ABC4 0 0 ... ...
ABC5 0 0
Any help on how this can be done efficiently?
CodePudding user response:
I think we can use get_dummies
and np.tensordot
for this.
First, let’s find out for each value of A
which B
are present
>>> common = pd.get_dummies(df['A']).groupby(df['B']).max()
>>> common
Bar Foo
B
2 1 0
3 0 1
4 0 1
5 1 1
6 1 0
9 1 0
Then to compute the intersections of each rows we need the dot product of every pair of rows, for that we can use:
>>> np.tensordot(common, common, (0, 0))
array([[4, 1],
[1, 3]], dtype=uint8)
>>> pd.DataFrame(np.tensordot(common, common, (0, 0)), columns=common.columns, index=common.columns)
Bar Foo
Bar 4 1
Foo 1 3
The diagonal shows the number of unique B
values for each A
column.