I am trying to replicate this solution, but I get a NumPy error.
As explained in the SO link I posted above, I'd like to populate a multi-index df with the pairwise comparison of each column's values. For example, given the following dataframe:
   name  id
0  john  1a
1  john  1a
2  mary  2b
3  mary  3c
I would like a resulting df whose first 4 rows should be:
      name     id
0 0   True   True
  1   True   True
  2  False  False
  3  False  False
...
import numpy as np
import pandas as pd
# create dummy data
d = {'name': ["john", "john", "mary", "mary"], 'id': ["1a", "1a", "2b", "3c"]}
df = pd.DataFrame(data=d)
# create target df to be populated
result = pd.DataFrame(columns=["name", "id"],
                      index=pd.MultiIndex.from_product([df.index, df.index]))
Up to here, all is good: a df is created with nulls everywhere and a MultiIndex as its index.
But when I run this:
result["name"] = np.equal.outer(result["name"], result["name"]).ravel()
I get this error:
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-38-39cdf1e8944e> in <module>
----> 1 np.equal.outer(result["name"], result["name"]).ravel()
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/arraylike.py in reconstruct(result)
332 warnings.warn(msg.format(ufunc), FutureWarning, stacklevel=4)
333 return result
--> 334 raise NotImplementedError
335 return result
336 if isinstance(result, BlockManager):
NotImplementedError:
If I break the command down to see which part causes the error, it seems to be the outer method:
np.outer(df["name"], df["name"])
yields:
TypeError Traceback (most recent call last)
<ipython-input-40-e949bd4d0a76> in <module>
----> 1 np.outer(df["name"], df["name"])
<__array_function__ internals> in outer(*args, **kwargs)
/usr/local/lib/python3.7/dist-packages/numpy/core/numeric.py in outer(a, b, out)
934 a = asarray(a)
935 b = asarray(b)
--> 936 return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
937
938
TypeError: can't multiply sequence by non-int of type 'str'
Answer:
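To make this work, it seems you need to use df['name'].values rather than just df['name'], so that np.equal.outer receives a plain NumPy array instead of a pandas Series (in this pandas version, calling a ufunc's outer method on a Series raises NotImplementedError). Two other things to watch: the comparison should run on the columns of df, not on the still-empty result, and np.outer is a different function from np.equal.outer. It computes an outer product, which is why it fails on strings with the TypeError shown. So: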
import pandas as pd
import numpy as np
# create dummy data
d = {'name': ["john", "john", "mary", "mary"], 'id': ["1a", "1a", "2b", "3c"]}
df = pd.DataFrame(data=d)
# create target df, indexed by every (row, row) pair
result = pd.DataFrame(columns=["name", "id"],
                      index=pd.MultiIndex.from_product([df.index, df.index]))
# compare each column with itself on the raw NumPy values, then flatten
outer = df.apply(lambda x: np.equal.outer(x.values, x.values).ravel(), axis=0)
result.loc[:, ['name', 'id']] = outer.values
print(result)
      name     id
0 0   True   True
  1   True   True
  2  False  False
  3  False  False
1 0   True   True
  1   True   True
  2  False  False
  3  False  False
2 0  False  False
  1  False  False
  2   True   True
  3   True  False
3 0  False  False
  1  False  False
  2   True  False
  3   True   True
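
As an alternative, here is a minimal sketch that builds the same table with plain NumPy broadcasting instead of np.equal.outer, and skips the empty placeholder frame. It assumes the same dummy data as above; the name pairs is only illustrative and is not part of the original code.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["john", "john", "mary", "mary"],
                   "id": ["1a", "1a", "2b", "3c"]})

# Every (row, row) pair, in the order (0, 0), (0, 1), ..., (3, 3).
pairs = pd.MultiIndex.from_product([df.index, df.index])

# For each column, broadcast a column vector against a row vector to get
# an n x n boolean matrix, then flatten it row by row so that element
# (i, j) lines up with the (i, j) entry of the MultiIndex.
result = pd.DataFrame(
    {col: (df[col].to_numpy()[:, None] == df[col].to_numpy()[None, :]).ravel()
     for col in df.columns},
    index=pairs,
)
print(result)
```

Because the frame is built directly from boolean arrays, the columns come out with a bool dtype rather than object, which may be convenient if you filter or aggregate on them afterwards.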