pairwise comparison of rows in pandas DataFrame

Time:08-18

I am trying to replicate this solution, but I get a NumPy error.

As explained in the SO link I posted above, I'd like to populate a multi-index df with the pairwise comparison of rows, computed column by column. For example, given the following dataframe:

    name    id
0   john    1a
1   john    1a
2   mary    2b
3   mary    3c

I would like a resulting df whose first 4 rows are:

        name    id
0   0   True    True
    1   True    True
    2   False   False
    3   False   False
...
import numpy as np
import pandas as pd

# create dummy data
d = {'name': ["john", "john", "mary", "mary"], 'id': ["1a", "1a", "2b", "3c"]}
df = pd.DataFrame(data=d)

# create target df to be populated
result = pd.DataFrame(columns=["name", "id"],
                      index=pd.MultiIndex.from_product([df.index, df.index]))

Up to here all is good: a df is created with nulls everywhere and a multi-index.

But when I run this:

result["name"] = np.equal.outer(result["name"], result["name"]).ravel()

I get this error:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-38-39cdf1e8944e> in <module>
----> 1 np.equal.outer(result["name"], result["name"]).ravel()

2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/arraylike.py in reconstruct(result)
    332                     warnings.warn(msg.format(ufunc), FutureWarning, stacklevel=4)
    333                     return result
--> 334                 raise NotImplementedError
    335             return result
    336         if isinstance(result, BlockManager):

NotImplementedError: 

If I break the command down to see which part causes the error, it seems to be the outer method:

np.outer(df["name"], df["name"])

yields:

TypeError                                 Traceback (most recent call last)
<ipython-input-40-e949bd4d0a76> in <module>
----> 1 np.outer(df["name"], df["name"])

<__array_function__ internals> in outer(*args, **kwargs)

/usr/local/lib/python3.7/dist-packages/numpy/core/numeric.py in outer(a, b, out)
    934     a = asarray(a)
    935     b = asarray(b)
--> 936     return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis, :], out)
    937 
    938 

TypeError: can't multiply sequence by non-int of type 'str'
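A side note on this check: np.outer and np.equal.outer are different operations. np.outer computes the multiplication outer product, as the multiply(...) call in the traceback shows, so it cannot work on strings, while np.equal.outer applies an equality test to every pair of elements. The following is a minimal sketch of the difference on a plain NumPy array. The object dtype is an assumption chosen to mimic the array behind a pandas string column.

```python
import numpy as np

# Object-dtype array, assumed here to mimic the data behind a pandas string column
names = np.array(["john", "john", "mary", "mary"], dtype=object)

# np.outer multiplies every pair of elements, which is undefined for str * str
try:
    np.outer(names, names)
    outer_failed = False
except TypeError:
    outer_failed = True

# np.equal.outer tests every pair of elements for equality instead
pairwise_equal = np.equal.outer(names, names)

print(outer_failed)
print(pairwise_equal)
```

So the TypeError from np.outer says nothing about np.equal.outer itself, which runs without error once it receives a NumPy array.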

CodePudding user response:

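To make this work, it seems you need to pass the underlying NumPy array, df['name'].values, rather than the Series df['name']. The traceback shows the NotImplementedError being raised inside pandas (pandas/core/arraylike.py) while it tries to wrap the ufunc result back into a pandas object, and passing a plain array avoids that step. Note also that the comparison has to run on df, which holds the data, not on the still-empty result. So: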

import pandas as pd
import numpy as np

d = {'name': ["john", "john", "mary", "mary"], 'id': ["1a", "1a", "2b", "3c"]}
df = pd.DataFrame(data=d)

# empty target frame indexed by every (row, row) pair
result = pd.DataFrame(columns=["name", "id"],
                      index=pd.MultiIndex.from_product([df.index, df.index]))

# for each column, compare every value with every other value on the raw arrays
outer = df.apply(lambda x: np.equal.outer(x.values, x.values).ravel(), axis=0)

result.loc[:, ['name', 'id']] = outer.values
print(result)

      name     id
0 0   True   True
  1   True   True
  2  False  False
  3  False  False
1 0   True   True
  1   True   True
  2  False  False
  3  False  False
2 0  False  False
  1  False  False
  2   True   True
  3   True  False
3 0  False  False
  1  False  False
  2   True  False
  3   True   True
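As a possible variation on the same idea, the result can be built in one step instead of creating an empty frame and filling it afterwards. This is only a sketch, not the approach from the answer above: the explicit loop, the `columns` dict and the `pairs` name are illustrative choices, and `to_numpy(dtype=object)` is used in place of `.values` on the assumption that an object array is wanted regardless of the column's pandas dtype.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["john", "john", "mary", "mary"],
                   "id": ["1a", "1a", "2b", "3c"]})

# All ordered (i, j) pairs of row labels, in the same row-major order as ravel()
pairs = pd.MultiIndex.from_product([df.index, df.index])

# Compare each column with itself pairwise, working on plain object arrays
columns = {}
for col in df.columns:
    values = df[col].to_numpy(dtype=object)
    columns[col] = np.equal.outer(values, values).ravel().astype(bool)

result = pd.DataFrame(columns, index=pairs)
print(result)
```

Building the frame from the flattened arrays gives boolean columns directly, whereas assigning into a pre-built empty frame may leave them with object dtype. The intermediate array has n * n entries per column, so this only suits frames of moderate size.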