Home > Enterprise >  Pandas - find rows sharing two out the three common values, order-independent, and collect values pa
Pandas - find rows sharing two out the three common values, order-independent, and collect values pa

Time:12-24

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, hence order, in which they appear. I would like to then collect those common pairs. Please note

  • a couple of values can appear at most in two rows
  • a value can appear only once in a row

I would like to know what the most efficient/elegant way is in numpy or pandas to solve this problem.

For example, taking as input the dataframe

d = {'col1': [1, 2,5,1], 'col2': [1, 7,1,2],'col3': [3, 3,1,7]}
df = pd.DataFrame(data=d)
    col1    col2    col3
0   1   2   3
1   2   7   3
2   5   1   2
3   9   2   7

I expect as result an array, list, something as

1  2
2  3
2  7

as the values (1,2) , (2,3) and (2,7) are present in two rows (first and third, first and second, and second and forth respectively).

I cannot find a concise solution. At the moment I skecthed a numpy solution such as

def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    for i in range(0,rows):
        for j in range(i 1, rows):
            aux = np.intersect1d(x[i,:], x[j,:])
            if aux.size>1:
                res.append(aux)
    return res

which outputs

func(df.values)

Out: [array([2, 3]), array([1, 2]), array([2, 7])]

It looks well cumbersome, how could get it done with one of those cool numpy/pandas one-liners?

CodePudding user response:

I would suggest using python built in set operations to do most of the heavy lifting, just apply them with pandas:

import itertools
import pandas as pd

d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)

pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])

Output:

{(2, 3), (1, 2), (2, 7)}

Optionally to get it in list[np.ndarray] format:

out = list(map(np.array, out))

CodePudding user response:

Similar approach to that of @Chrysophylaxs but in pure python:

from itertools import combinations
from collections import Counter

c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))

out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
  • Related