Home > Software design >  return a match more than 2 numbers that match side by side
return a match more than 2 numbers that match side by side

Time:01-01

I have data frame called df1 with 1 row with 6 numbers & another data frame called df2 with more than 500 rows with Ids and 6 numbers. In df1 I would like to lookup 6 numbers & find them in df2 and only return the matching numbers that's match more than 2 numbers side by side. It can be any 6 numbers in df1 as long it matches more than 2 numbers side by side. I created a small example below,

import pandas as pd 

df1 = pd.DataFrame([[2,4,6,8,9,10]], columns = 
['Num1','Num2','Num3','Num4','Num5','Num6'])


df2    = pd.DataFrame([[100,1,2,4,5,6,8],
                       [87,1,6,20,22,23,34],
                       [99,1,12,13,34,45,46],
                       [64,1,10,14,29,32,33],
                       [55,1,22,13,23,33,35],
                       [66,1,6,7,8,9,10],
                       [77,1,2,3,5,6,8],
                       [811,1,2,5,6,8,10], 
                       [118,1,7,8,22,44,56],
                       [117,1,66,44,47,87,91],
                       [299,2,4,7,20,21,22],
                       [187,3,6,10,12,25,39],
                       [199,4,12,24,34,56,57],
                       [264,3,7,8,9,10,33],
                       [50,6,8,10,23,33,35],
                       [212,4,6,12,18,19,20],
                       [45,3,7,23,35,56,88],
                       [801,1,2,4,6,28,39], 
                       [258,2,3,4,9,10,41],
                       [220,5,6,10,27,57,81]],
                       columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6'])

i would like my result to like this below.

result = pd.DataFrame([[66,1,6,7,8,9,10],
                        [811,1,2,5,6,8,10], 
                        [264,3,7,8,9,10,33],
                        [50,6,8,10,23,33,35],
                        [801,1,2,4,6,28,39], 
                        [258,2,3,4,9,10,41],
                        [220,4,6,10,27,57,81]],
                        columns = ['Id', 'Num1','Num2','Num3','Num4','Num5','Num6']) 

Why these numbers. Because the numbers match more than 2 numbers side by side

66, 8,9,10
811, 6,8,10         
264, 8,9,10         
50, 6,8,10
801, 2,4,6
258, 4,9,10
220, 4,6,10    

I also tried this code below but it only returns a match that has more than 2 but not side by side. Hopefully I'm making sense.

vals_to_find = set(df1.iloc[0])
mask = df2.loc[:, "Num1":].apply(lambda x: 
len(vals_to_find.intersection(x)) > 2, axis=1)
print(df2[mask])        

CodePudding user response:

I think this does exactly what you want:

import numpy as np
import pandas as pd

# Save the numbers in df1 in a list as their order does not matter
key = df1.T[0].to_list()

# Check to see how many of those numbers are in df2
temp = df2[df2.isin(key)]

I wrote a custom function that takes each row and checks if there are more than 2 consecutive numbers side by side. It uses an answer by jezrael which you can find here:

def side_counter(Num1, Num2, Num3, Num4, Num5, Num6):
    row = pd.Series([Num1, Num2, Num3, Num4, Num5, Num6])
    row = row.astype(float)
    m = row.isnull()
    s = m.cumsum()
    x = s.map(s[~m].value_counts()).ge(3) & ~m
    has_side = x.sum() # if the row has more than two side by side return more than 0
    if has_side == 0:
        return False
    else:
        return True

# Apply the method and save the result in a new column as boolean
df2["has_2sidebyside"] = temp.apply(lambda x: side_counter(x['Num1'], x["Num2"], x["Num3"], x["Num4"], x["Num5"], x["Num6"]), axis=1)

# Mask the dataframe based on the boolean column
result = df2[df2["has_2sidebyside"]==True].drop("has_2sidebyside", axis=1)
result

Id  Num1 Num2 Num3 Num4 Num5 Num6
5   66  1   6   7   8   9   10
7   811 1   2   5   6   8   10
13  264 3   7   8   9   10  33
14  50  6   8   10  23  33  35
17  801 1   2   4   6   28  39
18  258 2   3   4   9   10  41

CodePudding user response:

Here is a way:

(df2.loc[df2.loc[:,'Num1':].isin(np.ravel(df1))
.apply(lambda x: x.diff().ne(0).cumsum().where(x).value_counts(),axis=1)
.eq(3).any(axis=1)])

Output:

     Id  Num1  Num2  Num3  Num4  Num5  Num6
5    66     1     6     7     8     9    10
7   811     1     2     5     6     8    10
13  264     3     7     8     9    10    33
14   50     6     8    10    23    33    35
17  801     1     2     4     6    28    39
18  258     2     3     4     9    10    41
  • Related