Find intersections in all dataset rows

Time:06-21

I need to write a function.

It takes any value from the dataset as input and should look for an intersection in all rows.

For example: phone = 87778885566

The table is represented by the following fields:

  1. key
  2. id
  3. phone
  4. email

Test data:

The output should be:

It should check all values and, if any of them intersect, return a dataset containing the intersecting rows.

CodePudding user response:

I couldn't tell from your description how a phone number is supposed to intersect with the key or email columns, so I wrote two functions:

import pandas as pd

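# Pass the DataFrame and a list of the keywords you want to look for.
def get_intersections(data: pd.DataFrame, kw: list):
    values = data.to_numpy()
    intersected_data = []
    for i in values:
        # keep the row if any of its cells equals one of the keywords
        if set(kw).intersection(i):
            intersected_data.append(tuple(i))
    # build the result from the list (the DataFrame constructor rejects a set) and drop repeated rows
    return pd.DataFrame(intersected_data, columns=data.columns).drop_duplicates()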

df >>
  key     id        phone            email
0   1  12345  89997776655   [email protected]
1   2  54321  87778885566    [email protected]
2   3  98765  87776664577  [email protected]
3   4  66678  87778885566   [email protected]
4   5  34567  84547895566   [email protected]
5   6  34567  89087545678   [email protected]


get_intersections(df,['87778885566','[email protected]']).sort_values(by='key').reset_index(drop=True)
>>
  key     id        phone           email
0   2  54321  87778885566   [email protected]
1   4  66678  87778885566  [email protected]
2   5  34567  84547895566  [email protected]
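
The keyword search above can also be sketched without the explicit Python loop. This is not the answerer's code: `rows_containing` and the `sample` frame are hypothetical names, the e-mail addresses are placeholders because the originals are masked in the post, and every cell is compared as text on the assumption that a string keyword should also match a numeric column.

```python
import pandas as pd

def rows_containing(data: pd.DataFrame, keywords: list) -> pd.DataFrame:
    """Return the rows of `data` in which any cell equals one of `keywords`."""
    # Compare everything as text so '87778885566' also matches an integer column.
    wanted = [str(k) for k in keywords]
    mask = data.astype(str).isin(wanted).any(axis=1)
    return data[mask]

# Hypothetical sample: ids and phones follow the post, e-mails are placeholders.
sample = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "id": [12345, 54321, 98765, 66678, 34567, 34567],
    "phone": ["89997776655", "87778885566", "87776664577",
              "87778885566", "84547895566", "89087545678"],
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "d@example.com", "e@example.com"],
})

hits = rows_containing(sample, ["87778885566", "d@example.com"])
print(hits)
```

Because the boolean mask keeps the original row order, no `sort_values` call is needed afterwards.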

The other function searches for intersections row by row across the whole dataframe:

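def get_intersections(data):
    values = data.to_numpy()
    intersected_data = []
    for i in values:
        for j in values:
            # skip a row compared with itself, keep it if it shares a value with another row
            if set(i) != set(j) and set(i).intersection(j):
                intersected_data.append(tuple(i))
    # build the result from the list (the DataFrame constructor rejects a set) and drop repeated rows
    return pd.DataFrame(intersected_data, columns=data.columns).drop_duplicates()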

get_intersections(df).sort_values(by='key').reset_index(drop=True)
>>
  key     id        phone           email
0   2  54321  87778885566   [email protected]
1   4  66678  87778885566  [email protected]
2   5  34567  84547895566  [email protected]
3   6  34567  89087545678  [email protected]
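
The nested loop above compares every row with every other row, which grows quadratically with the table size. A rough alternative, assuming a shared value only counts when it appears in the same column of both rows (a narrower rule than the loop applies), is to flag duplicated values column by column. `rows_sharing_values` and `sample` are hypothetical names, and the e-mail addresses are placeholders because the originals are masked in the post.

```python
import pandas as pd

def rows_sharing_values(data: pd.DataFrame,
                        columns=("id", "phone", "email")) -> pd.DataFrame:
    """Return rows whose value in any of `columns` also occurs in another row."""
    mask = pd.Series(False, index=data.index)
    for col in columns:
        # keep=False marks every occurrence of a repeated value, not only the later ones
        mask |= data[col].duplicated(keep=False)
    return data[mask]

# Hypothetical sample: ids and phones follow the post, e-mails are placeholders.
sample = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "id": [12345, 54321, 98765, 66678, 34567, 34567],
    "phone": ["89997776655", "87778885566", "87776664577",
              "87778885566", "84547895566", "89087545678"],
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "d@example.com", "e@example.com"],
})

shared = rows_sharing_values(sample)
print(shared)
```

The `key` column is left out of the default `columns` on the assumption that it is a unique row label and never links two rows.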

CodePudding user response:

You could use recursion:

import numpy as np

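def relation(dat, values):
    # mark every cell whose value is already in the known set
    d = dat.apply(lambda x: x.isin(values.ravel()))
    # keep each row that contains at least one marked cell
    values1 = dat.iloc[np.unique(np.where(d)[0]),:]
    # stop once the matched rows contribute no new values
    if set(np.array(values)) == set(values1.to_numpy().ravel()):
        return values1
    else:
        # otherwise search again with every value found in the matched rows
        return relation(dat, values1.to_numpy().ravel())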

relation(df.astype(str), np.array(['87778885566']))

       1            2               3
1  54321  87778885566   [email protected]
3  66678  87778885566  [email protected]
4  34567  84547895566  [email protected]
5  34567  89087545678  [email protected]
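
The same chain-following idea can be sketched with a loop instead of recursion, which avoids Python's recursion limit on long chains of linked rows. This is a hedged variant rather than the answerer's code: `related_rows` and `sample` are hypothetical names, the e-mail addresses are placeholders because the originals are masked in the post, and the `key` column is excluded from matching on the assumption that it is only a row label.

```python
import pandas as pd

def related_rows(data: pd.DataFrame, start_values,
                 columns=("id", "phone", "email")) -> pd.DataFrame:
    """Follow shared values from `start_values` until no new rows are reached."""
    as_text = data[list(columns)].astype(str)   # compare every cell as text
    known = {str(v) for v in start_values}
    while True:
        # rows that contain at least one already-known value
        mask = as_text.isin(list(known)).any(axis=1)
        found = set(as_text[mask].to_numpy().ravel())
        if found <= known:                      # nothing new: the set is closed
            return data[mask]
        known |= found

# Hypothetical sample: ids and phones follow the post, e-mails are placeholders.
sample = pd.DataFrame({
    "key": [1, 2, 3, 4, 5, 6],
    "id": [12345, 54321, 98765, 66678, 34567, 34567],
    "phone": ["89997776655", "87778885566", "87776664577",
              "87778885566", "84547895566", "89087545678"],
    "email": ["a@example.com", "b@example.com", "c@example.com",
              "d@example.com", "d@example.com", "e@example.com"],
})

linked = related_rows(sample, ["87778885566"])
print(linked)
```

With this placeholder data the phone number reaches keys 2 and 4 directly, key 5 through the e-mail it shares with key 4, and key 6 through the id it shares with key 5.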