Home > Net >  filter on value set intersection within a group
filter on value set intersection within a group

Time:07-15

Let's say I have a dataframe as follows:

Group  |  Source  |  Name
___________________________
  A    |    X     |  Jolly
  A    |    X     |  Stone
  A    |    X     |  Jolly
  A    |    Y     |  Sand
  B    |    X     |  Sand
  B    |    X     |  Stone
  B    |    Y     |  Stone
  C    |    X     |  Sand
  C    |    X     |  Stone

I want to find all Groups where each group of Sources share no common Names. Essentially in the example above, I want Group A as all groupings of Source (X and Y) Names share no common values. For this example we can assume there will only be 2 Sources (X and Y) and not all Groups have more than 1 Source. I am only interested in Groups with both Sources X and Y and no Name intersection.

The resulting DataFrame should look like this:

Group  |  Source  |  Name
___________________________
  A    |    X     |  Jolly
  A    |    X     |  Stone
  A    |    X     |  Jolly
  A    |    Y     |  Sand

I have tried, doing a grouby on Group then supplied a function to the chained filter method like so:

def find_no_intersection(df):
    return (
        len(df[df.Source == 'X'].Name.values) > 0 and 
        len(df[df.Source == 'Y'].Name.values) > 0 and 
        (
            len(
                set(df[df.Source == 'X'].Name.values) & 
                set(df[df.Source == 'Y'].Name.values)
            ) == 0
        )
    )

df.groupby(['Group']).filter(find_no_intersection)

Is this the right way? Is there a better way?

CodePudding user response:

Here is a way using nunique()

df.loc[df.groupby('Group')['Name'].transform(lambda x: x.size == x.nunique())]

Output:

  Group Source   Name
0     A      X  Jolly
1     A      X  Stone
2     A      Y   Sand

Update to answer:

(df.loc[
    df['Group'].map(
        df.groupby(['Group','Source'])['Name']
        .agg(set)
        .groupby(level=0)
        .agg(lambda x: len(set.intersection(*x))==0))
        ])

Output:

  Group Source   Name
0     A      X  Jolly
1     A      X  Stone
2     A      X  Jolly
3     A      Y   Sand

CodePudding user response:

If I understand correctly, you can do that with the following.

df[~df['Group'].isin(df[df[['Source','Name']].duplicated()]['Group'])]
  • Related