How to find and return rows of a pandas dataframe with unique values?

Say I have the following dataframe:

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), 
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), 
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

I am looking for a way to create a new dataframe, or even a list, that tells me the non-unique combinations of values between 'Name' and 'Actor', i.e. the rows where the same 'Name' appears with more than one distinct 'Actor'.

In this example, I would like to get as a result:

Stranger Things, 3, Millie
Stranger Things, 3, Todd

I have tried the sort(), unique(), and distinct() methods without success. unique() always seems to drop the column that I am not querying on (in this case, 'Seasons').

Any help is appreciated!

CodePudding user response:

Do you need groupby with nunique?

df[df.groupby('Name')['Actor'].transform('nunique').gt(1)]

              Name  Seasons   Actor
0  Stranger Things        3  Millie
4  Stranger Things        3    Todd
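
To see what the one-liner does, it may help to split it into steps. This is only a sketch that rebuilds the dataframe from the question; the names `actors_per_show`, `mask` and `result` are made up for illustration.

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# For every row, the number of distinct actors within that row's Name group.
# transform() returns a Series aligned with df's index (one value per row).
actors_per_show = df.groupby('Name')['Actor'].transform('nunique')

# True only for rows whose show has more than one distinct actor.
mask = actors_per_show.gt(1)

# Boolean indexing keeps all columns, including Seasons.
result = df[mask]
print(result)
```

'La Casa De Papel' is not returned because both of its rows have the same actor, so its group has only one distinct value.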

CodePudding user response:

This will return a dataframe with those two rows that you show in your post:

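# Count the distinct actors per show (repeated Name/Actor pairs are dropped first)
actor_cts = df.drop_duplicates(subset=['Name','Actor']).groupby("Name")['Actor'].count()
# Keep only the rows whose show has more than one distinct actor
df[df.Name.isin(actor_cts[actor_cts > 1].index)].reset_index(drop=True)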

#              Name Seasons  Actor
# 0 Stranger Things       3 Millie
# 1 Stranger Things       3   Todd
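
The question also mentions that a list would be fine. As a rough sketch (the names `result` and `rows` are just illustrative), the filtered dataframe can be turned into a list of plain tuples with `itertuples`:

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Same filtering as in this answer: shows that appear with more than one distinct actor.
actor_cts = df.drop_duplicates(subset=['Name', 'Actor']).groupby('Name')['Actor'].count()
result = df[df['Name'].isin(actor_cts[actor_cts > 1].index)].reset_index(drop=True)

# One tuple per row; index=False skips the row label and
# name=None yields regular tuples instead of namedtuples.
rows = list(result.itertuples(index=False, name=None))
print(rows)
```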

CodePudding user response:

This is what I came up with based on your code. You could change the check_if_similar function to go through all of the dataframe's keys too, but unfortunately I don't have the time to put all of that together. If you have any questions, feel free to ask me.

import pandas as pd


def check_if_similar(things):
    # Collect index pairs [num1, num2] (with num1 > num2) whose entries are equal
    similar = []
    num1 = 0
    for _ in things:
        num2 = 0
        for _ in things:
            if num1 != num2:
                if things[num1] == things[num2] and num1 > num2:
                    similar.append([num1, num2])
            num2 += 1
        num1 += 1
    return similar


series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]

# Create a DataFrame object
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

info = check_if_similar(df['Name'].to_list())

for coords in info:
    # Same Name but different rows: print both rows; identical rows: join them with " == "
    if str(df.values[coords[0]]) != str(df.values[coords[1]]):
        print(str(df.values[coords[0]]) + "\n" + str(df.values[coords[1]]))
    else:
        print(str(df.values[coords[0]]) + " == " + str(df.values[coords[1]]))
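
For comparison, the same pairwise idea can be sketched with `itertools.combinations` instead of hand-written counters. This is not the code from the answer above, just one possible variant; `pairs`, `differing` and `identical` are illustrative names, and the index pairs come out as (smaller, larger) rather than [larger, smaller].

```python
from itertools import combinations

import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

names = df['Name'].to_list()

# Every pair of row positions (i, j) with i < j that share the same Name.
pairs = [(i, j) for i, j in combinations(range(len(names)), 2)
         if names[i] == names[j]]

# Split the pairs into rows that differ somewhere and rows that are fully identical.
differing = [(i, j) for i, j in pairs if tuple(df.iloc[i]) != tuple(df.iloc[j])]
identical = [(i, j) for i, j in pairs if tuple(df.iloc[i]) == tuple(df.iloc[j])]

for i, j in differing:
    print(tuple(df.iloc[i]))
    print(tuple(df.iloc[j]))
```

Like the nested loops, this compares every pair of rows, so it gets slow for large dataframes; the groupby-based answers scale better.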