Say I have the following dataframe:
import pandas as pd
series = [('Stranger Things', 3, 'Millie'),
('Game of Thrones', 8, 'Emilia'),
('La Casa De Papel', 4, 'Sergio'),
('Westworld', 3, 'Evan Rachel'),
('Stranger Things', 3, 'Todd'),
('La Casa De Papel', 4, 'Sergio')]
# Create a DataFrame object
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
I am looking for a way to create a new dataframe, or even a list, containing the rows where the same 'Name' appears with more than one distinct 'Actor'.
In this example, I would like to get as a result:
Stranger Things, 3, Millie
Stranger Things, 3, Todd
I have tried the sort(), unique(), and distinct() methods without success. unique() always seems to drop the column that I am not querying on (in this case, 'Seasons').
Any help is appreciated!
CodePudding user response:
Do you need groupby with nunique?
df[df.groupby('Name')['Actor'].transform('nunique').gt(1)]
Name Seasons Actor
0 Stranger Things 3 Millie
4 Stranger Things 3 Todd
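To see why this works, it may help to look at the intermediate result. As far as I understand it, transform('nunique') broadcasts each group's count of distinct actors back onto every row, so the comparison gives a boolean mask aligned with df. A minimal sketch on the question's data (the names distinct_actors, result and rows are only for illustration):

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# Number of distinct actors per show, repeated on every row of that show
distinct_actors = df.groupby('Name')['Actor'].transform('nunique')

# Keep only the rows whose show has more than one distinct actor
result = df[distinct_actors.gt(1)]

# The question also mentions a list; plain tuples are one option
rows = list(result.itertuples(index=False, name=None))
```

Note that 'La Casa De Papel' is excluded even though it appears twice, because both rows have the same actor.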
CodePudding user response:
This will return a dataframe with those two rows that you show in your post:
actor_cts = df.drop_duplicates(subset=['Name','Actor']).groupby("Name")['Actor'].count()
df[df.Name.isin(actor_cts[actor_cts > 1].index)].reset_index(drop=True)
# Name Seasons Actor
# 0 Stranger Things 3 Millie
# 1 Stranger Things 3 Todd
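Broken into steps, the idea is roughly: drop repeated (Name, Actor) pairs first, so that count() ends up counting distinct actors per show, then keep the shows whose count is above 1. A sketch on the question's data (pairs and multi_actor_names are illustrative names, not part of the two-liner above):

```python
import pandas as pd

series = [('Stranger Things', 3, 'Millie'),
          ('Game of Thrones', 8, 'Emilia'),
          ('La Casa De Papel', 4, 'Sergio'),
          ('Westworld', 3, 'Evan Rachel'),
          ('Stranger Things', 3, 'Todd'),
          ('La Casa De Papel', 4, 'Sergio')]
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])

# One row per distinct (Name, Actor) pair; the repeated Sergio row is dropped
pairs = df.drop_duplicates(subset=['Name', 'Actor'])

# Distinct actors per show
actor_cts = pairs.groupby('Name')['Actor'].count()

# Shows that have more than one distinct actor
multi_actor_names = actor_cts[actor_cts > 1].index.tolist()

# All original rows belonging to those shows, with a fresh 0..n-1 index
result = df[df['Name'].isin(multi_actor_names)].reset_index(drop=True)
```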
CodePudding user response:
This is what I came up with based on your code. You could change the check_if_similar function to go through all of the DataFrame's keys too, but unfortunately I don't have the time to put all of that together. If you have any questions, feel free to ask.
import pandas as pd
def check_if_similar(things):
    # Collect [later_index, earlier_index] for every pair of equal items
    similar = []
    num1 = 0
    for _ in things:
        num2 = 0
        for _ in things:
            if num1 != num2:
                if things[num1] == things[num2] and num1 > num2:
                    similar.append([num1, num2])
            num2 += 1
        num1 += 1
    return similar
series = [('Stranger Things', 3, 'Millie'),
('Game of Thrones', 8, 'Emilia'), ('La Casa De Papel', 4, 'Sergio'),
('Westworld', 3, 'Evan Rachel'), ('Stranger Things', 3, 'Todd'),
('La Casa De Papel', 4, 'Sergio')]
# Create a DataFrame object
df = pd.DataFrame(series, columns=['Name', 'Seasons', 'Actor'])
info = check_if_similar(df['Name'].to_list())
for coords in info:
    # Both rows share a 'Name'; print them on two lines if they differ, else mark them as equal
    if str(df.values[coords[0]]) != str(df.values[coords[1]]):
        print(str(df.values[coords[0]]) + "\n" + str(df.values[coords[1]]))
    else:
        print(str(df.values[coords[0]]) + " == " + str(df.values[coords[1]]))