Home > Mobile >  Pandas GroupBy, to compare the new column of list of numbers, to another column of list of numbers
Pandas GroupBy, to compare the new column of list of numbers, to another column of list of numbers

Time:07-13

A data frame is as the left of screenshot below.

I want to GroupBy the names, and find out which numbers (compared to [1,2,3,4,5]) are missing.

The ideal output is as the right of screenshot.

I've tried below codes. But the column 'Number of stars' after the GroupBy, is being taken as a list of strings. So it doesn't perform the comparison.

Any help in the way to fix it, please? Thank you.

enter image description here

import pandas as pd
from io import StringIO

csvfile = StringIO("""
Name    Number of stars
Benjamin    1,3,2,1,2
Benjamin    2,5,1,3
Emma    2,1,1,4,4,2
Ethan   2,5,4
Emma    2,2,2
Ethan   5,4,4,1,1,1
Olivia  4,1,3,5""")

df = pd.read_csv(csvfile, sep = '\t', engine='python')

df_1 = df.groupby('Name')['Number of stars'].apply(list)

df_1 = df_1.to_frame().reset_index()

df_1['all stars'] = pd.Series([list(range(1,6)) for x in range(len(df_1.index))])
df_1['diff'] = df_1['all stars'].map(set) - df_1['Number of stars'].map(set)

print (df_1)

Output:

   Name       Number of stars        all stars             diff
0  Benjamin  [1,3,2,1,2, 2,5,1,3]  [1, 2, 3, 4, 5]  {1, 2, 3, 4, 5}
1      Emma  [2,1,1,4,4,2, 2,2,2]  [1, 2, 3, 4, 5]  {1, 2, 3, 4, 5}
2     Ethan  [2,5,4, 5,4,4,1,1,1]  [1, 2, 3, 4, 5]  {1, 2, 3, 4, 5}
3    Olivia             [4,1,3,5]  [1, 2, 3, 4, 5]  {1, 2, 3, 4, 5}

CodePudding user response:

Let us group the dataframe by Name and aggregate Number of stars with custom lambda function that returns the required set difference:

s = set(range(1, 6))
df.groupby('Name')['Number of stars']\
  .agg(lambda x: s - set(int(z) for y in x for z in y.split(',')))

Name
Benjamin       {4}
Emma        {3, 5}
Ethan          {3}
Olivia         {2}
Name: Number of stars, dtype: object

CodePudding user response:

Try:

out = (
    df.groupby("Name")["Number of stars"]
    .apply(
        lambda x: ",".join(set("12345").difference(x.str.split(",").explode()))
    )
    .reset_index(name="diff")
)
print(out)

Prints:

       Name diff
0  Benjamin    4
1      Emma  3,5
2     Ethan    3
3    Olivia    2

CodePudding user response:

You can convert comma separated string to list of string then explode and groupby to get set

df['Number of stars'] = df['Number of stars'].str.split(',')
df = df.explode('Number of stars')
df['Number of stars'] = df['Number of stars'].astype(int)

df_1 = (df.groupby('Name')['Number of stars'].apply(set)
        .to_frame().reset_index())
df_1['diff'] = df_1['Number of stars'].apply(lambda x: set(range(1,6))-x)
print(df_1)

       Name Number of stars    diff
0  Benjamin    {1, 2, 3, 5}     {4}
1      Emma       {1, 2, 4}  {3, 5}
2     Ethan    {1, 2, 4, 5}     {3}
3    Olivia    {1, 3, 4, 5}     {2}
  • Related