Join unique values in a column based on intersection of other columns in pandas

Let's say I have the following Dataframe:

import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                         "bar", "bar", "bar", "bar", "bar"],
                   "B": ["one", "one", "one", "two", "two",
                         "one", "one", "two", "two", "two"],
                   "C": ["small", "large", "large", "small",
                         "small", "large", "small", "small",
                         "large", "large"],
                   "D": [1, 2, 3, 4, 5, 6, 7, 8, 9, 99999]})

I'd like to join (concatenate? merge?) the values in column "D" wherever the values in columns "A", "B" and "C" intersect. By intersection, I mean that I want to end up with this DataFrame:

     A    B      C        D
0  foo  one  small        1
1  foo  one  large      2,3
2  foo  two  small      4,5
3  bar  one  large        6
4  bar  one  small        7
5  bar  two  small        8
6  bar  two  large  9,99999

There are aggregation functions like min, max, sum, etc., but I couldn't come up with a solution at all.

CodePudding user response:

Convert column D to strings so that it can be aggregated with ','.join in GroupBy.agg:

df1 = (df.assign(D = df.D.astype(str))
        .groupby(['A','B','C'], sort=False)['D']
        .agg(','.join)
        .reset_index())
print(df1)
     A    B      C        D
0  foo  one  small        1
1  foo  one  large      2,3
2  foo  two  small      4,5
3  bar  one  large        6
4  bar  one  small        7
5  bar  two  small        8
6  bar  two  large  9,99999

Or use a lambda function:

df1 = (df.groupby(['A','B','C'], sort=False)['D']
        .agg(lambda x: ','.join(x.astype(str)))
        .reset_index())
print(df1)
     A    B      C        D
0  foo  one  small        1
1  foo  one  large      2,3
2  foo  two  small      4,5
3  bar  one  large        6
4  bar  one  small        7
5  bar  two  small        8
6  bar  two  large  9,99999

If there can be duplicated values in D per group and you only want unique values, add DataFrame.drop_duplicates or Series.unique:

df2 = (df.assign(D = df.D.astype(str))
         .drop_duplicates(['A','B','C','D'])
         .groupby(['A','B','C'], sort=False)['D']
         .agg(','.join)
         .reset_index())

Or use Series.unique:

df2 = (df.groupby(['A','B','C'], sort=False)['D']
        .agg(lambda x: ','.join(x.astype(str).unique()))
        .reset_index())
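
For example, here is a minimal sketch on a small hypothetical frame (not the original data) with a repeated D value in one group, showing that the unique variant keeps each value only once:

# hypothetical frame: 3 appears twice within the ("foo", "one", "large") group
df = pd.DataFrame({"A": ["foo", "foo", "foo"],
                   "B": ["one", "one", "one"],
                   "C": ["large", "large", "large"],
                   "D": [2, 3, 3]})

df2 = (df.groupby(['A','B','C'], sort=False)['D']
        .agg(lambda x: ','.join(x.astype(str).unique()))
        .reset_index())
print(df2)
     A    B      C    D
0  foo  one  large  2,3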