python, pandas: How to specify multiple columns and merge only specific columns of duplicate rows-CodePudding

You can concatenate specific column values in a multi-column duplicate row by doing the following, but all columns other than those specified in the groupby will disappear.

The columns title, thumbnail, name, created_at are present.

Retrieve the rows with duplicate title and thumbnail as follows and concatenate the values of the name column of the duplicated row while keeping the first row. I want to create a new column for it.

However, as mentioned earlier, columns other than those specified in groupby will disappear.

df.groupby(['title', 'thumbnail'])['name'].apply(lambda x: ' '.join(x)).reset_index()

CodePudding user response：

Suppose the following dataframe:

>>> df
    title thumbnail   name   created_at
0  title1    thumb1  name1        today
1  title1    thumb1  name2    yesterday
2  title1    thumb2  name3  another day

The output of your code is:

>>> df.groupby(['title', 'thumbnail'], as_index=False)['name'] \
      .apply(' '.join)
    title thumbnail         name
0  title1    thumb1  name1 name2
1  title1    thumb2        name3

If you don't want to lost columns and rows (keep the shape), use transform:

df['name'] = df.groupby(['title', 'thumbnail'])['name'] \
               .transform(' '.join)
print(df)

# Output:
    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb1  name1 name2    yesterday
2  title1    thumb2        name3  another day

Else you have to make a choice with other columns to keep them. In this case, do you want to keep 'today' or 'yesterday' for created_at? To do that, you can use agg:

>>> df.groupby(['title', 'thumbnail']) \
      .agg({'name': ' '.join, 'created_at': 'first'}) \
      .reset_index()

    title thumbnail         name   created_at
0  title1    thumb1  name1 name2        today
1  title1    thumb2        name3  another day

Setup:

data = {'title': ['title1', 'title1', 'title1'],
        'thumbnail': ['thumb1', 'thumb1', 'thumb2'],
        'name': ['name1', 'name2', 'name3'],
        'created_at': ['today', 'yesterday', 'another day']}
df = pd.DataFrame(data)

CodePudding user response：

Using a toy DataFrame for illustration:

df = pd.DataFrame({
    'title': ['tom', 'tom', 'tom', 'mark', 'mark', 'lewis'],
    'name': list('abcdef'),
    'marks': [55, 99, 14, 28, 19, 88]
})

In any case we will need to group:

groups = df.groupby(['title', 'thumbnail'])

Here is a neat solution using a join:

groups.first().join(groups['name'].agg(' '.join), rsuffix='s')

A more efficient solution would get the name aggregation and the rest of the columns in a single pass:

def process(group):
  result = group.iloc[0] # take the first row
  # then add a concatenation of all names for this group
  result['schools'] = ' '.join(group['name'])
  # return the result data frame with a single row
  return result

This could also be done in a single line:

def process(group):
  return group.iloc[[0]].assign(names=' '.join(group['name']))

Then simply apply the helper function to all the groups:

groups.apply(process)

The two methods get the same results:

       title   name  marks   names
name                             
lewis  lewis      f     88       f
mark    mark      d     28     d e
tom      tom      a     55   a b c

CodePudding user response：

That's because you're selecting the name column via [''], so by definition, the only columns available are going to be the columns that make up the index (which is required) and the column you're selecting.

Instead of calling apply on the ["name"] column of the groupby, call apply directly on the groupby:

df.groupby(['title', 'thumbnail']).apply(lambda x: ' '.join(x['name'])).reset_index()