Merge column data from some rows with Pandas-CodePudding

I have a CSV file and I want to merge the data of some columns from some rows.

For example I have :

Name	ID	Data	Parent
Name1	100	Blue	001
Name1	101	Grey	001
Name1	102	Grey	001
Name1	103	Black	001
Name2	200	Red	002
Name2	201	Green	002

And I would like to have :

Name	ID	Data	Parent
Name1	100	Blue	001
Name1	101	Grey	001
Name1	102	Grey	001
Name1	103	Black	001
Name2	200	Red	002
Name2	201	Green	002
Name1	001	Blue,Grey,Black
Name2	002	Red,Green

My file have 62,000 rows. I can change some headers, add/delete/modify columns and all in less than 5 seconds. But on this part I'm getting a bit stuck. Do you have any idea how to do this?

Thank you for your help !

CodePudding user response：

You can get the list of unique colors per parent with groupby.unique, and then make them a single value with .str.join:

>>> parents = df.groupby(['Name', 'Parent'])['Data'].unique().str.join(',')
>>> parents
Name    Parent
Name1   001       Blue,Grey,Black
Name2   002             Red,Green
Name: Data, dtype: object

Then some tweaking to make it fit the initial dataframe’s format, and then we can simply append it:

>>> parents = parents.reset_index().rename(columns={'Parent': 'ID'}).assign(Parent='')
>>> parents
     Name   ID             Data Parent
0  Name1   001  Blue,Grey,Black       
1  Name2   002        Red,Green       
>>> df.append(parents)
     Name   ID             Data Parent
0  Name1   100             Blue    001
1  Name1   101             Grey    001
2  Name1   102             Grey    001
3  Name1   103            Black    001
4  Name2   200              Red    002
5  Name2   201            Green    002
0  Name1   001  Blue,Grey,Black       
1  Name2   002        Red,Green

CodePudding user response：

You can use groupby agg to compute the new rows, then pandas.concat to merge it with the original dataframe:

pd.concat([df,
           (df.groupby('Parent')
              .agg({'Name': 'first', 'Parent': 'first', 'Data': ','.join})
              .rename(columns={'Parent': 'ID'})
              .reset_index(drop=True)
           )
          ]).fillna('') # optional fillna to remove the NaNs

    Name   ID                  Data Parent
0  Name1  100                  Blue    001
1  Name1  101                  Grey    001
2  Name1  102                  Grey    001
3  Name1  103                 Black    001
4  Name2  200                   Red    002
5  Name2  201                 Green    002
0  Name1  001  Blue,Grey,Grey,Black       
1  Name2  002             Red,Green

CodePudding user response：

please use merge command in pandas

df1.merge(df2, left_on='lkey', right_on='rkey')

In pandas merge these merge types are available

left’, ‘right’, ‘outer’, ‘inner’, ‘cross’

Read this link for further details