Joining multiple rows into comma separated strings by group in Python

Time:10-08

I have a dataframe similar to the one below:

ID Award Type Date
01 PELL FED 2021-06-01
01 SCH LOC 2021-06-01
02 SCH LOC 2021-06-04
03 GRANT STA 2021-06-02
03 PELL FED 2021-06-15
03 SCH LOC 2021-07-01

I want to convert this to the following dataframe:

ID Award Type Date
01 PELL, SCH FED, LOC 2021-06-01, 2021-06-01
02 SCH LOC 2021-06-04
03 GRANT, PELL, SCH STA, FED, LOC 2021-06-02, 2021-06-15, 2021-07-01

In this case, I am grouping by ID. However, there may be more than one grouping column (for instance, ID and TERM instead of just ID). I have the grouping columns stored in a list variable named 'keys'.

Each grouping could have a different number of items, but the non-grouping columns all have the same number of items for each group (for ID=01 in the dataframes above, the other columns will all have 2 rows).

All the columns are read in as strings from a csv file (using dtype=str on pd.read_csv) to prevent changing any of the values from their original value (I don't want numbers reinterpreted or date formats changed).

I have tried the two approaches below, but neither produces the result I want.

import pandas as pd 
keys = ['ID']
df = pd.DataFrame({
    'ID' : ['01','01','02','03','03','03'],
    'Award' : ['PELL','SCH','SCH','GRANT','PELL','SCH'],
    'Type' : ['FED','LOC','LOC','STA','FED','LOC'],
    'Date' : ['2021-06-01','2021-06-01','2021-06-04','2021-06-02','2021-06-15','2021-07-01'],
})
dfb = df.groupby(keys).apply(', '.join) # This results in the column names being joined together, not the column values
dfc = df.groupby(keys).agg(list) # This results in lists instead of concatenated strings

Output:

dfb
ID
01    ID, Award, Type, Date
02    ID, Award, Type, Date
03    ID, Award, Type, Date
dtype: object

dfc
                 Award             Type                                  Date
ID                                                                           
01         [PELL, SCH]       [FED, LOC]              [2021-06-01, 2021-06-01]
02               [SCH]            [LOC]                          [2021-06-04]
03  [GRANT, PELL, SCH]  [STA, FED, LOC]  [2021-06-02, 2021-06-15, 2021-07-01]

I would love a pointer as to where to go with this. I am still struggling with this aspect of Python, for sure!

CodePudding user response:

Try this -

  1. Create a dictionary whose keys are all the columns except the grouping column (ID) and whose values are the aggregation function, for example lambda x: list(x).
  2. Use groupby with agg and pass that dictionary, so that each column is aggregated with its own function.
  3. To get a concatenated string instead of a list, change the lambda to lambda x: ', '.join(list(x)).


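# Map every non-grouping column to a function that joins its values
g = {col: (lambda x: ', '.join(x)) for col in df.columns if col not in keys}

# keys is the list of grouping columns from the question, here ['ID']
output = df.groupby(keys).agg(g).reset_index()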
print(output)
   ID             Award           Type                                Date
0  01         PELL, SCH       FED, LOC              2021-06-01, 2021-06-01
1  02               SCH            LOC                          2021-06-04
2  03  GRANT, PELL, SCH  STA, FED, LOC  2021-06-02, 2021-06-15, 2021-07-01
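The dictionary form is mostly useful when the columns need different treatment. As a minimal sketch, assuming (purely for illustration, this is not part of the question) that you wanted the earliest date per group rather than all of them, the dictionary could mix a join with pandas' built-in 'min' aggregation:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Award': ['PELL', 'SCH', 'SCH', 'GRANT', 'PELL', 'SCH'],
    'Type': ['FED', 'LOC', 'LOC', 'STA', 'FED', 'LOC'],
    'Date': ['2021-06-01', '2021-06-01', '2021-06-04',
             '2021-06-02', '2021-06-15', '2021-07-01'],
})
keys = ['ID']

# One aggregation per column: join Award and Type, but keep only the
# smallest Date string (illustrative assumption; ISO dates sort correctly
# as plain strings, so no date parsing takes place).
agg_funcs = {
    'Award': ', '.join,
    'Type': ', '.join,
    'Date': 'min',
}

output = df.groupby(keys).agg(agg_funcs).reset_index()
print(output)
```

If every column gets the same function, the dictionary is not needed.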

EDIT:

If the goal is only to get a comma-separated string for every column, a shorter way, as suggested by @Henry Ecker, is:

output = df.groupby(['ID'], as_index=False).agg(', '.join)

This passes the string method ', '.join directly to agg, which applies it to each non-grouping column of each group, so no dictionary or lambda is needed.
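The same call should work with more than one grouping column, which the question asks about. Below is a minimal sketch; the TERM column and its values are invented for illustration and are not from the original data. It also hints at why the apply attempt in the question failed: apply hands each group to the function as a whole DataFrame, and iterating a DataFrame yields its column labels, whereas agg calls the function on each column separately.

```python
import pandas as pd

# Hypothetical data: TERM is an invented second grouping column.
df = pd.DataFrame({
    'ID':    ['01', '01', '01', '02'],
    'TERM':  ['FA21', 'FA21', 'SP22', 'FA21'],
    'Award': ['PELL', 'SCH', 'PELL', 'SCH'],
    'Type':  ['FED', 'LOC', 'FED', 'LOC'],
})
keys = ['ID', 'TERM']

# Iterating a DataFrame yields column labels, so joining a whole
# DataFrame joins the column names rather than the values.
print(', '.join(df))

# agg applies ', '.join to each non-grouping column within each group;
# as_index=False keeps the grouping columns as ordinary columns.
output = df.groupby(keys, as_index=False).agg(', '.join)
print(output)
```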

CodePudding user response:

You can use pivot_table for exactly this:

df.pivot_table(index="ID", values=["Award", "Type", "Date"], aggfunc=lambda x: ", ".join(x))

result:

               Award                                Date           Type
ID
01         PELL, SCH              2021-06-01, 2021-06-01       FED, LOC
02               SCH                          2021-06-04            LOC
03  GRANT, PELL, SCH  2021-06-02, 2021-06-15, 2021-07-01  STA, FED, LOC

If the arguments are confusing (pivots can take some getting used to): index is the column (or list of columns) that the resulting table is indexed by, values are the columns you want to aggregate for each index value, and aggfunc is a function that reduces each group's collection of values to a single value.
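Note that the result above has its columns in alphabetical order (Award, Date, Type) rather than the original order. A minimal sketch of one way to adapt this to the keys list from the question and restore the original column order; the name value_cols is introduced here for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['01', '01', '02', '03', '03', '03'],
    'Award': ['PELL', 'SCH', 'SCH', 'GRANT', 'PELL', 'SCH'],
    'Type': ['FED', 'LOC', 'LOC', 'STA', 'FED', 'LOC'],
    'Date': ['2021-06-01', '2021-06-01', '2021-06-04',
             '2021-06-02', '2021-06-15', '2021-07-01'],
})
keys = ['ID']

# Every column that is not a grouping column gets aggregated.
value_cols = [c for c in df.columns if c not in keys]

result = df.pivot_table(index=keys, values=value_cols,
                        aggfunc=lambda x: ', '.join(x))

# Select the value columns in their original order, then turn the
# index back into ordinary columns.
result = result[value_cols].reset_index()
print(result)
```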
