I have a dataframe similar to the one below:
ID | Award | Type | Date |
---|---|---|---|
01 | PELL | FED | 2021-06-01 |
01 | SCH | LOC | 2021-06-01 |
02 | SCH | LOC | 2021-06-04 |
03 | GRANT | STA | 2021-06-02 |
03 | PELL | FED | 2021-06-15 |
03 | SCH | LOC | 2021-07-01 |
I want to convert this to the following dataframe:
ID | Award | Type | Date |
---|---|---|---|
01 | PELL, SCH | FED, LOC | 2021-06-01, 2021-06-01 |
02 | SCH | LOC | 2021-06-04 |
03 | GRANT, PELL, SCH | STA, FED, LOC | 2021-06-02, 2021-06-15, 2021-07-01 |
In this case, I am grouping by ID. However, there may be more than one grouping column (for instance, ID and TERM instead of just ID). I have the grouping columns stored in a list variable named 'keys'.
Each group can contain a different number of rows, but within a group the non-grouping columns all have the same number of values (for ID=01 in the dataframe above, each of the other columns has 2 values).
All the columns are read in as strings from a CSV file (using dtype=str on pd.read_csv) so that the original values are preserved (I don't want numbers reinterpreted or date formats changed).
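For context, the read is along these lines (the filename here is just a placeholder):
import pandas as pd
# Read every column as a string so numbers and dates keep their original text form.
df = pd.read_csv('awards.csv', dtype=str)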
I have tried something along the lines of the following, but none of these seem to work.
import pandas as pd
keys = ['ID']
df = pd.DataFrame({
'ID' : ['01','01','02','03','03','03'],
'Award' : ['PELL','SCH','SCH','GRANT','PELL','SCH'],
'Type' : ['FED','LOC','LOC','STA','FED','LOC'],
'Date' : ['2021-06-01','2021-06-01','2021-06-04','2021-06-02','2021-06-15','2021-07-01'],
})
dfb = df.groupby(keys).apply(', '.join) # This results in the column names being joined together, not the column values
dfc = df.groupby(keys).agg(list) # This results in lists instead of concatenated strings
Output:
dfb
ID
01 ID, Award, Type, Date
02 ID, Award, Type, Date
03 ID, Award, Type, Date
dtype: object
dfc
Award Type Date
ID
01 [PELL, SCH] [FED, LOC] [2021-06-01, 2021-06-01]
02 [SCH] [LOC] [2021-06-04]
03 [GRANT, PELL, SCH] [STA, FED, LOC] [2021-06-02, 2021-06-15, 2021-07-01]
I would love a pointer as to where to go with this. I am still struggling with this aspect of Python, for sure!
CodePudding user response:
Try this -
- Create a dictionary that has all the required columns except `ID` as keys and `lambda x: list(x)` as the function.
- Use `groupby` with `agg` to apply the independent functions on each column.
- If you want to convert the `list` to a concatenated string, then just change the lambda function to `lambda x: ', '.join(list(x))`.
More details on how to work with complex groupby and aggregates can be found on my blog here, if you are interested.
# Build one aggregation function per column, for every column except ID.
g = {i: lambda x: ', '.join(list(x)) for i in df.columns[1:]}
# Apply the per-column aggregations, then turn the group labels back into a column.
output = df.groupby(['ID']).agg(g).reset_index()
print(output)
ID Award Type Date
0 01 PELL, SCH FED, LOC 2021-06-01, 2021-06-01
1 02 SCH LOC 2021-06-04
2 03 GRANT, PELL, SCH STA, FED, LOC 2021-06-02, 2021-06-15, 2021-07-01
EDIT:
If the goal is only to get a comma-separated string, then a shorter way, as suggested by @Henry Ecker, is to pass the join method directly to the aggregate:
output = df.groupby(['ID'], as_index=False).agg(', '.join)
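Since the grouping columns live in a `keys` list in the question (and might later be more than one column, e.g. ID and TERM), the same pattern generalizes directly; a minimal sketch, assuming every non-grouping column is a string column:
# `keys` holds the grouping columns; it can be ['ID'] or, say, ['ID', 'TERM'].
keys = ['ID']
# Every remaining column is joined into one comma-separated string per group.
output = df.groupby(keys, as_index=False).agg(', '.join)
print(output)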
CodePudding user response:
You can use `pivot_table` for exactly this:
df.pivot_table(index="ID", values=["Award", "Type", "Date"], aggfunc=lambda x: ", ".join(x))
result:
Award Date Type
ID
01 PELL, SCH 2021-06-01, 2021-06-01 FED, LOC
02 SCH 2021-06-04 LOC
03 GRANT, PELL, SCH 2021-06-02, 2021-06-15, 2021-07-01 STA, FED, LOC
If the arguments are confusing (pivots can take some getting used to): `index` is the column you want the resulting table to be indexed by, `values` are the columns you want to aggregate in some way, keyed by the `index` values, and `aggfunc` is a function that converts each such collection/list to a single value.
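Note that `index` also accepts a list of columns, so the pivot works with the `keys` variable from the question too; a short sketch, with `reset_index` added so the group labels come back as ordinary columns:
# `keys` can name one or several grouping columns, e.g. ['ID'] or ['ID', 'TERM'].
result = df.pivot_table(
    index=keys,
    values=[c for c in df.columns if c not in keys],
    aggfunc=lambda x: ', '.join(x),
).reset_index()
print(result)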