I have pandas data frame table like this:
Unique ID | Count | Samples |
---|---|---|
A-1 | 3 | 6.1 (Apple) |
A-1 | 3 | 4.5 (Banana) |
A-1 | 3 | 5.6 (Kiwi) |
A-2 | 2 | 7.2 (Orange) |
A-2 | 2 | 3.6 (Mango) |
I would like to convert the column "Samples" to a list/array and obtain only the first occurrence of the other columns. The resulting table should look like this
Unique ID | Count | SampleSummary |
---|---|---|
A-1 | 3 | 6.1 (Apple), 4.5 (Banana), 5.6 (Kiwi) |
A-2 | 2 | 7.2 (Orange), 3.6(Mango) |
So far, this is what I have in my code:
for i in range(len(dataframe['unique_id'])):
SampleSummary= np.asarray(pd.concat([dataframe['Sample']]))
CodePudding user response:
Create dictionary for aggregate by GroupBy.first
without column Unique ID
, for column Samples
aggregate by join
by GroupBy.agg
:
d = dict.fromkeys(dataframe.columns.difference(['Unique ID']), 'first')
d['Samples'] = ', '.join
df = dataframe.groupby('Unique ID', as_index=False).agg(d)
print (df)
Unique ID Count Samples
0 A-1 3 6.1 (Apple), 4.5 (Banana), 5.6 (Kiwi)
1 A-2 2 7.2 (Orange), 3.6 (Mango)
Or is possible use GroupBy.transform
for column filled by joined strings and then use DataFrame.drop_duplicates
:
dataframe['Samples'] = dataframe.groupby('Unique ID')['Samples'].transform(', '.join)
df = dataframe.drop_duplicates('Unique ID')
print (df)
Unique ID Count Samples
0 A-1 3 6.1 (Apple), 4.5 (Banana), 5.6 (Kiwi)
3 A-2 2 7.2 (Orange), 3.6 (Mango)