I need some help concatenating some strings from a DF.
I have the below "original" df:
df = pd.DataFrame({'Material': [1,1,1,2,2,2,3,3,3,3],
'BOM': ['A','B',np.nan,'A',np.nan,'C','A','A','B','C']})
Material | BOM | |
---|---|---|
0 | 1 | A |
1 | 1 | B |
2 | 1 | NaN |
3 | 2 | A |
4 | 2 | NaN |
5 | 2 | C |
6 | 3 | A |
7 | 3 | A |
8 | 3 | B |
9 | 3 | C |
Expected Result is:
Material | BOM |
---|---|
1 | A, B |
2 | A, C |
3 | A, B, C |
I'm new with Python and am thinking of doing it with a mix of Loop and Dict but definitely think there must be a better way.
The tricky bit is that I think I need to group by material and check column "BOM" for the distinct options (different than NaN) then some sort of ", ".join them.
I've tried the following:
df.groupby('Material')['BOM'].apply(', '.join)
And got the error: "TypeError: sequence item 2: expected str instance, float found"
If anyone could help me I would be really grateful
CodePudding user response:
You can dropna
and drop_duplicates
, then GroupBy.agg
:
df.dropna().drop_duplicates().groupby('Material').agg(', '.join)
or:
df.dropna().groupby('Material').agg(lambda s: ', '.join(s.drop_duplicates()))
output:
BOM
Material
1 A, B
2 A, C
3 A, B, C
CodePudding user response:
3 solutions depending on output requirements:
Using .agg
:
df.dropna().drop_duplicates().groupby('Material').agg(', '.join)
Output:
BOM
Material
1 A, B
2 A, C
3 A, B, C
Using .apply(list)
:
df.dropna().drop_duplicates().groupby('Material')['BOM'].apply(list).reset_index()
Output:
Material BOM
0 1 [A, B]
1 2 [A, C]
2 3 [A, B, C]
Using .apply(set)
: df.dropna().groupby('Material')['BOM'].apply(set).reset_index()
Output:
Material BOM
0 1 {A, B}
1 2 {A, C}
2 3 {A, C, B}