I'm new to Python and could use some help.
Say I have a dataset that looks like this:
Serial Number Source
0 AB100 Donatelle
1 AB200 Qure
2 AB100 Donatelle
3 AB200 Qure
4 AB100 Grand Avenue
5 AB200 Eagle Services
6 AB300 Donatelle
7 AB300 Donatelle
8 AB100 Qure
9 AB100 Eagle Services
And I need to add a column to it, like this:
Serial Number Source SN Data Sources
0 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
1 AB200 Qure Qure, Eagle Services
2 AB100 Donatelle Donatelle, Grand Avenue, Qure, Eagle Services
3 AB200 Qure Qure, Eagle Services
4 AB100 Grand Avenue Donatelle, Grand Avenue, Qure, Eagle Services
5 AB200 Eagle Services Qure, Eagle Services
6 AB300 Donatelle Donatelle
7 AB300 Donatelle Donatelle
8 AB100 Qure Donatelle, Grand Avenue, Qure, Eagle Services
9 AB100 Eagle Services Donatelle, Grand Avenue, Qure, Eagle Services
My knowledge is still limited, pardon me.
I'm working with a 40k rows dataframe and I need to generate a column that brings all different sources that are present in the dataframe for the serial number in each row.
Can anybody help me out? Thanks and kind regards
CodePudding user response:
You can groupby on "Serial Number" column and apply a list to "Source" column.
Next create a dictionary of the grouped df and convert it to a df.
Finally merge the dfs together and clean up the columns.
data = {
"Serial Number": ["AB100", "AB200", "AB100", "AB200", "AB100", "AB200"],
"Source": ["Donatelle", "Qure", "Grand Avenue", "Eagle Services", "Qure", "Grand Avenue"]
}
df = pd.DataFrame(data)
grouped_df = df.groupby("Serial Number")["Source"].apply(list).reset_index()
mapping = grouped_df.set_index("Serial Number")["Source"].to_dict()
mapping_df = pd.DataFrame.from_dict(mapping, orient="index").unstack().reset_index()
final_df = pd.merge(
grouped_df,
mapping_df,
left_on="Serial Number",
right_on="level_1"
).rename(columns={0: "Source", "Source": "SN Data Sources"}[["Serial Number", "Source", "SN Data Sources"]]
print(final_df)
Serial Number Source SN Data Sources
0 AB100 Donatelle [Donatelle, Grand Avenue, Qure]
1 AB100 Grand Avenue [Donatelle, Grand Avenue, Qure]
2 AB100 Qure [Donatelle, Grand Avenue, Qure]
3 AB200 Qure [Qure, Eagle Services, Grand Avenue]
4 AB200 Eagle Services [Qure, Eagle Services, Grand Avenue]
5 AB200 Grand Avenue [Qure, Eagle Services, Grand Avenue]
CodePudding user response:
df = pd.DataFrame(
{'Serial Number': ['AB100', 'AB200', 'AB100', 'AB200', 'AB100', 'AB200', 'AB300', 'AB300', 'AB100', 'AB100'],
'Source': ['Donatelle', 'Qure', 'Donatelle', 'Qure', 'Grand Avenue', 'Eagle Services', 'Donatelle', 'Donatelle',
'Qure', 'Eagle Services']})
df = df.merge(df.groupby('Serial Number').agg(set)['Source'].apply(lambda x: ', '.join(x)), how='left',
on='Serial Number').rename(columns={'Source_x': 'Source', 'Source_y': 'SN Data Sources'})
print(df)
Prints:
Serial Number Source SN Data Sources
0 AB100 Donatelle Donatelle, Eagle Services, Grand Avenue, Qure
1 AB200 Qure Eagle Services, Qure
2 AB100 Donatelle Donatelle, Eagle Services, Grand Avenue, Qure
3 AB200 Qure Eagle Services, Qure
4 AB100 Grand Avenue Donatelle, Eagle Services, Grand Avenue, Qure
5 AB200 Eagle Services Eagle Services, Qure
6 AB300 Donatelle Donatelle
7 AB300 Donatelle Donatelle
8 AB100 Qure Donatelle, Eagle Services, Grand Avenue, Qure
9 AB100 Eagle Services Donatelle, Eagle Services, Grand Avenue, Qure