I am re-writing a VBScript that I wrote in Excel a while ago in Python Pandas. The reason for this is due to it breaking regularly as using VBS Macros in an Apple version of Microsoft Office is unstable.
The script creates three separate CSVs from one large XLS export from our database which are for three different Mail lists. unfortunately, we have no access to edit how the export is made so I am trying to make it quick and easy for the people who need these mail lists updated regularly.
Due to it containing confidential and personal information I won't use real data but will make some dummy data instead.
I am up to a part where there is one contact person for two clients and I need to merge and join this data. Getting it to the file is the easy part but merging and joining is the hard part. Here is an example of what I'm trying to do:
In words, I want pandas to detect a duplicate contact merge the identical data and add a Join with an &
separation in the unique data being joined.
RAW:
Last Name | First Name | Preferred Name | Type | ID | Contact First Name | Contact Last Name | |
---|---|---|---|---|---|---|---|
TMart | Bob | Bob | TypeA | 22 | Sully | TMart | [email protected] |
TMart | Jannet | Jan | TypeB | 23 | Sully | TMart | [email protected] |
Required Output:
Last Name | First Name | Preferred Name | Type | ID | Contact First Name | Contact Last Name | |
---|---|---|---|---|---|---|---|
TMart | Bob & Jannet | Bob & Jan | TypeA & TypeB | 22 & 23 | Sully | TMart | [email protected] |
Below is the line of Code that would work for one Collumn but if I added multiple to it, it would break and just merge the column names:
df2 = df1.groupby(['Last Name','ID','Contact First Name','Contact Last Name','Email'])['First Name'].apply(' & '.join).reset_index()
Which would Output exactly what I want: Bob & Jannet
But, if I added more to the .apply object.
df2 = df1.groupby(['Last Name','ID','Contact First Name','Contact Last Name','Email'])['First Name','ID'].apply(' & '.join).reset_index()
It would output First Name & ID
instead of the data in those columns.
I am not fully sure this method will be the right one as two people could have the same Type
and I don't want it to join those together.
Sorry if I am not clear enough, did my best to explain what I was aiming for. Doesn't have to be written this way, happy to be shown a new and more efficient way but instead of spending days figuring it out I thought I would ask the experts.
Thanks!
CodePudding user response:
Use GroupBy.agg
and add nested lists for processing columns by aggregate function []
:
df2 = df1.groupby(['Last Name','ID','Contact First Name','Contact Last Name','Email'])[['First Name','ID']].agg(' & '.join).reset_index()