I have two datasets that I am doing a left outer merge on in Pandas. Here's the first:
Name Address
0 Joe Schmuckatelli 123 Main Street
1 Fred Putzarelli 456 Pine Street
2 Harry Cox 789 Vine Street
And the second:
Address InvoiceNum
0 123 Main Street 51450
1 456 Pine Street 51389
2 789 Vine Street 58343
3 123 Main Street 52216
4 456 Pine Street 53124-001
5 789 Vine Street 61215
6 789 Vine Street 51215-001
The merged data looks like this:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450
1 Joe Schmuckatelli 123 Main Street 52216
2 Fred Putzarelli 456 Pine Street 51389
3 Fred Putzarelli 456 Pine Street 53124-001
4 Harry Cox 789 Vine Street 58343
5 Harry Cox 789 Vine Street 61215
6 Harry Cox 789 Vine Street 51215-001
Ideally I would like to have one line per address with all of the invoice numbers for that address in the third column, like this:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450, 52216
1 Fred Putzarelli 456 Pine Street 51389, 53124-001
2 Harry Cox 789 Vine Street 58343, 61215, 51215-001
The code I used to merge the data looks like this:
mergedData = pd.merge(complaintData, invoiceData, on='Address', how='left')
Is there a way to do this easily in Pandas or some other way?
CodePudding user response:
We can groupby aggregate
the values in df2
by joining strings together for each Address before join
/ merge
with df1
:
new_df = df1.join(
df2.groupby('Address')['InvoiceNum'].aggregate(', '.join),
on='Address',
how='left'
)
new_df
:
Name Address InvoiceNum
0 Joe Schmuckatelli 123 Main Street 51450, 52216
1 Fred Putzarelli 456 Pine Street 51389, 53124-001
2 Harry Cox 789 Vine Street 58343, 61215, 51215-001
*Either join
or merge
work here, although join
has slightly less overhead in this case since the result of groupby
has Address
as the index.
Setup:
import pandas as pd
df1 = pd.DataFrame({
'Name': ['Joe Schmuckatelli', 'Fred Putzarelli', 'Harry Cox'],
'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street']
})
df2 = pd.DataFrame({
'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street',
'123 Main Street', '456 Pine Street', '789 Vine Street',
'789 Vine Street'],
'InvoiceNum': ['51450', '51389', '58343', '52216', '53124-001', '61215',
'51215-001']
})