Home > OS >  Consolidating data in Pandas
Consolidating data in Pandas

Time:11-02

I have two datasets that I am doing a left outer merge on in Pandas. Here's the first:

                Name          Address
0  Joe Schmuckatelli  123 Main Street
1    Fred Putzarelli  456 Pine Street
2          Harry Cox  789 Vine Street

And the second:

           Address InvoiceNum
0  123 Main Street      51450
1  456 Pine Street      51389
2  789 Vine Street      58343
3  123 Main Street      52216
4  456 Pine Street  53124-001
5  789 Vine Street      61215
6  789 Vine Street  51215-001

The merged data looks like this:

                Name          Address InvoiceNum
0  Joe Schmuckatelli  123 Main Street      51450
1  Joe Schmuckatelli  123 Main Street      52216
2    Fred Putzarelli  456 Pine Street      51389
3    Fred Putzarelli  456 Pine Street  53124-001
4          Harry Cox  789 Vine Street      58343
5          Harry Cox  789 Vine Street      61215
6          Harry Cox  789 Vine Street  51215-001                

Ideally I would like to have one line per address with all of the invoice numbers for that address in the third column, like this:

                Name          Address InvoiceNum
0  Joe Schmuckatelli  123 Main Street      51450, 52216
1    Fred Putzarelli  456 Pine Street      51389, 53124-001
2          Harry Cox  789 Vine Street      58343, 61215, 51215-001

The code I used to merge the data looks like this:

mergedData = pd.merge(complaintData, invoiceData, on='Address', how='left')

Is there a way to do this easily in Pandas or some other way?

CodePudding user response:

We can groupby aggregate the values in df2by joining strings together for each Address before join / merge with df1:

new_df = df1.join(
    df2.groupby('Address')['InvoiceNum'].aggregate(', '.join),
    on='Address',
    how='left'
)

new_df:

                Name          Address               InvoiceNum
0  Joe Schmuckatelli  123 Main Street             51450, 52216
1    Fred Putzarelli  456 Pine Street         51389, 53124-001
2          Harry Cox  789 Vine Street  58343, 61215, 51215-001

*Either join or merge work here, although join has slightly less overhead in this case since the result of groupby has Address as the index.


Setup:

import pandas as pd

df1 = pd.DataFrame({
    'Name': ['Joe Schmuckatelli', 'Fred Putzarelli', 'Harry Cox'],
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street']
})

df2 = pd.DataFrame({
    'Address': ['123 Main Street', '456 Pine Street', '789 Vine Street',
                '123 Main Street', '456 Pine Street', '789 Vine Street',
                '789 Vine Street'],
    'InvoiceNum': ['51450', '51389', '58343', '52216', '53124-001', '61215',
                   '51215-001']
})
  • Related