Joining multiple spreadsheets and appending data in rows when duplicates exist, using python


I am trying to join certain fields from multiple .csv files together based on a key field. However, when the key has duplicate rows, I want the extra information appended to the existing row's fields rather than producing duplicate rows.

[Image: example of the DataFrame data and the desired results]

Does anyone know a method of doing this?

This is an example of the current code I have, which works to use specified fields and join data, but leads to duplicate entries:

import pandas as pd

DF1 = pd.read_csv('facilities.csv', header=0, dtype=object)
DF2 = pd.read_csv('permits.csv', header=0, dtype=object)
DF3 = pd.read_csv('programs.csv', header=0, dtype=object)


# Select only necessary columns from CSVs
DF1_reduc = DF1[['ID','FACILITY_TYPE_CODE','FACILITY_NAME','LOCATION_ADDRESS']]
DF2_reduc = DF2[['ID','ACTIVITY_ID','PERMIT_NAME','PERMIT_STATUS_CODE']]
DF3_reduc = DF3[['ID','PROG_CODE']]


# Join all tables together on the ID index
joined_tables = [DF1_reduc, DF2_reduc, DF3_reduc]
joined_tables = [table.set_index('ID') for table in joined_tables]
joined_tables = joined_tables[0].join(joined_tables[1:])
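For concreteness, the duplication described above can be reproduced on a toy pair of frames (the data here is illustrative, not the real CSVs): a one-to-many join repeats the facility row once per matching permit row.

```python
import pandas as pd

# One facility row that matches two permit rows on the ID index
facilities = pd.DataFrame({
    'ID': ['04R1'],
    'FACILITY_NAME': ['Facility 1'],
}).set_index('ID')

permits = pd.DataFrame({
    'ID': ['04R1', '04R1'],
    'PERMIT_NAME': ['PERMIT 1', 'permit1'],
}).set_index('ID')

joined = facilities.join(permits)
print(len(joined))  # 2 -> the facility row is duplicated, one copy per permit
```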

CodePudding user response:

import pandas as pd

# df is the joined table from the question, with ID as a regular column
df['ACTIVITY_ID'] = df['ACTIVITY_ID'].astype(str)

PERMIT_NAME = []
PERMIT_STATUS_CODE = []
ACTIVITY_ID = []

# Collect the unique values for each ID as comma-separated strings
for ID in df['ID'].unique():
    subset = df[df['ID'] == ID]
    PERMIT_NAME.append(', '.join(subset['PERMIT_NAME'].unique()))
    PERMIT_STATUS_CODE.append(', '.join(subset['PERMIT_STATUS_CODE'].unique()))
    ACTIVITY_ID.append(', '.join(subset['ACTIVITY_ID'].unique()))

# Keep one row per ID, then attach the aggregated columns
zz = df.drop(['PERMIT_NAME', 'PERMIT_STATUS_CODE', 'ACTIVITY_ID'], axis=1).drop_duplicates()

zz['PERMIT_NAME'] = PERMIT_NAME
zz['PERMIT_STATUS_CODE'] = PERMIT_STATUS_CODE
zz['ACTIVITY_ID'] = ACTIVITY_ID

The idea here is to take your final output and loop through each subset on the ID, joining the unique values into strings so they collapse into the single value per row that you requested. If you want an array instead of a string, remove the `", ".join(...)` call.
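A minimal, self-contained sketch of this loop on a toy frame (one duplicated ID, one aggregated column; the data is illustrative):

```python
import pandas as pd

# Toy joined table with a duplicated ID, mirroring the situation above
df = pd.DataFrame({
    'ID': ['04R1', '04R1', '05R2'],
    'PERMIT_NAME': ['PERMIT 1', 'permit1', 'PERMIT 2'],
})

PERMIT_NAME = []
for ID in df['ID'].unique():
    subset = df[df['ID'] == ID]
    PERMIT_NAME.append(', '.join(subset['PERMIT_NAME'].unique()))

# One row per ID, then attach the aggregated column
zz = df.drop(['PERMIT_NAME'], axis=1).drop_duplicates()
zz['PERMIT_NAME'] = PERMIT_NAME

print(zz['PERMIT_NAME'].tolist())  # ['PERMIT 1, permit1', 'PERMIT 2']
```

This relies on `df['ID'].unique()` and `drop_duplicates()` returning the IDs in the same (first-appearance) order, which is what makes the positional column assignment line up.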

ID    FACILITY_TYPE_CODE  FACILITY_NAME  LOCATION_ADDRESS    PROG_CODE  PERMIT_NAME        PERMIT_STATUS_CODE  ACTIVITY_ID
04R1  GAB                 Facility 1     HIGHWAY 1 E         ABC        PERMIT 1, permit1  A, C                1111, 1234
05R2  GAB                 Facility 2     1200 MOUNTAIN ROAD  ABC        PERMIT 2           B                   1111
05R7  VOR                 Facility 3     500 MARSH PASS                 PERMIT 3           A, C                2000, 1234
0K09  FOP                 Facility 4     67 SEA LANE                    permit4            C                   1111

CodePudding user response:

Group by the unique columns and use agg() to combine rows:

df = df1.join([df2, df3])

df = df.groupby(['ID'
                ,'FACILITY_TYPE_CODE'
                ,'FACILITY_NAME'
                ,'LOCATION_ADDRESS']) \
       .agg(lambda s: ', '.join(s.fillna('')
                                 .unique()
                                 .astype('str')))

# Drop index for concise output.
print(df.reset_index(drop=True))
#   ACTIVITY_ID        PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  1111, 1234  PERMIT 1, permit1               A, C       ABC
# 1        1111           PERMIT 2                  B       ABC
# 2  2000, 1234           PERMIT 3               A, C          
# 3        1111            permit4                  C          

Or, more simply, if you want to collect the values into sets:

df = df1.join([df2, df3])

df = df.groupby(['ID'
                ,'FACILITY_TYPE_CODE'
                ,'FACILITY_NAME'
                ,'LOCATION_ADDRESS']) \
       .agg(set)

# Drop index for concise output.
print(df.reset_index(drop=True))
#     ACTIVITY_ID          PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0  {1234, 1111}  {PERMIT 1, permit1}             {A, C}     {ABC}
# 1        {1111}           {PERMIT 2}                {B}     {ABC}
# 2  {2000, 1234}           {PERMIT 3}             {A, C}     {nan}
# 3        {1111}            {permit4}                {C}     {nan}

For more reading: https://pandas.pydata.org/docs/user_guide/groupby.html
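`agg()` also accepts a dict mapping column names to aggregation functions, if you want a different combination rule per column. A sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['04R1', '04R1'],
    'PERMIT_NAME': ['PERMIT 1', 'permit1'],
    'ACTIVITY_ID': ['1111', '1234'],
})

# Per-column aggregation: join names into one string, collect IDs as a set
out = df.groupby('ID').agg({
    'PERMIT_NAME': ', '.join,
    'ACTIVITY_ID': set,
})

print(out.loc['04R1', 'PERMIT_NAME'])  # PERMIT 1, permit1
```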


Sample data:

import io
import pandas as pd

facilities = io.StringIO("""
ID,FACILITY_TYPE_CODE,FACILITY_NAME,LOCATION_ADDRESS
04R1,GAB,Facility 1,HIGHWAY 1 E
05R2,GAB,Facility 2,1200 MOUNTAIN ROAD
05R7,VOR,Facility 3,500 MARSH PASS
0K09,FOP,Facility 4,67 SEA LANE
""")

permits = io.StringIO("""
ID,ACTIVITY_ID,PERMIT_NAME,PERMIT_STATUS_CODE
04R1,1111,PERMIT 1,A
04R1,1234,permit1,C
05R2,1111,PERMIT 2,B
05R7,2000,PERMIT 3,A
05R7,1234,PERMIT 3,C
0K09,1111,permit4,C
""")

programs = io.StringIO("""
ID,PROG_CODE
04R1,ABC
05R2,ABC
05R7,
0K09,
""")

df1 = pd.read_csv(facilities, index_col='ID')
df2 = pd.read_csv(permits, index_col='ID')
df3 = pd.read_csv(programs, index_col='ID')