I am trying to join fields from multiple .CSV files together based on a key field. However, when the key matches multiple rows, I want to append the additional values to the existing field instead of producing duplicate rows.
[Image: example of the DF data and the desired results]
Does anyone know a method of doing this?
This is an example of my current code. It selects the specified fields and joins the data, but it produces duplicate entries:
import pandas as pd

DF1 = pd.read_csv('facilities.csv', header=0, dtype=object)
DF2 = pd.read_csv('permits.csv', header=0, dtype=object)
DF3 = pd.read_csv('programs.csv', header=0, dtype=object)

# Select only the necessary columns from each CSV
DF1_reduc = DF1[['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS']]
DF2_reduc = DF2[['ID', 'ACTIVITY_ID', 'PERMIT_NAME', 'PERMIT_STATUS_CODE']]
DF3_reduc = DF3[['ID', 'PROG_CODE']]

# Join all tables together on ID
joined_tables = [DF1_reduc, DF2_reduc, DF3_reduc]
joined_tables = [table.set_index('ID') for table in joined_tables]
joined_tables = joined_tables[0].join(joined_tables[1:])
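To illustrate the duplication: joining on a key that appears more than once on the right-hand side repeats the left row once per match. A minimal sketch with made-up data:

import pandas as pd

# A one-to-many join repeats the left row once per matching right row.
left = pd.DataFrame({'FACILITY_NAME': ['Facility 1']},
                    index=pd.Index(['04R1'], name='ID'))
right = pd.DataFrame({'PERMIT_NAME': ['PERMIT 1', 'permit1']},
                     index=pd.Index(['04R1', '04R1'], name='ID'))
print(left.join(right))
#      FACILITY_NAME PERMIT_NAME
# ID
# 04R1    Facility 1    PERMIT 1
# 04R1    Facility 1     permit1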
CodePudding user response:
import pandas as pd

# df is the joined table from the question, with ID as a regular
# column (reset_index() first if ID is the index).
PERMIT_NAME = []
PERMIT_STATUS_CODE = []
ACTIVITY_ID = []

# Make sure activity IDs are strings so they can be comma-joined
df['ACTIVITY_ID'] = df['ACTIVITY_ID'].astype(str)

# For each ID, collect the unique values of each column into one string
for ID in df.ID.unique():
    subset = df[df['ID'] == ID]
    PERMIT_NAME.append(', '.join(subset['PERMIT_NAME'].unique()))
    PERMIT_STATUS_CODE.append(', '.join(subset['PERMIT_STATUS_CODE'].unique()))
    ACTIVITY_ID.append(', '.join(subset['ACTIVITY_ID'].unique()))

# Collapse to one row per ID, then attach the combined columns
zz = df.drop(['PERMIT_NAME', 'PERMIT_STATUS_CODE', 'ACTIVITY_ID'], axis=1).drop_duplicates()
zz['PERMIT_NAME'] = PERMIT_NAME
zz['PERMIT_STATUS_CODE'] = PERMIT_STATUS_CODE
zz['ACTIVITY_ID'] = ACTIVITY_ID
The idea here is to take your final joined output and loop through the subset of rows for each ID, joining the unique codes into a single comma-separated string so each field holds the singular value you requested. If you would rather have an array, drop the ", ".join() and keep the list instead, as sketched after the table below.
ID    FACILITY_TYPE_CODE  FACILITY_NAME  LOCATION_ADDRESS    PROG_CODE  PERMIT_NAME        PERMIT_STATUS_CODE  ACTIVITY_ID
04R1  GAB                 Facility 1     HIGHWAY 1 E         ABC        PERMIT 1, permit1  A, C                1111, 1234
05R2  GAB                 Facility 2     1200 MOUNTAIN ROAD  ABC        PERMIT 2           B                   1111
05R7  VOR                 Facility 3     500 MARSH PASS                 PERMIT 3           A, C                2000, 1234
0K09  FOP                 Facility 4     67 SEA LANE                    permit4            C                   1111
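A sketch of that array variant, using a tiny hypothetical stand-in for the joined df:

import pandas as pd

# Stand-in for the joined table above (made-up minimal data).
df = pd.DataFrame({'ID': ['04R1', '04R1'],
                   'PERMIT_NAME': ['PERMIT 1', 'permit1']})

# Same loop as above, but keep the unique values as a list
# instead of joining them into one string.
PERMIT_NAME = []
for ID in df.ID.unique():
    subset = df[df['ID'] == ID]
    PERMIT_NAME.append(list(subset['PERMIT_NAME'].unique()))

print(PERMIT_NAME)  # [['PERMIT 1', 'permit1']]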
CodePudding user response:
Group by the unique columns and use agg() to combine rows:
df = df1.join([df2, df3])
df = (df.groupby(['ID',
                  'FACILITY_TYPE_CODE',
                  'FACILITY_NAME',
                  'LOCATION_ADDRESS'])
        .agg(lambda s: ', '.join(s.fillna('').unique().astype(str))))
# Drop index for concise output.
print(df.reset_index(drop=True))
# ACTIVITY_ID PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0 1111, 1234 PERMIT 1, permit1 A, C ABC
# 1 1111 PERMIT 2 B ABC
# 2 2000, 1234 PERMIT 3 A, C
# 3 1111 permit4 C
Or, more simply, if you want to group the values into sets:
df = df1.join([df2, df3])
df = (df.groupby(['ID',
                  'FACILITY_TYPE_CODE',
                  'FACILITY_NAME',
                  'LOCATION_ADDRESS'])
        .agg(set))
# Drop index for concise output.
print(df.reset_index(drop=True))
# ACTIVITY_ID PERMIT_NAME PERMIT_STATUS_CODE PROG_CODE
# 0 {1234, 1111} {PERMIT 1, permit1} {A, C} {ABC}
# 1 {1111} {PERMIT 2} {B} {ABC}
# 2 {2000, 1234} {PERMIT 3} {A, C} {nan}
# 3 {1111} {permit4} {C} {nan}
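agg() also accepts a per-column mapping if different columns need different treatment. A sketch mixing both styles, using the same df1/df2/df3 as the sample data below (note that columns not named in the mapping are dropped):

df = df1.join([df2, df3])
df = (df.groupby(['ID', 'FACILITY_TYPE_CODE', 'FACILITY_NAME', 'LOCATION_ADDRESS'])
        .agg({'PERMIT_NAME': lambda s: ', '.join(s.unique()),  # comma-joined string
              'PERMIT_STATUS_CODE': set}))                     # set of codes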
For more reading: https://pandas.pydata.org/docs/user_guide/groupby.html
Sample data:
import io
import pandas as pd
facilities = io.StringIO("""
ID,FACILITY_TYPE_CODE,FACILITY_NAME,LOCATION_ADDRESS
04R1,GAB,Facility 1,HIGHWAY 1 E
05R2,GAB,Facility 2,1200 MOUNTAIN ROAD
05R7,VOR,Facility 3,500 MARSH PASS
0K09,FOP,Facility 4,67 SEA LANE
""")
permits = io.StringIO("""
ID,ACTIVITY_ID,PERMIT_NAME,PERMIT_STATUS_CODE
04R1,1111,PERMIT 1,A
04R1,1234,permit1,C
05R2,1111,PERMIT 2,B
05R7,2000,PERMIT 3,A
05R7,1234,PERMIT 3,C
0K09,1111,permit4,C
""")
programs = io.StringIO("""
ID,PROG_CODE
04R1,ABC
05R2,ABC
05R7,
0K09,
""")
df1 = pd.read_csv(facilities, index_col='ID')
df2 = pd.read_csv(permits, index_col='ID')
df3 = pd.read_csv(programs, index_col='ID')
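For reference, the raw join of these sample frames shows the duplicated rows the question describes; the groupby above collapses them back to one row per facility:

df = df1.join([df2, df3])
print(df.index.tolist())
# ['04R1', '04R1', '05R2', '05R7', '05R7', '0K09']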