Home > OS >  How to merge multiple data frames?
How to merge multiple data frames?

Time:11-06

I'm looking to merge multiple data frames.

The main data frame consist of list of postal codes, and looks like so: enter image description here

Then I have multiple csv files with all post codes with its areas, which named subsequently (area0.csv, area1.csv, and etc. up to area26.csv) and each file looks like so: enter image description here

I'm aiming to build a new data frame that will consist area for each post code from data frame 1, and will look like so: enter image description here

CodePudding user response:

Aren't you just looking for merge?

Example

maindf=pd.DataFrame({'LocationPostCode':['BB2 6PS', 'LE11 2ER', 'H91 WR26']})
csv1=pd.DataFrame({'pcd':['WV3 0ZQ', 'LE11 2ER'], 'ru11ind':['A1', 'A1']}) # I assume you know how to read a csv, but this dataframe is reproducible
csv2=pd.DataFrame({'pcd':['A21 CV51', 'H91 WR26'], 'ru11ind':['A2', 'A3']})

# At first we know no 'ru11ind'
maindf['ru11ind']=pd.NA

# Merge aka build a dataframe from maindf and csv1, such as LocationPostCode of one is pcd of other. 
# The new dataframe will have 2 `ru11ind_x` and `ru11ind_y` columns, depending on where that `ru11ind` comes frome (from maindf or from csv1)
# With `how='left'` the new dataframe has the same rows as the old one, with `NA` in field `ru11ind_y` if a row of `maindf` matches no row of `csv1`
maindf=maindf.merge(csv1, left_on='LocationPostCode', right_on='pcd', how='left')
# Now maindf is
#  LocationPostCode ru11ind_x       pcd ru11ind_y
#0          BB2 6PS       NaN       NaN       NaN
#1         LE11 2ER       NaN  LE11 2ER        A1
#2         H91 WR26       NaN       NaN       NaN

# ru11ind_x is the column holding what ru11ind we knew before in maindf. Rename it
maindf.rename({'ru11ind_x':'ru11ind'}, axis=1, inplace=True)
# Now maindf is
#  LocationPostCode ru11ind       pcd ru11ind_y
#0          BB2 6PS     NaN       NaN       NaN
#1         LE11 2ER     NaN  LE11 2ER        A1
#2         H91 WR26     NaN       NaN       NaN

# What are the rows for which we learned something new?
f=maindf['ru11ind'].isna() & ~(maindf['ru11ind_y'].isna())

# For those rows, we copy `ru11ind_y` (the `ru11ind` from csv1) to `ru11ind` column
maindf.loc[f, 'ru11ind']=maindf[f]['ru11ind_y']
# Now maindf is
#  LocationPostCode ru11ind       pcd ru11ind_y
#0          BB2 6PS     NaN       NaN       NaN
#1         LE11 2ER      A1  LE11 2ER        A1
#2         H91 WR26     NaN       NaN       NaN

# We don't need neither `pcd` columns (which is just identical to LocationPostCode where it exists), nor `ru11ind_y`, from which we've copied anything interesting in `ru11ind`
maindf.drop(columns=['pcd', 'ru11ind_y'], inplace=True)
# Now maindf is
#  LocationPostCode ru11ind
#0          BB2 6PS     NaN
#1         LE11 2ER      A1
#2         H91 WR26     NaN


# Let's redo that with csv2
maindf=maindf.merge(csv2, left_on='LocationPostCode', right_on='pcd', how='left')
#  LocationPostCode ru11ind_x       pcd ru11ind_y
#0          BB2 6PS       NaN       NaN       NaN
#1         LE11 2ER        A1       NaN       NaN
#2         H91 WR26       NaN  H91 WR26        A3

maindf.rename({'ru11ind_x':'ru11ind'}, axis=1, inplace=True)
#  LocationPostCode ru11ind       pcd ru11ind_y
#0          BB2 6PS     NaN       NaN       NaN
#1         LE11 2ER      A1       NaN       NaN
#2         H91 WR26     NaN  H91 WR26        A3

f=maindf['ru11ind'].isna() & ~(maindf['ru11ind_y'].isna())
maindf.loc[f, 'ru11ind']=maindf[f]['ru11ind_y']
#  LocationPostCode ru11ind       pcd ru11ind_y
#0          BB2 6PS     NaN       NaN       NaN
#1         LE11 2ER      A1       NaN       NaN
#2         H91 WR26      A3  H91 WR26        A3

maindf.drop(columns=['pcd', 'ru11ind_y'], inplace=True)
#  LocationPostCode ru11ind
#0          BB2 6PS     NaN
#1         LE11 2ER      A1
#2         H91 WR26      A3

Obvioulsy, the first csv could be dealt with differently, since there is nothing interesting at first in ru11ind (so we could just merge without a ru11ind column, and that would create a brand new one with either NA or what is in csv1)

But the idea, clearly, is that this should be done iteratively over all csv. So, it is easier if we don't need to process the first csv differently to the others. In other words, my initial maindf['ru11ind']=pd.NA is what is called a sentinel (a dummy variable whose unique raison d'être is to ensure that first loop is not different than the other).

I let the rest (how to get content of csv, which are not part of the question, and how to do that iteratively) to you

CodePudding user response:

Here is where I got to:

# reading 'locations' file
df_loc = pd.read_excel('C:/Users/User/Desktop/Lloyds Pharma/locations.xlsx')

# saving results location
results_loc = 'C:/Users/User/Desktop/Lloyds Pharma/tmp_chunks/'

# OS folder path to chunck files for national statistics dataset
dir_path = r'C:\\Users\\User\\Desktop\\Lloyds Pharma\\tmp_chunks\\'

# list to store chunck files for national statistics dataset
files_list = []

# iterate directory
for path in os.listdir(dir_path):
    # check if current path is a file
    if os.path.isfile(os.path.join(dir_path, path)):
        files_list.append(path)


df_loc.to_csv(results_loc   'res.csv', index=False)

for i in files_list:
    
    res = pd.read_csv(results_loc   'res.csv')
    df = pd.read_csv(results_loc   i)

    inner_join = pd.merge(res,
                          df[['pcd','ru11ind']], 
                          left_on ='LocationPostCode (72 blanks, 43 duplicates, 1477 uniques)',
                          right_on ='pcd',
                          how ='left')

    inner_join.to_csv(results_loc   'res.csv', index=False)

But as expected, this results into multiple pcd's and ru11ind's, and looks like this:

enter image description here

How do I get all results in a single pcd, ru11ind columns? Obviously, pcd can be dropped as it is not required.

  • Related