How do I merge two CSV files and retain all columns and rows with Python/pandas?-CodePudding

I have two csv files that I want to merge and retain all columns. I want to match on the common fields 'CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'. The most important field in the left csv is 'QUANTITY' and the most important fields in the right csv are Latitude,Longitude. I have tried using pandas.merge() with how='outer', how='left', and how='right', but each option winds up giving me NaN values for at least some of the fields I need. I have also tried using pandas.concat(), but the values I need to match are not all in the same order, and that just appended the files in the order they were listed, so I had mismatching state names etc.

The files look like the following

refcap22_trimmed.csv:

CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000
...

Petroleum_Refineries2019EIA.csv:

X,Y,FID,COMPANY_NAME,CORPORATION,SITE,STATE_NAME,PADD,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
-91.748001,30.526508,105,ALON REFINING KROTZ SPRINGS INC,DELEK GROUP LTD,Krotz Springs,Louisiana,3,83,36.2,34,,,13,12.2,32,,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,30.526508,-91.748001,1,210.4
...

The desired output would be something like: output.csv:

CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY,X,Y,FID,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000,-93.88847,29.96046,1,TOTAL SA,3,245,117,80,,,43,15.3,274.5,60,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,29.96046,-93.88847,1,834.8
...

I have tried the code below:

import pandas as pd

refcap = pd.read_csv('../refcap22_trimmed.csv')
eia2019 = pd.read_csv('../Petroleum_Refineries2019EIA.csv')

output = pd.merge(refcap, eia2019, on=['CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'], how='right')

print(output.iloc[1])

and got the following result:

CORPORATION                                       DELEK GROUP LTD
SURVEY                                                      820.0
PERIOD                                                       22.0
COMPANY_NAME                                  ALON USA ENERGY INC
RDIST_LABEL                                          Texas Inland
STATE_NAME                                                  Texas
SITE                                                   BIG SPRING
PADD                                                          3.0
PRODUCT                                   TOTAL OPERABLE CAPACITY
SUPPLY          Atmospheric Crude Distillation Capacity (barre...
QUANTITY                                                  74000.0
X                                                             NaN
Y                                                             NaN
FID                                                           NaN
AD_Mbpd                                                       NaN
VDist_Mbpd                                                    NaN
CaDis_Mbpd                                                    NaN
HyCrk_Mbpd                                                    NaN
VRedu_Mbpd                                                    NaN
CaRef_Mbpd                                                    NaN
Isal_Mbpd                                                     NaN
HDS_Mbpd                                                      NaN
Cokin_Mbpd                                                    NaN
Asph_Mbpd                                                     NaN
Source                                                        NaN
Period_                                                       NaN
Latitude                                                      NaN
Longitude                                                     NaN
Link_Imports                                                  NaN
Process_Sum                                                   NaN
Name: 1, dtype: object

But the result I need would look something like:

CORPORATION                                       DELEK GROUP LTD
SURVEY                                                      820.0
PERIOD                                                       22.0
COMPANY_NAME                                  ALON USA ENERGY INC
RDIST_LABEL                                          Texas Inland
STATE_NAME                                                  Texas
SITE                                                   BIG SPRING
PADD                                                          3.0
PRODUCT                                   TOTAL OPERABLE CAPACITY
SUPPLY          Atmospheric Crude Distillation Capacity (barre...
QUANTITY                                                  74000.0
X                                                             -91.748001
Y                                                             30.526508
FID                                                           105
AD_Mbpd                                                       83
VDist_Mbpd                                                    36.2
CaDis_Mbpd                                                    34
HyCrk_Mbpd                                                    NaN
VRedu_Mbpd                                                    NaN
CaRef_Mbpd                                                    13
Isal_Mbpd                                                     12.2
HDS_Mbpd                                                      32
Cokin_Mbpd                                                    NaN
Asph_Mbpd                                                     NaN
Source                                                        EIA-820 RefCap Report
Period_                                                       As of Jan. 1 2021
Latitude                                                      30.526508
Longitude                                                     -91.748001
Link_Imports                                                  1
Process_Sum                                                   NaN
Name: 1, dtype: object

How do I accomplish the desired output?

CodePudding user response：

The values in the fields you merge on must be equal to achieve the expected consolidation. In the sample data you provide, the SITE field between the two are not equal (KROTZ SPRINGS vs. Krotz Springs), so it appears as a separate record to the merge function and is preserved as such.

outdf = df1.merge(df2, how='outer', on=['CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'PADD'], suffixes=('_df1','_df2'))

Omitting the 'SITE' key, yields the desired result with suffixes '_df1' and '_df2' indicating the overlapping columns.

|              | 0                                                                |
|:-------------|:-----------------------------------------------------------------|
| CORPORATION  | DELEK GROUP LTD                                                  |
| SURVEY       | 820                                                              |
| PERIOD       | 22                                                               |
| COMPANY_NAME | ALON REFINING KROTZ SPRINGS INC                                  |
| RDIST_LABEL  | Louisiana Gulf Coast                                             |
| STATE_NAME   | Louisiana                                                        |
| SITE_df1     | KROTZ SPRINGS                                                    |
| PADD         | 3                                                                |
| PRODUCT      | TOTAL OPERABLE CAPACITY                                          |
| SUPPLY       | Atmospheric Crude Distillation Capacity (barrels per stream day) |
| QUANTITY     | 83000                                                            |
| X            | -91.748001                                                       |
| Y            | 30.526508                                                        |
| FID          | 105                                                              |
| SITE_df2     | Krotz Springs                                                    |
| AD_Mbpd      | 83                                                               |
| VDist_Mbpd   | 36.2                                                             |
| CaDis_Mbpd   | 34                                                               |
| HyCrk_Mbpd   | nan                                                              |
| VRedu_Mbpd   | nan                                                              |
| CaRef_Mbpd   | 13                                                               |
| Isal_Mbpd    | 12.2                                                             |
| HDS_Mbpd     | 32                                                               |
| Cokin_Mbpd   | nan                                                              |
| Asph_Mbpd    | nan                                                              |
| Source       | EIA-820 Refinery Capacity Report                                 |
| Period_      | As of Jan. 1 2021                                                |
| Latitude     | 30.526508                                                        |
| Longitude    | -91.748001                                                       |
| Link_Imports | 1                                                                |
| Process_Sum  | 210.4                                                            |

It's good practice to clean up your raw data before combining, especially merge keys, if you believe it could be inconsistent. Something simple in this case would be..

#read in data
#inspect data    

#clean up data 
fields=['CORPORATION', 'COMPANY_NAME', 'SITE', 'STATE_NAME']

for field in fields:
    df1[field] = df1[field].str.strip()
    df2[field] = df2[field].str.strip()
    df1[field] = df1[field].str.casefold()
    df2[field] = df2[field].str.casefold()

#merge clean data