I have two csv files that I want to merge and retain all columns. I want to match on the common fields 'CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'
. The most important field in the left csv is 'QUANTITY'
and the most important fields in the right csv are Latitude,Longitude
. I have tried using pandas.merge()
with how='outer'
, how='left'
, and how='right'
, but each option winds up giving me NaN
values for at least some of the fields I need. I have also tried using pandas.concat()
, but the values I need to match are not all in the same order, and that just appended the files in the order they were listed, so I had mismatching state names etc.
The files look like the following
refcap22_trimmed.csv:
CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000
...
Petroleum_Refineries2019EIA.csv:
X,Y,FID,COMPANY_NAME,CORPORATION,SITE,STATE_NAME,PADD,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
-91.748001,30.526508,105,ALON REFINING KROTZ SPRINGS INC,DELEK GROUP LTD,Krotz Springs,Louisiana,3,83,36.2,34,,,13,12.2,32,,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,30.526508,-91.748001,1,210.4
...
The desired output would be something like: output.csv:
CORPORATION,SURVEY,PERIOD,COMPANY_NAME,RDIST_LABEL,STATE_NAME,SITE,PADD,PRODUCT,SUPPLY,QUANTITY,X,Y,FID,AD_Mbpd,VDist_Mbpd,CaDis_Mbpd,HyCrk_Mbpd,VRedu_Mbpd,CaRef_Mbpd,Isal_Mbpd,HDS_Mbpd,Cokin_Mbpd,Asph_Mbpd,Source,Period_,Latitude,Longitude,Link_Imports,Process_Sum
DELEK GROUP LTD,820,22,ALON REFINING KROTZ SPRINGS INC,Louisiana Gulf Coast,Louisiana,KROTZ SPRINGS,3,TOTAL OPERABLE CAPACITY,Atmospheric Crude Distillation Capacity (barrels per stream day),83000,-93.88847,29.96046,1,TOTAL SA,3,245,117,80,,,43,15.3,274.5,60,,EIA-820 Refinery Capacity Report,As of Jan. 1 2021,29.96046,-93.88847,1,834.8
...
I have tried the code below:
import pandas as pd
refcap = pd.read_csv('../refcap22_trimmed.csv')
eia2019 = pd.read_csv('../Petroleum_Refineries2019EIA.csv')
output = pd.merge(refcap, eia2019, on=['CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'SITE', 'PADD'], how='right')
print(output.iloc[1])
and got the following result:
CORPORATION DELEK GROUP LTD
SURVEY 820.0
PERIOD 22.0
COMPANY_NAME ALON USA ENERGY INC
RDIST_LABEL Texas Inland
STATE_NAME Texas
SITE BIG SPRING
PADD 3.0
PRODUCT TOTAL OPERABLE CAPACITY
SUPPLY Atmospheric Crude Distillation Capacity (barre...
QUANTITY 74000.0
X NaN
Y NaN
FID NaN
AD_Mbpd NaN
VDist_Mbpd NaN
CaDis_Mbpd NaN
HyCrk_Mbpd NaN
VRedu_Mbpd NaN
CaRef_Mbpd NaN
Isal_Mbpd NaN
HDS_Mbpd NaN
Cokin_Mbpd NaN
Asph_Mbpd NaN
Source NaN
Period_ NaN
Latitude NaN
Longitude NaN
Link_Imports NaN
Process_Sum NaN
Name: 1, dtype: object
But the result I need would look something like:
CORPORATION DELEK GROUP LTD
SURVEY 820.0
PERIOD 22.0
COMPANY_NAME ALON USA ENERGY INC
RDIST_LABEL Texas Inland
STATE_NAME Texas
SITE BIG SPRING
PADD 3.0
PRODUCT TOTAL OPERABLE CAPACITY
SUPPLY Atmospheric Crude Distillation Capacity (barre...
QUANTITY 74000.0
X -91.748001
Y 30.526508
FID 105
AD_Mbpd 83
VDist_Mbpd 36.2
CaDis_Mbpd 34
HyCrk_Mbpd NaN
VRedu_Mbpd NaN
CaRef_Mbpd 13
Isal_Mbpd 12.2
HDS_Mbpd 32
Cokin_Mbpd NaN
Asph_Mbpd NaN
Source EIA-820 RefCap Report
Period_ As of Jan. 1 2021
Latitude 30.526508
Longitude -91.748001
Link_Imports 1
Process_Sum NaN
Name: 1, dtype: object
How do I accomplish the desired output?
CodePudding user response:
The values in the fields you merge on must be equal to achieve the expected consolidation. In the sample data you provide, the SITE field between the two are not equal (KROTZ SPRINGS vs. Krotz Springs), so it appears as a separate record to the merge function and is preserved as such.
outdf = df1.merge(df2, how='outer', on=['CORPORATION', 'COMPANY_NAME', 'STATE_NAME', 'PADD'], suffixes=('_df1','_df2'))
Omitting the 'SITE' key, yields the desired result with suffixes '_df1' and '_df2' indicating the overlapping columns.
| | 0 |
|:-------------|:-----------------------------------------------------------------|
| CORPORATION | DELEK GROUP LTD |
| SURVEY | 820 |
| PERIOD | 22 |
| COMPANY_NAME | ALON REFINING KROTZ SPRINGS INC |
| RDIST_LABEL | Louisiana Gulf Coast |
| STATE_NAME | Louisiana |
| SITE_df1 | KROTZ SPRINGS |
| PADD | 3 |
| PRODUCT | TOTAL OPERABLE CAPACITY |
| SUPPLY | Atmospheric Crude Distillation Capacity (barrels per stream day) |
| QUANTITY | 83000 |
| X | -91.748001 |
| Y | 30.526508 |
| FID | 105 |
| SITE_df2 | Krotz Springs |
| AD_Mbpd | 83 |
| VDist_Mbpd | 36.2 |
| CaDis_Mbpd | 34 |
| HyCrk_Mbpd | nan |
| VRedu_Mbpd | nan |
| CaRef_Mbpd | 13 |
| Isal_Mbpd | 12.2 |
| HDS_Mbpd | 32 |
| Cokin_Mbpd | nan |
| Asph_Mbpd | nan |
| Source | EIA-820 Refinery Capacity Report |
| Period_ | As of Jan. 1 2021 |
| Latitude | 30.526508 |
| Longitude | -91.748001 |
| Link_Imports | 1 |
| Process_Sum | 210.4 |
It's good practice to clean up your raw data before combining, especially merge keys, if you believe it could be inconsistent. Something simple in this case would be..
#read in data
#inspect data
#clean up data
fields=['CORPORATION', 'COMPANY_NAME', 'SITE', 'STATE_NAME']
for field in fields:
df1[field] = df1[field].str.strip()
df2[field] = df2[field].str.strip()
df1[field] = df1[field].str.casefold()
df2[field] = df2[field].str.casefold()
#merge clean data