I am trying to merge the following two dataframes on the date
price
and code
columns, but there is unexpected loss of rows:
df1.head(5)
date price code
0 2015-01-03 27.65 534
1 2015-03-30 43.65 231
2 2015-01-09 24.65 132
3 2015-11-13 87.13 211
4 2015-05-12 63.25 231
df2.head(5)
date price code volume
0 2015-01-03 27.65 534 43
1 2015-03-30 43.65 231 21
2 2015-01-09 24.65 132 88
3 2015-12-25 0.00 211 130
4 2015-03-12 11.15 211 99
df1
is a subset of df2
, but without the volume
column.
I tried the following code, but half of the rows disappear:
master = pd.merge(df1, df2, on=['date', 'price', 'code'])
I know I can do a left join and keep all the rows, but they will be NaN for volume
, which is the entire purpose of doing the join.
I'm not sure why there are missing rows after the join, as df2
contains every single date, with consistent prices and codes as with df1
. They also have the same data types, where dates are strings.
df1.dtypes
date object
price float64
code int
dtype: object
df2.dtypes
date object
price float64
code int
volume int
dtype: object
When I check if a certain value exists in a dataframe, it returns False, which could be a clue as to why they don't join as expected (unless I'm checking wrong):
df2['price'][0]
> 27.65
27.65 in df2['price']
> False
I'm at a loss about what to check next - any ideas? I think there could be some sort of encoding issues or something along those lines.
I also manually checked the dataframes for the rows which weren't able to be joined and the values definitely exist.
Edit: I know it's expected for the rows to be dropped when they do not match on the columns specified in the join. The problem is that rows are dropped even when they do match (or they appear to be).
CodePudding user response:
You say "df1 is a subset of df2, but without the volume column". I suppose you don't mean that their only difference is the "volume" column, which is obviously not the only difference, otherwise the question would be trivial (of just adding the "volume" column).
You just need to add the how='outer' keyword argument.
df = pd.merge(df1, df2, on=['date', 'code', 'price'], how='outer')
CodePudding user response:
With your dataframe, since your column naming is consistent, you can do the following:
master = pd.merge(df1,df2, how='left')
By default, pd.merge()
joins on all common columns and performs an inner join. Use how
to specify the left join.
CodePudding user response:
The value disappears because row 3 and 4 for column price
has different value (Also, row 4 for the code
column). merge
default behavior is inner
therefore those rows will disappear.