Home > Software design >  Missing rows when merging pandas dataframes?
Missing rows when merging pandas dataframes?

Time:01-10

I am trying to merge the following two dataframes on the date price and code columns, but there is unexpected loss of rows:

df1.head(5)

    date        price   code       
0   2015-01-03  27.65   534
1   2015-03-30  43.65   231
2   2015-01-09  24.65   132
3   2015-11-13  87.13   211
4   2015-05-12  63.25   231

df2.head(5)

    date        price   code   volume
0   2015-01-03  27.65   534    43
1   2015-03-30  43.65   231    21
2   2015-01-09  24.65   132    88
3   2015-12-25  0.00    211    130
4   2015-03-12  11.15   211    99

df1 is a subset of df2, but without the volume column.

I tried the following code, but half of the rows disappear:

master = pd.merge(df1, df2, on=['date', 'price', 'code'])

I know I can do a left join and keep all the rows, but they will be NaN for volume, which is the entire purpose of doing the join.

I'm not sure why there are missing rows after the join, as df2 contains every single date, with consistent prices and codes as with df1. They also have the same data types, where dates are strings.

df1.dtypes

date                  object
price                 float64
code                  int
dtype: object

df2.dtypes

date                  object
price                 float64
code                  int
volume                int
dtype: object

When I check if a certain value exists in a dataframe, it returns False, which could be a clue as to why they don't join as expected (unless I'm checking wrong):

df2['price'][0]
> 27.65

27.65 in df2['price']
> False

I'm at a loss about what to check next - any ideas? I think there could be some sort of encoding issues or something along those lines.

I also manually checked the dataframes for the rows which weren't able to be joined and the values definitely exist.

Edit: I know it's expected for the rows to be dropped when they do not match on the columns specified in the join. The problem is that rows are dropped even when they do match (or they appear to be).

CodePudding user response:

You say "df1 is a subset of df2, but without the volume column". I suppose you don't mean that their only difference is the "volume" column, which is obviously not the only difference, otherwise the question would be trivial (of just adding the "volume" column).

You just need to add the how='outer' keyword argument.

df = pd.merge(df1, df2, on=['date', 'code', 'price'], how='outer')

CodePudding user response:

With your dataframe, since your column naming is consistent, you can do the following:

master = pd.merge(df1,df2, how='left')

By default, pd.merge() joins on all common columns and performs an inner join. Use how to specify the left join.

CodePudding user response:

The value disappears because row 3 and 4 for column price has different value (Also, row 4 for the code column). merge default behavior is inner therefore those rows will disappear.

  • Related