Why does pandas.DataFrame.merge return dataframes with different column types than the input datafra-CodePudding

Slightly expanding the Example 1: Merge on Multiple Columns with Different Names, results in the following Python code using Pandas pandas.DataFrame.merge:

# Create and view first DataFrame:
df1 = pd.DataFrame({'a1': [0, 0, 1, 1, 2],
                    'b': [0, 0, 1, 1, 1],
                    'c': [11, 8, 10, 6, 6]})

print(df1)
print(df1.dtypes)
print()

# Create and view second DataFrame:
df2 = pd.DataFrame({'a2': [0, 1, 1, 1, 3],
                    'b': [0, 0, 0, 1, 1],
                    'd': [22, 24, 25, 33, 37]})
print(df2)
print(df2.dtypes)
print()

# Merge df1 and df2:
df_merge = pd.merge(df1, df2, how='left', left_on=['a1', 'b'], right_on=['a2', 'b'])
print(df_merge)
print(df_merge.dtypes)

The resulting output (I've added line numbers):

 1         a1  b   c
 2      0   0  0  11
 3      1   0  0   8
 4      2   1  1  10
 5      3   1  1   6
 6      4   2  1   6
 7      a1    int64
 8      b     int64
 9      c     int64
10      dtype: object
11      
12         a2  b   d
13      0   0  0  22
14      1   1  0  24
15      2   1  0  25
16      3   1  1  33
17      4   3  1  37
18      a2    int64
19      b     int64
20      d     int64
21      dtype: object
22      
23         a1  b   c         a2          d
24      0   0  0  11       0.00      22.00
25      1   0  0   8       0.00      22.00
26      2   1  1  10       1.00      33.00
27      3   1  1   6       1.00      33.00
28      4   2  1   6        NaN        NaN
29      a1      int64
30      b       int64
31      c       int64
32      a2    float64
33      d     float64
34      dtype: object

Notice the type of a2 and d columns in the resulting df_merge dataframe on lines 24 through 27 have changed from the original int64 to float64. Why would it need to change the types?

Even the example in the manual at df1.merge(df2, how='left', on='a') shows a 3.0 where I would have expected it to stay an int64:

df1.merge(df2, how='left', on='a')
      a  b  c
0   foo  1  3.0
1   bar  2  NaN

But it doesn't explain why. I see How to left merge two dataframes with nan without changing types from integer to float types indicates that NaN's seem to be a factor, but doesn't answer my specific question here as to why the type conversion happens.

If I change df1 to remove the last row:

df1 = pd.DataFrame({'a1': [0,  0,  1, 1, ],
                    'b':  [0,  0,  1, 1, ],
                    'c':  [11, 8, 10, 6, ]})

Then the output becomes what I would expect:

   a1  b   c  a2   d
0   0  0  11   0  22
1   0  0   8   0  22
2   1  1  10   1  33
3   1  1   6   1  33
a1    int64
b     int64
c     int64
a2    int64
d     int64
dtype: object

CodePudding user response：

The reason for this is that NaN is of type float.

import numpy as np

print(type(np.nan))
<class 'float'>

There are good reasons why this is the case which I can explain in comments if needed.

So when you do the merge and there are missing values filled with NaN then the column type will be automatically changed to float. This is because all rows of that column must be of the same data type.

If you specifically wanted to have int after the merge then you'd need to use the fillna method and define what integers should replace the missing values. E.g. people sometimes use -1 for counts. A simple example:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, np.nan, 5],
    'b': [1, np.nan, 3, 4, 5]
})
df = df.fillna(value=-1).astype(int)

print(df.dtypes)

Result is:

a    int64
b    int64
dtype: object

CodePudding user response：

It seems to be legacy where NaNs could be only part of type float. And pandas does an explicit conversion of the dataframe.

e.g. if you have have a dataframe like below :

>>> df1 = pd.DataFrame({'a1': [0,  0,  1, np.NaN ]})
>>> df1.astype(int)
ValueError: Cannot convert non-finite values (NA or inf) to integer

However with v0.24, if you use explicit dtype=pd.Int64Dtype(), it is now possible to hold NaNs in integers.

>>> df1 = pd.DataFrame({'a1': [0, 0, 1, 1, 2],
                    'b': [0, 0, 1, 1, 1],
                    'c': [11, 8, 10, 6, 6]}, dtype=pd.Int64Dtype())
>>> df2 = pd.DataFrame({'a2': [0, 1, 1, 1, 3],
                    'b': [0, 0, 0, 1, 1],
                    'd': [22, 24, 25, 33, 37]}, dtype=pd.Int64Dtype())
>>> df_merge = pd.merge(df1, df2, how='left', left_on=['a1', 'b'], right_on=['a2', 'b'])
>>> df_merge

    a1  b   c   a2  d
0   0   0   11  0   22
1   0   0   8   0   22
2   1   1   10  1   33
3   1   1   6   1   33
4   2   1   6   <NA>    <NA>

Note the values are still integers and NaN is now <NA>