Slightly expanding the Example 1: Merge on Multiple Columns with Different Names, results in the following Python code using Pandas pandas.DataFrame.merge:
# Create and view first DataFrame:
df1 = pd.DataFrame({'a1': [0, 0, 1, 1, 2],
'b': [0, 0, 1, 1, 1],
'c': [11, 8, 10, 6, 6]})
print(df1)
print(df1.dtypes)
print()
# Create and view second DataFrame:
df2 = pd.DataFrame({'a2': [0, 1, 1, 1, 3],
'b': [0, 0, 0, 1, 1],
'd': [22, 24, 25, 33, 37]})
print(df2)
print(df2.dtypes)
print()
# Merge df1 and df2:
df_merge = pd.merge(df1, df2, how='left', left_on=['a1', 'b'], right_on=['a2', 'b'])
print(df_merge)
print(df_merge.dtypes)
The resulting output (I've added line numbers):
1 a1 b c
2 0 0 0 11
3 1 0 0 8
4 2 1 1 10
5 3 1 1 6
6 4 2 1 6
7 a1 int64
8 b int64
9 c int64
10 dtype: object
11
12 a2 b d
13 0 0 0 22
14 1 1 0 24
15 2 1 0 25
16 3 1 1 33
17 4 3 1 37
18 a2 int64
19 b int64
20 d int64
21 dtype: object
22
23 a1 b c a2 d
24 0 0 0 11 0.00 22.00
25 1 0 0 8 0.00 22.00
26 2 1 1 10 1.00 33.00
27 3 1 1 6 1.00 33.00
28 4 2 1 6 NaN NaN
29 a1 int64
30 b int64
31 c int64
32 a2 float64
33 d float64
34 dtype: object
Notice the type of a2
and d
columns in the resulting df_merge
dataframe on lines 24 through 27 have changed from the original int64
to float64
. Why would it need to change the types?
Even the example in the manual at df1.merge(df2, how='left', on='a') shows a 3.0
where I would have expected it to stay an int64
:
df1.merge(df2, how='left', on='a')
a b c
0 foo 1 3.0
1 bar 2 NaN
But it doesn't explain why. I see How to left merge two dataframes with nan without changing types from integer to float types indicates that NaN
's seem to be a factor, but doesn't answer my specific question here as to why the type conversion happens.
If I change df1
to remove the last row:
df1 = pd.DataFrame({'a1': [0, 0, 1, 1, ],
'b': [0, 0, 1, 1, ],
'c': [11, 8, 10, 6, ]})
Then the output becomes what I would expect:
a1 b c a2 d
0 0 0 11 0 22
1 0 0 8 0 22
2 1 1 10 1 33
3 1 1 6 1 33
a1 int64
b int64
c int64
a2 int64
d int64
dtype: object
CodePudding user response:
The reason for this is that NaN
is of type float.
import numpy as np
print(type(np.nan))
<class 'float'>
There are good reasons why this is the case which I can explain in comments if needed.
So when you do the merge and there are missing values filled with NaN
then the column type will be automatically changed to float
. This is because all rows of that column must be of the same data type.
If you specifically wanted to have int
after the merge then you'd need to use the fillna
method and define what integers should replace the missing values. E.g. people sometimes use -1 for counts. A simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a': [1, 2, 3, np.nan, 5],
'b': [1, np.nan, 3, 4, 5]
})
df = df.fillna(value=-1).astype(int)
print(df.dtypes)
Result is:
a int64
b int64
dtype: object
CodePudding user response:
It seems to be legacy where NaNs
could be only part of type float
. And pandas does an explicit conversion of the dataframe.
e.g. if you have have a dataframe like below :
>>> df1 = pd.DataFrame({'a1': [0, 0, 1, np.NaN ]})
>>> df1.astype(int)
ValueError: Cannot convert non-finite values (NA or inf) to integer
However with v0.24
, if you use explicit dtype=pd.Int64Dtype()
, it is now possible to hold NaNs
in integers.
>>> df1 = pd.DataFrame({'a1': [0, 0, 1, 1, 2],
'b': [0, 0, 1, 1, 1],
'c': [11, 8, 10, 6, 6]}, dtype=pd.Int64Dtype())
>>> df2 = pd.DataFrame({'a2': [0, 1, 1, 1, 3],
'b': [0, 0, 0, 1, 1],
'd': [22, 24, 25, 33, 37]}, dtype=pd.Int64Dtype())
>>> df_merge = pd.merge(df1, df2, how='left', left_on=['a1', 'b'], right_on=['a2', 'b'])
>>> df_merge
a1 b c a2 d
0 0 0 11 0 22
1 0 0 8 0 22
2 1 1 10 1 33
3 1 1 6 1 33
4 2 1 6 <NA> <NA>
Note the values are still integers and NaN
is now <NA>