Pandas - Reading in json, boolean is typed as float if it contains null value-CodePudding

I have a JSON file on S3 with the following data:

{"id": "1", "col1": null, "col2": "Test1", "col3": true}
{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}
{"id": "3", "col1": "def", "col2": "Test3", "col3": true}

Note, col3 has boolean values in it, but one line has a null value. Here is my code:

import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True)
df.dtypes

Output:

id        int64
col1     object
col2     object
col3    float64
dtype: object

df:

    id  col1    col2    col3
0   1   None    Test1   1.0
1   2   abc     Test2   NaN
2   3   def     Test3   1.0

Why is it that the col3 is typed as a float?

Col1 has a null but seems to correctly be typed as an object (string). Is there a generic way (not explicitly casting "col3" to boolean) to get the boolean type on the dataframe?

CodePudding user response：

Assuming OP consider null as false we can replace the None values of col3 with False

import json

data_str = '''{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}'''
data = json.loads(data_str)
if not isinstance(data['col3'], bool):
    data['col3'] = False
print(data)

CodePudding user response：

For this situation, it seems that adding "dtype=False" to the read_json call gave me the desired behavior.

import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True, dtype=False)
df.dtypes

Output:

id      object
col1    object
col2    object
col3    object
dtype: object

If I then write this dataframe to S3 as a parquet file, the output file is correctly typed as boolean.