Home > other >  Pandas - Reading in json, boolean is typed as float if it contains null value
Pandas - Reading in json, boolean is typed as float if it contains null value

Time:09-22

I have a JSON file on S3 with the following data:

{"id": "1", "col1": null, "col2": "Test1", "col3": true}
{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}
{"id": "3", "col1": "def", "col2": "Test3", "col3": true}

Note, col3 has boolean values in it, but one line has a null value. Here is my code:

import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True)
df.dtypes

Output:

id        int64
col1     object
col2     object
col3    float64
dtype: object

df:

    id  col1    col2    col3
0   1   None    Test1   1.0
1   2   abc     Test2   NaN
2   3   def     Test3   1.0

Why is it that the col3 is typed as a float?

Col1 has a null but seems to correctly be typed as an object (string). Is there a generic way (not explicitly casting "col3" to boolean) to get the boolean type on the dataframe?

CodePudding user response:

Assuming OP consider null as false we can replace the None values of col3 with False

import json

data_str = '''{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}'''
data = json.loads(data_str)
if not isinstance(data['col3'], bool):
    data['col3'] = False
print(data)

CodePudding user response:

For this situation, it seems that adding "dtype=False" to the read_json call gave me the desired behavior.

import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True, dtype=False)
df.dtypes

Output:

id      object
col1    object
col2    object
col3    object
dtype: object

If I then write this dataframe to S3 as a parquet file, the output file is correctly typed as boolean.

  • Related