I have a JSON file on S3 with the following data:
{"id": "1", "col1": null, "col2": "Test1", "col3": true}
{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}
{"id": "3", "col1": "def", "col2": "Test3", "col3": true}
Note, col3 has boolean values in it, but one line has a null value. Here is my code:
import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True)
df.dtypes
Output:
id int64
col1 object
col2 object
col3 float64
dtype: object
df:
id col1 col2 col3
0 1 None Test1 1.0
1 2 abc Test2 NaN
2 3 def Test3 1.0
Why is it that the col3 is typed as a float?
Col1 has a null but seems to correctly be typed as an object (string). Is there a generic way (not explicitly casting "col3" to boolean) to get the boolean type on the dataframe?
CodePudding user response:
Assuming OP consider null
as false
we can replace the None
values of col3
with False
import json
data_str = '''{"id": "2", "col1": "abc", "col2": "Test2", "col3": null}'''
data = json.loads(data_str)
if not isinstance(data['col3'], bool):
data['col3'] = False
print(data)
CodePudding user response:
For this situation, it seems that adding "dtype=False" to the read_json call gave me the desired behavior.
import pandas as pd
source_path = 's3://bucket/prefix/file.json'
df = pd.read_json(source_path, lines=True, dtype=False)
df.dtypes
Output:
id object
col1 object
col2 object
col3 object
dtype: object
If I then write this dataframe to S3 as a parquet file, the output file is correctly typed as boolean.