I have a weird problem when storing numbers with a large number of digits (18 digits) in Parquet: when I read them back, I get different values. Drilling further, the problem only happens when the input list is a mix of None and actual values. When the list is free of None values, the values come back as expected.
I don't think it's a display problem; I tried viewing the output with Unix commands like cat and in the vi editor, and the values are genuinely different.
There are two sections in the code:
1. Creates a parquet file from a list with a mix of None values and large numbers. This is where the problem is. For example, the value 235313013750949476 is changed to 235313013750949472 in the output.
2. Creates a parquet file from a list of just large numbers and no None values. This works as expected.
Code
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def get_row_list():
    row_list = []
    row_list.append(None)
    row_list.append(235313013750949476)
    row_list.append(None)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    row_list.append(None)
    row_list.append(None)
    return row_list


def get_row_list_with_no_none():
    row_list = []
    row_list.append(235313013750949476)
    row_list.append(135313013750949496)
    row_list.append(935313013750949406)
    row_list.append(835313013750949456)
    return row_list


def create_parquet(row_list, col_list, parquet_filename):
    df = pd.DataFrame(row_list, columns=col_list)
    schema_field_list = [('tree_id', pa.int64())]
    pa_schema = pa.schema(schema_field_list)
    table = pa.Table.from_pandas(df, pa_schema)
    pq_writer = pq.ParquetWriter(parquet_filename, schema=pa_schema)
    pq_writer.write_table(table)
    pq_writer.close()
    print("Parquet file [%s] created" % parquet_filename)


def main():
    col_list = ['tree_id']
    # Row list without any None
    row_list = get_row_list_with_no_none()
    print(row_list)
    create_parquet(row_list, col_list, 'without_none.parquet')
    # Row list with None
    row_list = get_row_list()
    print(row_list)
    create_parquet(row_list, col_list, 'with_none.parquet')


# ==== Main code execution ====
if __name__ == '__main__':
    main()
[Execution]
python test-parquet.py
[235313013750949476, 135313013750949496, 935313013750949406, 835313013750949456]
Parquet file [without_none.parquet] created
[None, 235313013750949476, None, 135313013750949496, 935313013750949406, 835313013750949456, None, None]
Parquet file [with_none.parquet] created
[Lib version]
pyarrow 5.0.0
pandas 1.1.5
python -V
Python 3.6.6
[Tested by consuming parquet as spark df]
>>> dfwithoutnone = spark.read.parquet("s3://some-bucket/without_none.parquet/")
>>> dfwithoutnone.count()
4
>>> dfwithoutnone.printSchema()
root
|-- tree_id: long (nullable = true)
>>> dfwithoutnone.show(10, False)
+------------------+
|tree_id           |
+------------------+
|235313013750949476|
|135313013750949496|
|935313013750949406|
|835313013750949456|
+------------------+
>>> df_with_none = spark.read.parquet("s3://some-bucket/with_none.parquet/")
>>> df_with_none.count()
8
>>> df_with_none.printSchema()
root
|-- tree_id: long (nullable = true)
>>> df_with_none.show(10, False)
+------------------+
|tree_id           |
+------------------+
|null              |
|235313013750949472|
|null              |
|135313013750949504|
|935313013750949376|
|835313013750949504|
|null              |
|null              |
+------------------+
I searched on Stack Overflow but could not find anything relevant. Can you please provide some pointers?
Thanks
CodePudding user response:
The problem isn't related to Parquet, but to the initial conversion of the row_list to a pandas DataFrame:
row_list = get_row_list()
col_list = ['tree_id']
df = pd.DataFrame(row_list, columns=col_list)
>>> df
        tree_id
0           NaN
1  2.353130e+17
2           NaN
3  1.353130e+17
4  9.353130e+17
5  8.353130e+17
6           NaN
7           NaN
Because there are missing values, pandas creates a float64 column, and it is this int -> float conversion that loses precision for such large integers.
Later converting the float to an integer again (when creating the pyarrow Table with a schema that forces an integer column) will then result in a slightly different value, as can be seen doing this manually in python as well:
>>> row_list[1]
235313013750949476
>>> df.loc[1, "tree_id"]
2.3531301375094947e+17
>>> int(df.loc[1, "tree_id"])
235313013750949472
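The underlying reason is that float64 has only a 53-bit significand, so not every integer above 2**53 is representable; round-tripping through float snaps the value to the nearest representable one. A minimal check, using only plain Python:

```python
x = 235313013750949476

# float64 can represent every integer exactly only up to 2**53;
# this value is far beyond that threshold
assert x > 2 ** 53

# converting to float rounds x to the nearest representable float64,
# so converting back to int gives a slightly different value
print(int(float(x)))  # 235313013750949472
```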
One possible solution is to avoid the temporary DataFrame. This will depend on your exact (real) use case, of course, but if you start from a Python list as in the reproducible example above, you can also create a pyarrow.Table directly from this list of values (pa.table({"tree_id": row_list}, schema=..)), and this will preserve the exact values in the Parquet file.