How to keep data types when reading from s3 bucket using pandas-CodePudding

Im using a Lambda function to read a csv from s3 and insert the data into a table. When reading from s3 I've noticed that when I read it to disk and the read it in pandas any field that an int and starts with a 0 is getting turned into a float and any leading 0 is getting removed. How do keep the int values as ints when reading from s3 bucket?

for example if I have a field account_number with the value 012345678 when I read in from s3 it becomes 1234567.0, turning my integer to Float64. This is causing downstream problems. How do I get around this issue.

Sample code:

export_csv = client.get_object(Bucket = s3_bucket, Key = bo_file_name)
bo_body = export_csv['Body']
csv_string = bo_body.read().decode('utf-8')
bo_df = pd.read_csv(StringIO(csv_string), sep = '|', quotechar = '"')
#Replacing NaN values with Nulls.
print('removing nulls from df')
bo_df.fillna('', inplace = True)
sql_server = os.getenv('DB_ENDPOINT')

sample schema after reading into pandas:

first_name                                object
middle_initial                            object
last_name                                 object
date_of_birth                             object
actnum                                   float64

CodePudding user response：

The issue here is that your account 'number' is really a string that happens to be all numeric (if that makes sense) - it has no mathematical meaning, but pandas thinks it does, so it's treating it as a float.

The solution is to tell Pandas which datatype to use, to over-ride its (usually) good intention of inferring it.

Here's a reproducible walkthrough:

import pandas as pd
data = ["01", "02", "03", "10", "11", "", None]
someDF = pd.DataFrame(data)
someDF.to_csv("accts.csv", index=False)

Now read it back:

pd.read_csv("accts.csv")

You get this:

Not what you had in mind.

Now try it by specifying the datatype explicitly:

importedDF = pd.read_csv("accts.csv", names=["Account"], header=0, dtype=str).fillna(value="")
print(importedDF)

     Account
0    01
1    02
2    03
3    10
4    11
5      
6

Notes:

some good info here on the background of the problem of integer columns with nulls getting auto-converted to floats: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
if you pass in a dict of column names and datatypes, you can pick and choose for which columns to control the datatype - details are here: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html