Im using a Lambda function to read a csv from s3 and insert the data into a table. When reading from s3 I've noticed that when I read it to disk and the read it in pandas any field that an int and starts with a 0
is getting turned into a float and any leading 0
is getting removed. How do keep the int values as ints when reading from s3 bucket?
for example if I have a field account_number
with the value 012345678
when I read in from s3 it becomes 1234567.0
, turning my integer to Float64
. This is causing downstream problems. How do I get around this issue.
Sample code:
export_csv = client.get_object(Bucket = s3_bucket, Key = bo_file_name)
bo_body = export_csv['Body']
csv_string = bo_body.read().decode('utf-8')
bo_df = pd.read_csv(StringIO(csv_string), sep = '|', quotechar = '"')
#Replacing NaN values with Nulls.
print('removing nulls from df')
bo_df.fillna('', inplace = True)
sql_server = os.getenv('DB_ENDPOINT')
sample schema after reading into pandas
:
first_name object
middle_initial object
last_name object
date_of_birth object
actnum float64
CodePudding user response:
The issue here is that your account 'number' is really a string that happens to be all numeric (if that makes sense) - it has no mathematical meaning, but pandas thinks it does, so it's treating it as a float.
The solution is to tell Pandas which datatype to use, to over-ride its (usually) good intention of inferring it.
Here's a reproducible walkthrough:
import pandas as pd
data = ["01", "02", "03", "10", "11", "", None]
someDF = pd.DataFrame(data)
someDF.to_csv("accts.csv", index=False)
Now read it back:
pd.read_csv("accts.csv")
You get this:
0
0 1.0
1 2.0
2 3.0
3 10.0
4 11.0
5 NaN
6 NaN
Not what you had in mind.
Now try it by specifying the datatype explicitly:
importedDF = pd.read_csv("accts.csv", names=["Account"], header=0, dtype=str).fillna(value="")
print(importedDF)
Account
0 01
1 02
2 03
3 10
4 11
5
6
Notes:
- some good info here on the background of the problem of integer columns with nulls getting auto-converted to floats: https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html
- if you pass in a dict of column names and datatypes, you can pick and choose for which columns to control the datatype - details are here: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html