Home > other >  How to prevent Python/pandas from treating ids like numbers
How to prevent Python/pandas from treating ids like numbers

Time:07-16

I have a data set in a csv with some id's that are very long numbers like this:

963839330864351104
426545668232740352
811862613586429056

when I read the csv and convert my dataset into a dataframe, pandas incorrectly thinks it is a number and converts them to scientific notifation so those numbers above become this:

9.638393308643511e 17
4.2654566823274035e 17
8.11862613586429e 17

I tried to resolve this by turning the series to int64 and that works for the most part but here is the problem:

some numbers may look like this orignally

191268415833367997

now, after the number goes from scientific notation to int64 (or int), the number becomes this:

191268415833368000 #some how pandas think that it's ok to round the number up

what I need is to make pandas understand that the column that holds those id's in the dataset is a string column and it should not be treated as number, possibly as of the import so I'm not converting numbers into ints and then later into strings while I format work and transform my dataset to fit my needs.

what I've tried the following but it has not worked:

  1. I've tried converting the series into int and int64
  2. I've tried reading the csv with float_precision='round_trip'
  3. I've tried to convert the series holding the id's into a string
  4. I've tried to set the display differently: pd.options.display.float_format = '{:.0f}'.format

I've seen this question being asked in different ways but there's no definite answer or something I've been able to understand to adapt to my issue. it should be straigt forward but somehow it's not.

help is appreciated it.

CodePudding user response:

Since you mentioned you're loading from csv, you can simply inform pandas you want to treat that column as a string:

from io import StringIO
from pandas import read_csv

data = StringIO('''
id1,id2
963839330864351104,963839330864351104
426545668232740352,426545668232740352
811862613586429056,811862613586429056
191268415833367997,191268415833367997
''')

df = read_csv(data, dtype={'id1': str, 'id2': float})

print(
    df, df.dtypes, sep='\n'*2
)
                  id1           id2
0  963839330864351104  9.638393e 17
1  426545668232740352  4.265457e 17
2  811862613586429056  8.118626e 17
3  191268415833367997  1.912684e 17

id1     object
id2    float64
dtype: object

In this case, read_csv treats 'id1' as string values, and 'id2' as numeric (float) values since it detects that those are floating point numbers.

CodePudding user response:

I found this answer on another question:

Pandas reading csv as string type

this did the trick

pd.read_csv(f, dtype=str)

thank you

  • Related