I am working with a database that is very poorly organized. There are CustomerIds that are somehow bigger than int64. Here is an example: 88168142359034442077.0
In order to be able to use this ID, I need to turn it into a string and remove the decimal. I have tried to use the following code:
testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953']})
testdf['CUSTID'] = testdf['CUSTID'].astype('float64').astype('int64').astype(str)
testdf.display()
When I use the above method I get an overflow, and the numbers that are bigger than int64 become negative, like -9223372036854775808 for 88168142359034442077.0213.
I have been looking for other ways to make the change from string to float, then float to int, and finally int to string again.
One method that I tried is to just not use astype('int64'), but that turns the output into scientific notation, like 8.816814235903445e+19 for 88168142359034442077.0213, and other than using regex to remove the decimal and the 'e+19' I don't really see what else I can do.
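To make the failure concrete, here is a minimal demonstration (plain Python, no pandas needed): float64 rounds the value before int64 ever sees it, and the integer part is larger than anything int64 can hold.
s = '88168142359034442077.0213'
print(float(s))       # scientific notation, and the trailing digits are already rounded away
print(int(float(s)))  # a Python int, but no longer the original ID
print(2**63 - 1)      # 9223372036854775807 -- the int64 maximum, far smaller than the ID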
Any information is appreciated. Thanks!
CodePudding user response:
Posting as an answer because this became too large and I believe it has further value.
I'd be very surprised if those values are the real and expected IDs and not an erroneous artifact of importing some text or binary format.
Specifically, the authoring program(s) and the database itself are almost certainly not using some high-memory decimal representation for a customer identifier, and would instead use "normal" types such as int64 if they are represented that way at all!
Further, floating-point values expose programs to IEEE 754 floating-point aliasing woes (see Is floating point math broken?), which will subtly foil all sorts of lookups and comparisons, and generally just can't pleasantly or consistently represent these values, so it's unlikely that anyone would reasonably use them.
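As a quick illustration of that aliasing, two IDs at this magnitude that differ only in their last digit become indistinguishable once they pass through float64:
# consecutive "IDs" collapse onto the same float64 value, so any float-based
# comparison, join, or deduplication would silently treat them as equal
a = float("88168142359034442077")
b = float("88168142359034442078")
print(a == b)  # True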
A contrived example
>>> data1 = "111001111001110100110001111000110110110111110101111000111001110110110010110001110110101110110000110010110011110100110010110011110101110001"
>>> data2 = "111000111000110001110110111000110001110100110010110011110101111001110000110011110100110100110100110010110000110111110111101110110000110010110001110011"
>>> for data in (data1, data2):
... print("".join(chr(eval("0b" + data[block:block+6])) for block in range(0, len(data), 6)))
...
99418675896216.02342351
88168142359034442077.0213
It's a long shot, but perhaps a fair suspicion that this can happen when
- someone is entering a new entry, but doesn't have a customer ID (yet?)
- a UI is coded to only accept numeric strings
- there is no other checking and the database stores the value as a string
You could attempt a further comparison of these to see, for example (a rough sketch follows this list), whether
- they are all from a specific user
- they are all from a specific date
- the string representation becomes longer or shorter as time progresses (as the user becomes lazier or less sure they have used a value)
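A rough pandas sketch of such checks; note that 'created_by' and 'created_at' are assumed audit columns, so substitute whatever your table actually provides:
import pandas as pd

# toy stand-in for the real table; only 'CUSTID' comes from the question,
# 'created_by' and 'created_at' are hypothetical audit columns
df = pd.DataFrame({
    'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953'],
    'created_by': ['web_form', 'web_form', 'import_job'],
    'created_at': pd.to_datetime(['2021-03-01', '2021-03-02', '2020-11-15']),
})

suspicious = df[df['CUSTID'].str.contains('.', regex=False)]
print(suspicious.groupby('created_by').size())                      # all from one user/source?
print(suspicious.groupby(suspicious['created_at'].dt.date).size())  # clustered in time?
print(suspicious.sort_values('created_at')['CUSTID'].str.len())     # does the length drift over time?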
CodePudding user response:
testdf['CUSTID'] is a pandas.Series object containing Python string objects. For a pandas.Series object to hold integers this large, the most straightforward type to use is Python's arbitrary-precision int (as opposed to the native NumPy types, which are more efficient but cannot hold such values). You can convert through a Decimal first so as to get rid of the non-integer part. The conversion can be done using map:
import decimal
testdf['CUSTID'] = list(map(int, map(decimal.Decimal, testdf['CUSTID'].to_list())))
This is not very efficient, but both Unicode string objects and large variable-sized integer objects are inherently inefficient anyway. Since NumPy does not support large integers natively, this is certainly the best option (though one may find a faster way to get rid of the non-integer part than using the decimal package).
Here is a string-based parsing method that is certainly slower but supports very large integers without relying on a large fixed-size decimal precision:
testdf['CUSTID'] = [int(s.split('.')[0]) for s in testdf['CUSTID'].to_list()]
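Either way, the result is an object-dtype Series holding arbitrary-precision Python ints, which you can verify with something like:
print(testdf['CUSTID'].iloc[1])        # 88168142359034442077
print(type(testdf['CUSTID'].iloc[1]))  # <class 'int'>
print(testdf['CUSTID'].dtype)          # object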
CodePudding user response:
I would recommend just leaving them as strings and trimming everything after the ".":
import pandas as pd
testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351', '88168142359034442077.0213', '53056496953']})
# guard against IDs without a ".", where find() would return -1 and chop the last digit
testdf['CUSTID'] = testdf['CUSTID'].apply(lambda s: s[:s.find(".")] if "." in s else s)
testdf.display()
Note that you could replace the lambda with something different, e.g. lambda s: s.split(".", 1)[0] or lambda s: re.match(r"^(\d+)(?:\.(\d+))?$", s).groups()[0], but I would not expect any variation to make a dramatic difference. Just test them on some sample input to see which one works best for you (a rough timing sketch follows).
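For instance, a comparison could look like the sketch below; the exact numbers will depend on your data and environment:
import re
import timeit

import pandas as pd

testdf = pd.DataFrame({'CUSTID': ['99418675896216.02342351',
                                  '88168142359034442077.0213',
                                  '53056496953'] * 10000})

candidates = {
    'find':  lambda s: s[:s.find(".")] if "." in s else s,
    'split': lambda s: s.split(".", 1)[0],
    'regex': lambda s: re.match(r"^(\d+)(?:\.(\d+))?$", s).groups()[0],
}

for name, fn in candidates.items():
    seconds = timeit.timeit(lambda: testdf['CUSTID'].apply(fn), number=10)
    print(f"{name}: {seconds:.3f} s for 10 runs")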
Alternatively, you may want to use the str accessor of the pandas Series with extract(), i.e.:
# a single capture group with expand=False yields a Series rather than a DataFrame
testdf['CUSTID'] = testdf['CUSTID'].str.extract(r"^(\d+)(?:\.\d+)?$", expand=False)
but I am unsure this would be any faster than the aforementioned solutions.
Perhaps you can achieve something faster with rstrip(), but your code would not be as simple as the above, since you would need to handle values without a "." differently (as a no-op) from the rest; a possible sketch is shown below.
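A minimal sketch of that idea, assuming only digits ever follow the dot:
# only strip when the value actually contains a ".", otherwise leave it untouched;
# rstrip removes the trailing fractional digits, then the trailing dot itself
def drop_fraction(s: str) -> str:
    if "." not in s:
        return s
    return s.rstrip("0123456789").rstrip(".")

testdf['CUSTID'] = testdf['CUSTID'].apply(drop_fraction)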