Probably obvious, but just wanted to confirm:
What would be the best suited data type for a numerical ID in pandas?
Let's say that I have a sequential numerical ID, `user_id`. Which would be better:
- an `int64` type (which would seem to be the most obvious choice given the numerical representation of the field)
- a `category` type (which might make more sense, given that the ID is not to be used for actual numerical operations, but rather as a unique identifier)
Same question for character-based IDs: would it be better to use an `object` or a `category` type?
I would be tempted to use the `category` data type (thinking there might be performance benefits, as I would imagine these categories are somehow optimised/hashed/indexed for performance), but I was wondering whether this data type is better suited to a more limited set of distinct values than the hundreds of thousands of unique `user_id` values I might have in my dataset.
Thanks!
CodePudding user response:
Operating on object-typed dataframes/arrays is slow because Pandas needs to operate on each item using the inefficient CPython interpreter. This causes a high overhead due to reference counting, internal pointer indirections, type checks, internal function calls, etc. Pandas often uses Numpy internally, which can be much faster when the types are native, like `int64`, `int32`, `float64`, etc. In that case, Numpy can execute optimized native code that is not slowed down by the CPython overheads and that can even benefit from hardware SIMD units (depending on the target function used).

While Numpy supports bounded strings, Pandas does not use them and relies on slow CPython string objects instead. Strings are inherently slow, even in native code, because of their generally variable size, which is often unpredictable (this strongly impacts the processor, which needs to predict branches in order to be fast; see this post about branch prediction). In practice, Unicode characters make strings even slower (they make the use of SIMD instructions very difficult and branches even harder to predict).

Categoricals are basically integers associated with a mapping table (of unique values). Categorical columns can theoretically be faster for some computations because the table is already computed. However, the initial computation of the table can be expensive. Additionally, the table is not always used efficiently where it could be, which sometimes results in surprisingly slower execution compared to integers. Not to mention that the table can be big when all the values are different.

Integers are the least expensive type. Smaller integers can often be faster. Indeed, SIMD vectors have a fixed size (e.g. the AVX-2 SIMD instruction set of x86-64 processors can process 32 `int8` values at once compared to only 4 `int64`). Furthermore, smaller items make the whole column take less memory, which reduces the amount of data moved to and from memory and so improves the performance of memory-bound code (starting with dataframe copies, which are pretty frequent in Pandas). However, this is not always faster because smaller types can sometimes cause type conversions, adding overhead (though this overhead can be mitigated using lower-level optimizations). Thus, if you are working on huge dataframes, please consider using small integer types. Otherwise, `int64` is certainly a very good option.
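
To make the memory side of this concrete, here is a minimal sketch that stores the same IDs with different dtypes and compares their footprint. The row count `n`, the `user_id` values and the `u0000001`-style string format are made up for illustration, and the exact byte counts will depend on your pandas version and platform, so treat it as something to run on your own data rather than a benchmark.

```python
import numpy as np
import pandas as pd

n = 200_000  # hypothetical number of distinct users

# Sequential numerical IDs, every value unique.
ids_int64 = pd.Series(np.arange(n, dtype=np.int64), name="user_id")

# Let pandas pick the smallest signed integer type that can hold the values.
ids_small = pd.to_numeric(ids_int64, downcast="integer")

# Category: integer codes plus a mapping table holding every unique value.
ids_category = ids_int64.astype("category")

# Character-based IDs (hypothetical format) stored as Python string objects.
ids_object = pd.Series([f"u{i:07d}" for i in range(n)], dtype=object, name="user_id")

columns = {
    "int64": ids_int64,
    str(ids_small.dtype): ids_small,
    "category (int)": ids_category,
    "object (str)": ids_object,
}

# deep=True also counts the Python objects referenced by object columns.
mem = {label: s.memory_usage(index=False, deep=True) for label, s in columns.items()}

for label, nbytes in mem.items():
    print(f"{label:>16}: {nbytes:>12,} bytes")
```

When every value is distinct, the category column has to keep both the codes and the full table of unique values, so it should come out larger than the plain integer column; the category type pays off mainly when values repeat a lot.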