Does pandas categorical data speed up indexing?

Time:03-19

Somebody told me it is a good idea to convert identifying columns (e.g. person numbers) from strings to categorical. This would speed up some operations like searching, filtering and grouping.

I understand that a 40-character string costs much more RAM, and takes more time to compare, than a simple integer.

But I would have some overhead from maintaining a str-to-int table to translate between the two types and to know which integer belongs to which string "number".

Maybe .astype('category') can help me here? Isn't this an integer internally? Does this speed up some operations?

CodePudding user response:

The user guide has the following about categorical data use cases:

The categorical data type is useful in the following cases:

  • A string variable consisting of only a few different values. Converting such a string variable to a categorical variable will save some memory.

  • The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of the lexical order.

  • As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use suitable statistical methods or plot types).
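The first two bullets can be checked with a small sketch. The column contents and the names as_str, as_cat and size_type are made up for illustration, and the exact byte counts depend on the pandas version and on which string dtype is in use.

```python
import pandas as pd

# Hypothetical column: 300,000 rows but only three distinct string values.
as_str = pd.Series(["small", "medium", "large"] * 100_000)
as_cat = as_str.astype("category")

# deep=True also counts the string payloads, not only the per-row references.
str_bytes = as_str.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(f"strings: {str_bytes:,} bytes, categorical: {cat_bytes:,} bytes")

# Plain strings sort lexically: "large" < "medium" < "small".
print(sorted(as_str.unique()))

# An ordered categorical uses the declared logical order instead.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
ordered = as_str.astype(size_type)
print(ordered.min(), ordered.max())
```

The saving comes from storing each distinct string once and keeping only a small integer per row, so it shrinks as the number of distinct values approaches the number of rows.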

See also the API docs on categoricals.

The book, Python for Data Analysis by Wes McKinney, has the following on this topic:

The categorical representation can yield significant performance improvements when you are doing analytics. You can also perform transformations on the categories while leaving the codes unmodified. Some example transformations that can be made at relatively low cost are:

  • Renaming categories
  • Appending a new category without changing the order or position of the existing categories
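A minimal sketch of these two transformations, using a made-up four-row Series, shows that the per-row codes stay the same while only the small categories index changes:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c"], dtype="category")
codes_before = s.cat.codes.to_numpy().copy()

# Renaming touches only the categories index, not the per-row codes.
renamed = s.cat.rename_categories({"a": "alpha", "b": "beta", "c": "gamma"})

# A new category is appended at the end; existing positions are unchanged.
extended = s.cat.add_categories(["d"])

print(list(renamed))
print(list(extended.cat.categories))
print((renamed.cat.codes.to_numpy() == codes_before).all())
print((extended.cat.codes.to_numpy() == codes_before).all())
```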

GroupBy operations can be significantly faster with categoricals because the underlying algorithms use the integer-based codes array instead of an array of strings.
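One way to measure this on your own data is a rough timing comparison like the sketch below. The data is synthetic (1,000 hypothetical 40-character IDs), the helper timed is made up for this example, and the timings will vary by machine and pandas version, so treat the numbers as indicative only.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical data: 1,000 distinct 40-character IDs spread over 200,000 rows.
ids = np.array([f"{i:040d}" for i in range(1_000)])
df = pd.DataFrame({
    "person": rng.choice(ids, size=200_000),
    "value": rng.random(200_000),
})
df["person_cat"] = df["person"].astype("category")


def timed(func):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


by_str, t_str = timed(lambda: df.groupby("person")["value"].sum())
by_cat, t_cat = timed(
    lambda: df.groupby("person_cat", observed=True)["value"].sum()
)

print(f"string keys: {t_str:.4f}s, categorical keys: {t_cat:.4f}s")
# Both groupings should produce the same sums in the same key order.
print(np.allclose(by_str.to_numpy(), by_cat.to_numpy()))
```

Note that the one-time cost of astype("category") is not included in the timing, so the conversion pays off mainly when the column is grouped or filtered repeatedly.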

Series containing categorical data have several special methods similar to the Series.str specialized string methods. This also provides convenient access to the categories and codes.
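As a small illustration of the .cat accessor, the sketch below uses made-up person numbers. It also shows that pandas keeps the string-to-integer lookup table itself, so no hand-written mapping is needed.

```python
import pandas as pd

# Hypothetical person numbers stored as a categorical column.
people = pd.Series(["P-003", "P-001", "P-003", "P-002"], dtype="category")

categories = people.cat.categories  # the distinct strings, stored once
codes = people.cat.codes            # one small integer per row

print(list(categories))
print(list(codes), codes.dtype)

# The original strings can be rebuilt from the codes and the categories.
restored = list(categories[codes.to_numpy()])
print(restored)

# Filtering still uses the string value; no manual translation is needed.
mask = people == "P-003"
print(list(mask))
```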

In large datasets, categoricals are often used as a convenient tool for memory savings and better performance.
