Improve Pandas performance for very large dataframes?


I have a few Pandas dataframes with several millions of rows each. The dataframes have columns containing JSON objects each with 100 fields. I have a set of 24 functions that run sequentially on the dataframes, process the JSON (for example, compute some string distance between two fields in the JSON) and return a JSON with some new fields added. After all 24 functions execute, I get a final JSON which is then usable for my purposes.
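
For concreteness, here is a minimal sketch of the kind of per-row pipeline described above; the column name, field names, and functions are all made-up placeholders, not the real code:

```python
import json

import pandas as pd

df = pd.DataFrame({
    "payload": [
        '{"name": "alice", "alt_name": "alicia", "city": "Oslo"}',
        '{"name": "bob", "alt_name": "rob", "city": "Lima"}',
    ]
})

def add_name_length(record: dict) -> dict:
    record["name_len"] = len(record["name"])
    return record

def add_name_distance(record: dict) -> dict:
    # Stand-in for the external string-distance package mentioned above.
    record["name_distance"] = abs(len(record["name"]) - len(record["alt_name"]))
    return record

PIPELINE = [add_name_length, add_name_distance]  # 24 such steps in practice

def run_pipeline(json_str: str) -> str:
    record = json.loads(json_str)
    for step in PIPELINE:
        record = step(record)
    return json.dumps(record)

df["payload"] = df["payload"].map(run_pipeline)
print(df["payload"].iloc[0])
```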

I am wondering about the best ways to speed up performance for this dataset. A few things I have considered and read up on:

  • It is tricky to vectorize because many operations are not as straightforward as "subtract this column's values from another column's values".
  • I read up on some of the Pandas documentation, and a couple of options it suggests are Cython (which may be tricky to apply to the string edit distance, especially since I am using an external Python package) and Numba/JIT (but this is said to be best suited to numerical computations).
  • Possibly controlling the number of threads could be an option. The 24 functions can mostly operate without any dependencies on each other (see the sketch after this list).
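
As a rough illustration of that last point, and assuming the per-row work really is independent, the rows could be split into chunks and handed to a process pool (processes rather than threads, since pure-Python string work is serialized by the GIL). `run_pipeline` here is a stand-in for the real 24-step chain:

```python
from concurrent.futures import ProcessPoolExecutor

import pandas as pd

def run_pipeline(json_str: str) -> str:
    # Stand-in for parsing the JSON and applying the 24 functions.
    return json_str.upper()

def process_chunk(chunk: pd.Series) -> pd.Series:
    return chunk.map(run_pipeline)

if __name__ == "__main__":
    df = pd.DataFrame({"payload": ['{"name": "alice"}', '{"name": "bob"}'] * 4})
    n_workers = 4
    # Round-robin split keeps chunk sizes balanced; sort_index restores row order.
    chunks = [df["payload"].iloc[i::n_workers] for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_chunk, chunks)
    df["payload"] = pd.concat(list(results)).sort_index()
    print(df["payload"].head())
```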

CodePudding user response:

You are asking for general advice, and this is not the best site for that, but I will nevertheless try to point a few things out.

  1. The ideas you have already considered are not going to help much: neither Cython, Numba, nor threading addresses the main problem, which is that the format of your data is not conducive to fast operations on it.

  2. I suggest that you first "unpack" the JSONs that you store in the column(s?) of your dataframe. Preferably, each field of the JSON (mandatory or optional; deal with empty values at this stage) ends up as a column of the dataframe. If there are nested dictionaries, you may want to consider splitting the dataframe (particularly if the 24 functions work separately on separate nested JSON dicts). Alternatively, you should strive to flatten the JSONs.
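
As a hedged illustration of this step, pd.json_normalize can do the flattening, including nested dicts; the column and field names below are invented:

```python
import json

import pandas as pd

df = pd.DataFrame({
    "payload": [
        '{"id": 1, "name": "alice", "address": {"city": "Oslo"}}',
        '{"id": 2, "name": "bob", "address": {"city": "Lima"}}',
    ]
})

# Parse each JSON string once, then flatten nested dicts into
# dotted column names (e.g. "address.city").
parsed = df["payload"].map(json.loads)
flat = pd.json_normalize(parsed.tolist())
print(flat.columns.tolist())  # ['id', 'name', 'address.city']
```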

  3. Convert to the data format that gives you the best performance. JSON stores all data as text, while numbers are best processed in their binary format. You can do this column-wise on the columns you suspect should be converted, using df['col'].astype(...) (which works on the whole dataframe too).
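
For example (column names again invented), numeric text can be converted with astype or pd.to_numeric, and low-cardinality strings often shrink as categoricals:

```python
import pandas as pd

flat = pd.DataFrame({
    "id": ["1", "2"],          # numbers that arrived as text
    "score": ["3.5", None],    # optional numeric field with missing values
    "city": ["Oslo", "Lima"],  # repeated short strings
})

flat["id"] = flat["id"].astype("int64")
flat["score"] = pd.to_numeric(flat["score"], errors="coerce")  # missing -> NaN
flat["city"] = flat["city"].astype("category")  # cheap for low-cardinality text

print(flat.dtypes)
```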

  4. Update the 24 functions to operate not on JSON strings stored in the dataframe but on the columns/fields of the dataframe.
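
A sketch of what such a rewritten function might look like; edit_distance is a placeholder for whatever string-distance routine is actually used:

```python
import pandas as pd

def edit_distance(a: str, b: str) -> int:
    # Placeholder metric; the real code would call its string-distance package.
    return abs(len(a) - len(b))

def add_name_distance(df: pd.DataFrame) -> pd.DataFrame:
    # Works directly on two columns and adds a new one; no json.loads per row.
    df["name_distance"] = [
        edit_distance(a, b) for a, b in zip(df["name"], df["alt_name"])
    ]
    return df

df = pd.DataFrame({"name": ["alice", "bob"], "alt_name": ["alicia", "rob"]})
print(add_name_distance(df))
```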

  5. Recombine the JSONs for storage (I assume you need them in this format). At this stage, the implicit conversion from numbers back to strings will occur.
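
One way to do the recombination (again with invented column names) is to serialize the frame back to one JSON string per row, so the numeric-to-text conversion happens only at this final step:

```python
import pandas as pd

flat = pd.DataFrame({
    "id": [1, 2],
    "name": ["alice", "bob"],
    "name_distance": [1, 2],
})

# One JSON string per row, with keys taken from the column names.
flat["payload"] = flat.to_json(orient="records", lines=True).splitlines()
print(flat["payload"].iloc[0])  # {"id":1,"name":"alice","name_distance":1}
```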

Given the level of detail you provided in the question, the suggestions are necessarily brief. Should you have more detailed questions on any of the above points, it would be best to ask a maximally simple question about each of them (preferably containing a self-sufficient MWE).
