Pandas read_csv with JIT Bodo is slower than regular python

I'm trying out Bodo (https://docs.bodo.ai/2022.5/) to speed up certain pandas operations, the first being pd.read_csv(...). Bodo requires the pandas code it compiles to live in its own function, separate from any code that isn't Bodo-compatible. For example, this is my code:

With Bodo:

import pandas as pd
import bodo

@bodo.jit
def loadDataFileWithJIT(filePath):
    df = pd.read_csv(filePath, header=0, sep="\t",
                     names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                     usecols=[0, 1, 3, 4, 12],
                     dtype={"patid": "str", "eventdate": "str", "prodcode": "str",
                            "consid": "str", "issueseq": "str"},
                     low_memory=False)
    return df

Over 5 files I see these times:

  • 14.24 <--- first run, which includes the JIT compilation
  • 9.67
  • 10.72
  • 9.51
  • 9.42

Without Bodo (the @bodo.jit decorator and the bodo import have been removed; nothing else has changed):

  • 4.66
  • 4.68
  • 4.59
  • 4.61
  • 4.60

Each file is approximately 170MB. I would appreciate people's thoughts. Many thanks.

UPDATE

Having spoken with the authors of Bodo, I've learned that I need to run the script with mpiexec -n # (where # is the number of cores, greater than 1) if I'm to see a speed-up.

CodePudding user response:

TLDR: speeding up I/O operations requires parallelism. You'd need to launch the script with mpiexec and more than one process.

Bodo currently reuses pandas read_csv under the hood to ensure full compatibility. JIT compilation enables parallelism, but doesn’t improve anything on a single core (and in fact has some overhead as you are observing).
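As a rough sketch of what that looks like (the script name, core count, and data path below are placeholders, not part of the original question):

# load_data.py -- launch with, e.g.:  mpiexec -n 4 python load_data.py
import pandas as pd
import bodo

@bodo.jit
def loadDataFileWithJIT(filePath):
    # Same read_csv call as in the question; Bodo splits the read across the MPI processes.
    df = pd.read_csv(filePath, header=0, sep="\t",
                     names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                     usecols=[0, 1, 3, 4, 12],
                     dtype={"patid": "str", "eventdate": "str", "prodcode": "str",
                            "consid": "str", "issueseq": "str"})
    return df

if __name__ == "__main__":
    # Every MPI process runs this script; the jitted call is parallelized across them.
    df = loadDataFileWithJIT("therapy_events.tsv")  # hypothetical path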

You can use ipyparallel to launch and manage Bodo/MPI processes from a single controlling process (e.g. a Jupyter notebook): https://github.com/ipython/ipyparallel
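For example, a minimal sketch of that approach (assuming ipyparallel 8+ and a working MPI installation; the engine count and file path are placeholders):

import ipyparallel as ipp

# Start four local MPI engines and connect a client to them.
cluster = ipp.Cluster(engines="mpi", n=4)
rc = cluster.start_and_connect_sync()

def load_file():
    import pandas as pd
    import bodo

    @bodo.jit
    def loadDataFileWithJIT(filePath):
        df = pd.read_csv(filePath, header=0, sep="\t",
                         names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                         usecols=[0, 1, 3, 4, 12],
                         dtype={"patid": "str", "eventdate": "str", "prodcode": "str",
                                "consid": "str", "issueseq": "str"})
        return len(df)

    return loadDataFileWithJIT("therapy_events.tsv")  # hypothetical path

# Run on all engines; each MPI process reads its share of the file.
print(rc[:].apply_sync(load_file))

cluster.stop_cluster_sync()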

Bodo Slack discussion: https://bodocommunity.slack.com/archives/C01KRTQ1KDY/p1661704632557289
