I'm trying out Bodo (https://docs.bodo.ai/2022.5/) to speed up certain pandas operations, starting with pd.read_csv(...). Bodo requires the Bodo-compatible pandas code to live in its own function, separate from any non-Bodo-compatible code. For example, this is my code:
With Bodo:
import bodo
import pandas as pd

@bodo.jit
def loadDataFileWithJIT(filePath):
    df = pd.read_csv(filePath, header=0, sep="\t",
                     names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                     usecols=[0, 1, 3, 4, 12],
                     dtype={"patid": "str", "eventdate": "str", "prodcode": "str",
                            "consid": "str", "issueseq": "str"},
                     low_memory=False)
    return df
Over 5 files I see these times:
- 14.24 <--- first time, so this is when JIT compiles
- 9.67
- 10.72
- 9.51
- 9.42
Without Bodo (the @bodo.jit decorator and the import bodo statement have been removed; nothing else has changed):
- 4.66
- 4.68
- 4.59
- 4.61
- 4.60
Each file is approximately 170MB. I would appreciate people's thoughts. Many thanks.
UPDATE
Having spoken with the authors of Bodo, I've learned that I need to run Python via mpiexec -n # (where # is a number of cores greater than 1) to see a speed-up.
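For example, assuming the code above is saved in a script called loadData.py (a hypothetical name) and 4 cores are available, the launch would look like:

mpiexec -n 4 python loadData.py

Every one of the 4 MPI processes then runs the same script, and Bodo splits the pd.read_csv work across them.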
CodePudding user response:
TLDR: speeding up I/O operations requires parallelism. You'd need to use mpiexec with more than one process.
Bodo currently reuses pandas' read_csv under the hood to ensure full compatibility. JIT compilation enables parallelism, but it doesn't improve anything on a single core (and in fact adds some overhead, which is what you are observing).
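As a quick sanity check, a small sketch like the one below (assuming bodo.get_rank()/bodo.get_size(), which Bodo exposes for querying the MPI layout) prints how many processes a run actually has; a plain python invocation reports 1:

import bodo

# Number of MPI processes Bodo is running on: 1 under plain `python script.py`,
# equal to the -n value under `mpiexec -n <n> python script.py`.
if bodo.get_rank() == 0:
    print("Bodo is running on", bodo.get_size(), "process(es)")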
You can use ipyparallel to launch and manage the Bodo/MPI worker processes from a single controlling process: https://github.com/ipython/ipyparallel
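A rough sketch of that pattern, assuming ipyparallel >= 7 with MPI engines available (the engine count, file path, and function names below are illustrative, not from this thread):

import ipyparallel as ipp

# Start 4 MPI-backed engines and connect to them from this single controlling process.
cluster = ipp.Cluster(engines="mpi", n=4)
rc = cluster.start_and_connect_sync()
view = rc.broadcast_view()

def load_on_engines(file_path):
    # Runs on every engine; Bodo splits the CSV read across the MPI ranks.
    import bodo
    import pandas as pd

    @bodo.jit
    def load_data_file(path):
        df = pd.read_csv(path, header=0, sep="\t",
                         names=["patid", "eventdate", "prodcode", "consid", "issueseq"],
                         usecols=[0, 1, 3, 4, 12],
                         dtype={"patid": "str", "eventdate": "str", "prodcode": "str",
                                "consid": "str", "issueseq": "str"})
        return df

    # Return only the shape so the full DataFrame stays on the engines.
    return load_data_file(file_path).shape

print(view.apply_sync(load_on_engines, "events.tsv"))  # placeholder path
cluster.stop_cluster_sync()

Returning just the shape avoids shipping the ~170MB result back to the controlling process.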
Bodo Slack discussion: https://bodocommunity.slack.com/archives/C01KRTQ1KDY/p1661704632557289