How to use pyarrow parquet with multiprocessing


I want to read multiple HDFS files simultaneously using pyarrow and multiprocessing. The simple Python script works (see below), but if I try to do the same thing with multiprocessing, it hangs indefinitely.

My only guess is that the environment is somehow different, but all the environment variables should be the same in the child process and the parent process.

I've tried to debug this with print() statements and by limiting the pool to a single worker. To my surprise, it fails even with only one worker.

So, what can be the possible causes? How would I debug this?

Code:

import pyarrow.parquet as pq

def read_pq(file):
  table = pq.read_table(file)
  return table

##### this works #####
table = read_pq('hdfs://myns/mydata/000000_0')

###### this doesn't work #####
from multiprocessing import Pool

result_async = []
with Pool(1) as pool:
  result_async.append( pool.apply_async(pq.read_table, args = ('hdfs://myns/mydata/000000_0',)) )
  results = [r.get() for r in result_async]  ###### hangs here indefinitely, no exceptions raised
  print(results)    ###### expecting to get List[pq.Table]

#########################

CodePudding user response:

Have you tried importing pq inside a user-defined function, so that any per-process initialization required by the library can happen in each process in the pool?

def read_pq(file):
  import pyarrow.parquet as pq
  table = pq.read_table(file)
  return table


###### this doesn't work #####
from multiprocessing import Pool

result_async = []
with Pool(1) as pool:
  result_async.append( pool.apply_async(read_pq, args = ('hdfs://myns/mydata/000000_0',)) )
  results = [r.get() for r in result_async]  ###### hangs here indefinitely, no exceptions raised
  print(results)    ###### expecting to get List[pq.Table]

#########################

CodePudding user response:

The problem was due to my lack of experience with multiprocessing.

Solution is to add:

from multiprocessing import set_start_method
set_start_method("spawn")

The solution and the reason are exactly what https://pythonspeed.com/articles/python-multiprocessing/ describes: the logging state got forked into the child process and caused a deadlock.

Furthermore, although I had only "Pool(1)", I in fact had the parent process plus one child process, so there were still two processes and the deadlock could occur.
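
Putting the pieces together, a minimal sketch of how the fix might fit into the original script (the if __name__ == "__main__" guard and calling read_pq in the pool are my additions, not part of the original post; the path is the one from the question):

import pyarrow.parquet as pq
from multiprocessing import Pool, set_start_method

def read_pq(file):
  table = pq.read_table(file)
  return table

if __name__ == "__main__":
  # "spawn" starts each worker with a fresh interpreter instead of
  # forking the parent, so locks held by parent threads (e.g. the
  # logging machinery) are not inherited in a locked state.
  set_start_method("spawn")

  result_async = []
  with Pool(1) as pool:
    result_async.append(pool.apply_async(read_pq, args=('hdfs://myns/mydata/000000_0',)))
    results = [r.get() for r in result_async]
    print(results)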
