A single Python script involving np.linalg.eig is inexplicably taking 128 CPUs?


Note: The problem seems to be related to np.linalg.eig, np.linalg.eigsh, and scipy.sparse.linalg.eigsh. For scripts that do not involve these functions, everything on the AWS box works as expected.

The most basic script I have found with the problem is:

import numpy as np

num_iter = 10  # any small number of iterations shows the problem

for i in range(num_iter):
    x = np.linalg.eig(np.random.rand(1000, 1000))
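A quick way to confirm where the threads come from (using the third-party threadpoolctl package, which is not mentioned in the original post) is to inspect the native BLAS/LAPACK thread pools that numpy links against; on a large box they typically default to one thread per visible CPU:

import numpy as np  # imported first so its BLAS libraries are loaded
from threadpoolctl import threadpool_info

# Prints one entry per native thread pool (e.g. OpenBLAS or MKL);
# num_threads usually defaults to the number of CPUs on the machine.
for pool in threadpool_info():
    print(pool["user_api"], pool["internal_api"], pool["num_threads"])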

I'm having a very bizarre error on AWS where a basic Python script that calculates eigenvalues uses 100% of all 64 cores (and goes no faster because of it).

Objective: Run computationally intensive Python code. The code is a parallel for loop where each iteration is independent. I have two versions of this code: a basic version without multiprocessing, and one using the multiprocessing module.

Problem: The virtual machine is a c6i.32xlarge box on AWS with 64 cores/128 threads.

  • On my personal machine, the parallelized code using 6 cores is roughly 6 times faster. Using more than 1 core with the same code on the AWS box makes the runtime slower.

Inexplicable Part:

  • I tried to get around this by launching multiple copies of the basic script in the background with &, and this doesn't work either. Running n copies makes them all slower by a factor of 1/n. Inexplicably, a single instance of the Python script uses all the cores of the machine: the Unix top command reports 6400% CPU usage (i.e. all of them), and AWS CPU usage monitoring confirms 100% usage of the machine. I don't see how this is possible given the GIL.

Partial solution? Specifying the processor fixed the issue somewhat:

  • Running the commands taskset --cpu-list i python my_python_script.py & for i from 1 to n, the copies do indeed run in parallel, and the runtime is independent of n (for small n). The CPU usage statistics on the AWS monitor are what you would expect. The speed of a single script pinned to one processor was the same as when it ran unpinned and appeared to take all the cores of the machine.

Note: The fact that the runtime on one processor is the same suggests the script was really running on one core all along, and the others were somehow being used erroneously.
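The same pinning can be done from inside Python rather than through taskset (a minimal sketch, not from the original post; Linux-only, since os.sched_setaffinity is not available elsewhere):

import os

# Pin the current process (pid 0 = self) to CPU 0 before numpy loads,
# mirroring `taskset --cpu-list 0`; all BLAS threads then share that CPU.
os.sched_setaffinity(0, {0})

import numpy as np  # imported after pinning

x = np.linalg.eig(np.random.rand(1000, 1000))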

Question:

Why is my basic Python script taking all 64 cores of the AWS machine while going no faster? How is this error even possible? And how can I get it to run simply with multiprocessing, without this weird taskset --cpu-list workaround?

I had the exact same problem on Google Cloud Platform as well.

The basic script is very simple:

from my_module import my_np_and_scipy_function
from my_other_module import input_function

if __name__ == "__main__":
    # num_iter and kwds are defined elsewhere; the function names are
    # placeholders for numpy/scipy-heavy code.
    output = []
    for i in range(num_iter):
        result = my_np_and_scipy_function(kwds, param=input_function)
        output.extend(result)

With multiprocessing, it is:

import multiprocessing

from my_module import my_np_and_scipy_function
from my_other_module import input_function

if __name__ == "__main__":
    pool = multiprocessing.Pool(cpu_count)  # cpu_count defined elsewhere
    results = []
    for i in range(num_iter):
        # "..." stands for the other keyword arguments elided here
        result = pool.apply_async(
            my_np_and_scipy_function, kwds={"param": input_function, ...}
        )
        results.append(result)
    pool.close()

    output = []
    for x in results:
        output.extend(x.get())

CodePudding user response:

NumPy uses multiprocessing in some functions, so this is possible. You can see for yourself here: https://github.com/numpy/numpy/search?q=multiprocessing

CodePudding user response:

Following the answers in the post Limit number of threads in numpy, the numpy eig functions and the scripts work properly when the following lines are put at the top of the script, before numpy is imported:

import os

# These must be set before numpy/scipy are imported anywhere in the process.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"