Faster PyRanges Generation in Python

I have a list of starting coordinates all on the same chromosome for features of fixed size and I am trying to generate a PyRanges object.

I timed the generation of the PyRanges object on a list of 125 coordinates and it took around 3.5 ms. This seemed slower than expected (it is my first time using this library), so I measured the same process on lists of different sizes.

These are the results of the performance tests:

N = 1: 3.03ms
N = 10: 2.96ms
N = 100: 3.33ms
N = 125: 3.24ms
N = 200: 3.11ms
N = 500: 3.12ms
N = 10000: 6.86ms
N = 100000: 32.6ms

It looks like there is a baseline cost for creating a PyRanges object (even with N = 1 it takes about 3 ms), and while the time does depend on the number of features, the dependence is weak. Indeed, creating a PyRanges object of 10000 items takes only about 2x as long as creating one with just 10.
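A rough way to put numbers on this is to fit a straight line, time = overhead + cost_per_item * N, to the timings listed above. The following is a minimal sketch in plain Python; the linear model is an assumption, and the helper name `fit_line` is made up for illustration.

```python
# Timings reported above: N features -> milliseconds per PyRanges creation.
timings_ms = {
    1: 3.03, 10: 2.96, 100: 3.33, 125: 3.24,
    200: 3.11, 500: 3.12, 10000: 6.86, 100000: 32.6,
}


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


overhead_ms, per_item_ms = fit_line(list(timings_ms), list(timings_ms.values()))
print(f"fixed overhead: {overhead_ms:.2f} ms")
print(f"cost per item:  {per_item_ms * 1000:.3f} us")
```

With these numbers the fit gives roughly 3 ms of fixed overhead and a few tenths of a microsecond per feature, so the fixed cost dominates until N reaches the order of ten thousand.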

This is the code I'm using:

import random

import numpy as np
import pyranges as pr

chr = "chrX"
size = 10
N = 1
points = np.array([random.randint(0, 1000000) for i in range(N)])

genomic_range = pr.PyRanges(
    chromosomes=chr,
    starts=points,
    ends=points + size - 1,
)

Am I doing something wrong? Why does generating a PyRanges object take this long even for a few items?

CodePudding user response:

In short, PyRanges uses Pandas internally, which is extremely slow for small inputs.

Your function call goes into _init, which calls create_pyranges_df, which in turn executes the following line:

chromosomes = pd.Series([chromosomes] * len(starts), dtype="category")

This line takes 0.25 ms on my machine, which is extremely slow for such a small input (I would expect it to be at least 100 times faster). The dtype="category" seems to be the reason it is slow.
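To check this on your own machine, the sketch below times the categorical construction against two alternatives: the same list without dtype="category", and pd.Categorical.from_codes, which builds the categorical directly from integer codes. The chromosome name, the value of n, and the choice of alternatives are assumptions for illustration, not what PyRanges does internally; timings will vary with the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

chrom = "chrX"   # illustrative chromosome name
n = 125          # illustrative number of features


def as_category_series():
    # The construction quoted above.
    return pd.Series([chrom] * n, dtype="category")


def as_object_series():
    # Same values, without the categorical conversion.
    return pd.Series([chrom] * n)


def from_codes():
    # Build the categorical from integer codes: every row points at category 0.
    codes = np.zeros(n, dtype=np.int8)
    return pd.Series(pd.Categorical.from_codes(codes, categories=[chrom]))


for func in (as_category_series, as_object_series, from_codes):
    seconds = timeit.timeit(func, number=1000) / 1000
    print(f"{func.__name__}: {seconds * 1e3:.3f} ms per call")
```

If the categorical conversion is indeed the expensive step, the first function should come out noticeably slower than the other two.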

The slowest part of the code is located here:

for s in columns:
    if isinstance(s, pd.Series):
        s = pd.Series(s.values, index=idx)  # This line is executed several times
    else:
        s = pd.Series(s, index=idx)

    series_to_concat.append(s)

With 10 columns and ~0.2 ms to create each pd.Series object, ~2 ms is spent doing almost nothing...
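The per-Series cost can be measured in isolation with a sketch like the one below. The helper name `series_cost_ms`, the use of pd.RangeIndex for the index, and the tested sizes are assumptions for illustration; the actual index built by PyRanges may differ.

```python
import timeit

import numpy as np
import pandas as pd


def series_cost_ms(n, repeats=1000):
    """Average time in ms to build one pd.Series of length n with an explicit index."""
    values = np.arange(n)
    idx = pd.RangeIndex(n)
    total = timeit.timeit(lambda: pd.Series(values, index=idx), number=repeats)
    return total / repeats * 1e3


n_columns = 10  # column count quoted above
for n in (1, 100, 10000):
    per_series = series_cost_ms(n)
    print(f"N={n}: {per_series:.3f} ms per Series, "
          f"~{per_series * n_columns:.2f} ms for {n_columns} columns")
```

If the fixed cost dominates, the per-Series time should change little between N=1 and N=10000.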