I have a list of starting coordinates, all on the same chromosome, for features of fixed size, and I am trying to generate a PyRanges object from them.
I timed the creation of the PyRanges object on a list of 125 coordinates and it took around 3.5 ms. This seemed slower than expected (it is my first time using this library), so I measured the same process on lists of different sizes.
These are the results of the performance tests:
N = 1: 3.03ms
N = 10: 2.96ms
N = 100: 3.33ms
N = 125: 3.24ms
N = 200: 3.11ms
N = 500: 3.12ms
N = 10000: 6.86ms
N = 100000: 32.6ms
There appears to be a fixed baseline cost for creating a PyRanges object (even with N = 1 it takes about 3 ms). Beyond that, the time does depend on the number of features, but only weakly: creating a PyRanges object with 10000 items takes only about 2.3x as long as creating one with just 10.
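To put rough numbers on that baseline, here is a minimal sketch that fits a straight line, t = fixed + per_item * N, to the timings listed above. The linear model and the helper name fit_line are assumptions made for illustration; they are not part of PyRanges.

```python
# Timings copied from the list above: (number of features, milliseconds).
timings = [
    (1, 3.03), (10, 2.96), (100, 3.33), (125, 3.24),
    (200, 3.11), (500, 3.12), (10_000, 6.86), (100_000, 32.6),
]


def fit_line(points):
    """Ordinary least-squares fit of t = fixed + per_item * n (hypothetical helper)."""
    count = len(points)
    mean_n = sum(n for n, _ in points) / count
    mean_t = sum(t for _, t in points) / count
    sxy = sum((n - mean_n) * (t - mean_t) for n, t in points)
    sxx = sum((n - mean_n) ** 2 for n, _ in points)
    per_item = sxy / sxx
    fixed = mean_t - per_item * mean_n
    return fixed, per_item


fixed_ms, per_item_ms = fit_line(timings)
print(f"fixed cost  ~ {fixed_ms:.2f} ms")
print(f"per feature ~ {per_item_ms * 1000:.3f} us")
```

On these eight measurements the fit comes out at roughly 3.2 ms of fixed cost and about 0.3 microseconds per feature, which is why the total barely moves below a few thousand items.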
This is the code I'm using:
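import random
import numpy as np
import pyranges as pr

chrom = "chrX"  # renamed from chr to avoid shadowing the built-in
size = 10
N = 1
points = np.array([random.randint(0, 1000000) for _ in range(N)])
genomic_range = pr.PyRanges(
    chromosomes=chrom,
    starts=points,
    ends=points + size - 1,
)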
Am I doing something wrong? Why does creating a PyRanges object take this long even for just a few items?
CodePudding user response:
In short, PyRanges uses Pandas internally, and Pandas has a high fixed overhead that makes it very slow on small inputs.
Your function call goes into _init, which calls create_pyranges_df, which in turn executes the following line:
chromosomes = pd.Series([chromosomes] * len(starts), dtype="category")
This line takes 0.25 ms on my machine, which is extremely slow for such a small input (I would expect it to be at least 100 times faster). The dtype="category" argument seems to be the reason it is slow.
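One way to check that claim on your own machine is to time the same construction with and without dtype="category". The sketch below does that, and also times a possible alternative that builds the categorical from integer codes with pd.Categorical.from_codes. The helper names, the repeat count and the from_codes variant are assumptions for illustration, not something PyRanges does; the actual numbers will vary with the machine and Pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def as_category(label, n):
    # Same construction as the line quoted above.
    return pd.Series([label] * n, dtype="category")


def as_object(label, n):
    # Same data without the categorical conversion.
    return pd.Series([label] * n)


def from_codes(label, n):
    # Hypothetical alternative: one category, every row points at code 0.
    codes = np.zeros(n, dtype=np.int8)
    return pd.Series(pd.Categorical.from_codes(codes, categories=[label]))


def mean_ms(func, n, repeats=200):
    """Average wall-clock time of func("chrX", n) in milliseconds."""
    total = timeit.timeit(lambda: func("chrX", n), number=repeats)
    return total / repeats * 1000


if __name__ == "__main__":
    for func in (as_category, as_object, from_codes):
        print(f"{func.__name__:12s} {mean_ms(func, 125):.3f} ms")
```

If as_category is clearly slower than the other two for N = 125, that supports the idea that the categorical conversion, not the data size, dominates this step.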
The slowest part of the code is located here:
for s in columns:
    if isinstance(s, pd.Series):
        s = pd.Series(s.values, index=idx)  # This line is executed several times
    else:
        s = pd.Series(s, index=idx)
    series_to_concat.append(s)
With 10 columns and ~0.2 ms to create each pd.Series object, this adds up to ~2 ms spent doing almost nothing.
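The per-column cost can be measured in isolation with a sketch like the one below, which wraps the same small array in a pd.Series several times, much as the loop above does. The use of pd.RangeIndex for idx, the helper names and the repeat count are assumptions; the real index built inside PyRanges may differ.

```python
import timeit

import numpy as np
import pandas as pd


def build_columns(n_columns, n_rows):
    """Wrap one array in n_columns pd.Series objects that share an index."""
    idx = pd.RangeIndex(n_rows)  # assumed stand-in for the library's idx
    values = np.arange(n_rows)
    return [pd.Series(values, index=idx) for _ in range(n_columns)]


def mean_ms(n_columns, n_rows, repeats=200):
    """Average time to build all columns once, in milliseconds."""
    total = timeit.timeit(lambda: build_columns(n_columns, n_rows), number=repeats)
    return total / repeats * 1000


if __name__ == "__main__":
    for n_rows in (1, 125, 10_000):
        print(f"{n_rows:>6} rows, 10 columns: {mean_ms(10, n_rows):.3f} ms")
```

If the printed time stays nearly flat across row counts, the cost is per pd.Series object rather than per element, which matches the flat timings in the question.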