I have a pandas dataframe with three columns structured like this:
Sample Start End
<string> <int> <int>
The values in "Start" and "End" are intervals of positions on a larger string (e.g. from position 9000 to 11000). My goal is to subdivide the larger string into windows of 10000 positions, and count how many of those are contained in intervals from my dataframe.
For example, window 0:10000 would contain 1000 positions and window 10000:20000 would contain the other 1000 positions from interval 9000:11000.
To do this, I am first running a function to split these intervals into windows, such that if this is the input:
Sample Start End
A 2500 5000
A 9000 11000
A 18000 19500
Then this is the output:
Sample Start End W_start W_end
A 2500 5000 0 10000
A 9000 10000 0 10000
A 10000 11000 10000 20000
A 18000 19500 10000 20000
This is the function I'm doing it with, where df_sub
is a line of the dataframe and w_size
is the window size (10000):
def split_into_windows(df_sub, w_size):
start, end = df_sub.Start, df_sub.End
w_start = start - (start % w_size)
w_end = w_start w_size
if (w_start <= start <= w_end) and (w_start <= end <= w_end):
df_out = df_sub
elif (w_start <= start <= w_end) and (end > w_end):
out = []
df_tmp = df_sub.copy()
df_tmp.End = w_end
out.append(df_tmp.copy())
while (end > w_end):
w_start = w_size
w_end = w_size
df_tmp.Start = max(start, w_start)
df_tmp.End = min(end, w_end)
out.append(df_tmp.copy())
df_out = pd.DataFrame(out)
return df_out
I'm calling the function with apply()
:
df = df.apply(split_into_windows, axis=1, args=(w_size,))
But I'm getting this error:
ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
Looking online I found that this issue seems to be related with pandas merge but I am not using pandas merge. I believe it may be related to the fact that some lines produce a single output series, and some others produce a small dataframe (the split ones).
See here:
Sample A
Start 6928
End 9422
Sample Start End
0 A 9939 10000
1 A 10000 11090
Any tips on how to fix this?
Minimal dataset to reproduce: https://file.io/iZ3fguCFlRbq
EDIT #1:
I tried changing a line in the function to have a coherent output (i.e. returning dataframes only):
df_out = df_sub.to_frame().T
And now the apply()
round "works", as in throws no errors, but the output looks like this:
0 Sample Start End
0 A 0 6915
1 Sample Start End
0 A 6928 9422
2 Sample Start End
0 A 9939 10000
...
<class 'pandas.core.series.Series'>
EDIT #2:
I cannot use .iterrows()
, it takes too long (estimate: weeks) with the size of dataframe I'm operating with.
EDIT #3:
Using multiprocessing
like this made me get through the day but it is still a suboptimal solution, compared to what I could achieve with a functioning apply()
call and a pandas parallel application such as pandarallel
or swifter
. Still looking for any tip :)
pool = mp.Pool(processes=48)
q = mp.Manager().Queue()
start = time.time()
for index, row in df_test.iterrows():
pool.apply_async(split_into_windows, args=(row, w_size, q))
pool.close()
pool.join()
out = []
while q.empty() == False:
out.append(q.get())
df = pd.DataFrame(out)
CodePudding user response:
If I understand everything correctly, here is a possible solution:
import pandas as pd
window_step = 10000
# Get indices of the window for start and end (here, the end is inclusive).
df['start_loc'] = df['Start'] // window_step
df['end_loc'] = (df['End']-1) // window_step
# Build the intervals for the W_start and W_end columns for each row.
intervals = [list((s*window_step, (s 1)*window_step) for s in range(r[0], r[1] 1))
for r in zip(df['start_loc'], df['end_loc'])]
# Insert in df and explode the interval column to get extra rows.
df['interval'] = intervals
df = df.explode(column='interval')
# Split the interval in two columns.
df[['W_start', 'W_end']] = pd.DataFrame(df['interval'].tolist(), index=df.index)
# Correct the starts and ends that are wrong because duplicated with explode.
wrong_ends = df['End'].to_numpy() > df['W_end'].to_numpy()
df.loc[wrong_ends, 'End'] = df.loc[wrong_ends, 'W_end']
wrong_starts = df['Start'].to_numpy() < df['W_start'].to_numpy()
df.loc[wrong_starts, 'Start'] = df.loc[wrong_starts, 'W_start']
df = df.drop(columns=['start_loc', 'end_loc', 'interval'])
print(df)
Sample Start End W_start W_end
0 A 2500 5000 0 10000
1 A 9000 10000 0 10000
1 A 10000 11000 10000 20000
2 A 18000 19500 10000 20000
Then, from here, to calculate the number of positions included in each window you could do:
df['included_positions'] = df['End'] - df['Start']
sample_win_cnt = df.groupby(['Sample', 'W_start', 'W_end']).sum().drop(columns=['Start', 'End'])
print(sample_win_cnt)
included_positions
Sample W_start W_end
A 0 10000 3500
10000 20000 2500
Here I grouped by 'Sample'
as well. I am not sure this is what you want. If not, you can also just group by 'W_start'
and 'W_end'
.
Output with the other example:
Input:
Sample Start End
0 A 9939 10000
1 A 10000 11090
Interval result:
Sample Start End W_start W_end
0 A 9939 10000 0 10000
1 A 10000 11090 10000 20000
Counts:
included_positions
Sample W_start W_end
A 0 10000 61
10000 20000 1090
I tested it on a DataFrame with >1M rows and it seemed to calculate the results in less than a second.
CodePudding user response:
@user2246849 is perfect. I only think it's a little hard to follow when it comes to define intervals.
My suggestion here is to play with a row only to define a function that take a row and return the intervals. I mean given df
you take x = df.iloc[1]
and build a function which return [[0, 10_000], [10_000, 20_000]]
import pandas as pd
df = pd.DataFrame(
{'Sample': {0: 'A', 1: 'A', 2: 'A'},
'Start': {0: 2500, 1: 9000, 2: 18000},
'End': {0: 5000, 1: 11000, 2: 19500}})
def get_intervals(x, window_step):
out = [
[i * window_step,
(i 1) * window_step]
for i in range(
x["Start"] // window_step,
(x["End"] - 1) // window_step 1)]
return out
And we assign intervals with an apply
df["intervals"] = df.apply(
lambda x: get_intervals(x, window_step), axis=1)
which return
Sample Start End intervals
0 A 2500 5000 [[0, 10000]]
1 A 9000 11000 [[0, 10000], [10000, 20000]]
2 A 18000 19500 [[10000, 20000]]
From now on you can follow the other answer.