Home > Software design >  ValueError with pandas apply function returning output of variable shape
ValueError with pandas apply function returning output of variable shape

Time:05-14

I have a pandas dataframe with three columns structured like this:

Sample    Start  End
<string>  <int>  <int>

The values in "Start" and "End" are intervals of positions on a larger string (e.g. from position 9000 to 11000). My goal is to subdivide the larger string into windows of 10000 positions, and count how many of those are contained in intervals from my dataframe.

For example, window 0:10000 would contain 1000 positions and window 10000:20000 would contain the other 1000 positions from interval 9000:11000.

To do this, I am first running a function to split these intervals into windows, such that if this is the input:

Sample    Start  End
A         2500   5000
A         9000   11000
A         18000  19500

Then this is the output:


Sample    Start  End    W_start  W_end
A         2500   5000   0        10000
A         9000   10000  0        10000
A         10000  11000  10000    20000
A         18000  19500  10000    20000

This is the function I'm doing it with, where df_sub is a line of the dataframe and w_size is the window size (10000):

def split_into_windows(df_sub, w_size):

    start, end = df_sub.Start, df_sub.End
    w_start = start - (start % w_size)
    w_end = w_start   w_size

    if (w_start <= start <= w_end) and (w_start <= end <= w_end):
        df_out = df_sub

    elif (w_start <= start <= w_end) and (end > w_end):
        out = []
        df_tmp = df_sub.copy()
        df_tmp.End = w_end
        out.append(df_tmp.copy())

        while (end > w_end):
            w_start  = w_size
            w_end  = w_size
            df_tmp.Start = max(start, w_start)
            df_tmp.End = min(end, w_end)
            out.append(df_tmp.copy())

        df_out = pd.DataFrame(out)

    return df_out

I'm calling the function with apply():

df = df.apply(split_into_windows, axis=1, args=(w_size,))

But I'm getting this error:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Looking online I found that this issue seems to be related with pandas merge but I am not using pandas merge. I believe it may be related to the fact that some lines produce a single output series, and some others produce a small dataframe (the split ones).

See here:

Sample       A
Start     6928
End       9422

  Sample  Start    End
0      A   9939  10000
1      A  10000  11090

Any tips on how to fix this?

Minimal dataset to reproduce: https://file.io/iZ3fguCFlRbq

EDIT #1:

I tried changing a line in the function to have a coherent output (i.e. returning dataframes only):

df_out = df_sub.to_frame().T

And now the apply() round "works", as in throws no errors, but the output looks like this:

0  Sample  Start   End
0  A       0       6915
1  Sample  Start   End
0  A       6928    9422
2  Sample  Start   End
0  A       9939    10000
...

<class 'pandas.core.series.Series'>

EDIT #2:

I cannot use .iterrows(), it takes too long (estimate: weeks) with the size of dataframe I'm operating with.

EDIT #3:

Using multiprocessing like this made me get through the day but it is still a suboptimal solution, compared to what I could achieve with a functioning apply() call and a pandas parallel application such as pandarallel or swifter. Still looking for any tip :)

pool = mp.Pool(processes=48)
q = mp.Manager().Queue()

start = time.time()
for index, row in df_test.iterrows():
    pool.apply_async(split_into_windows, args=(row, w_size, q))

pool.close()
pool.join()

out = []
while q.empty() == False:
    out.append(q.get())

df = pd.DataFrame(out)

CodePudding user response:

If I understand everything correctly, here is a possible solution:

import pandas as pd

window_step = 10000

# Get indices of the window for start and end (here, the end is inclusive).
df['start_loc'] = df['Start'] // window_step 
df['end_loc'] = (df['End']-1) // window_step

# Build the intervals for the W_start and W_end columns for each row.
intervals = [list((s*window_step, (s 1)*window_step) for s in range(r[0], r[1] 1))
            for r in zip(df['start_loc'], df['end_loc'])]

# Insert in df and explode the interval column to get extra rows.
df['interval'] = intervals
df = df.explode(column='interval')

# Split the interval in two columns.
df[['W_start', 'W_end']] =  pd.DataFrame(df['interval'].tolist(), index=df.index)

# Correct the starts and ends that are wrong because duplicated with explode.
wrong_ends = df['End'].to_numpy() > df['W_end'].to_numpy()
df.loc[wrong_ends, 'End'] = df.loc[wrong_ends, 'W_end']
wrong_starts = df['Start'].to_numpy() < df['W_start'].to_numpy()
df.loc[wrong_starts, 'Start'] = df.loc[wrong_starts, 'W_start']

df = df.drop(columns=['start_loc', 'end_loc', 'interval'])

print(df)
  Sample  Start    End  W_start  W_end
0      A   2500   5000        0  10000
1      A   9000  10000        0  10000
1      A  10000  11000    10000  20000
2      A  18000  19500    10000  20000

Then, from here, to calculate the number of positions included in each window you could do:

df['included_positions'] = df['End'] - df['Start']

sample_win_cnt = df.groupby(['Sample', 'W_start', 'W_end']).sum().drop(columns=['Start', 'End'])
print(sample_win_cnt)
                      included_positions
Sample W_start W_end                    
A      0       10000                3500
       10000   20000                2500

Here I grouped by 'Sample' as well. I am not sure this is what you want. If not, you can also just group by 'W_start' and 'W_end'.

Output with the other example:

Input:

  Sample  Start    End
0      A   9939  10000
1      A  10000  11090

Interval result:

  Sample  Start    End  W_start  W_end
0      A   9939  10000        0  10000
1      A  10000  11090    10000  20000

Counts:

                      included_positions
Sample W_start W_end                    
A      0       10000                  61
       10000   20000                1090

I tested it on a DataFrame with >1M rows and it seemed to calculate the results in less than a second.

CodePudding user response:

@user2246849 is perfect. I only think it's a little hard to follow when it comes to define intervals.

My suggestion here is to play with a row only to define a function that take a row and return the intervals. I mean given df you take x = df.iloc[1] and build a function which return [[0, 10_000], [10_000, 20_000]]

import pandas as pd


df = pd.DataFrame(
    {'Sample': {0: 'A', 1: 'A', 2: 'A'},
     'Start': {0: 2500, 1: 9000, 2: 18000},
     'End': {0: 5000, 1: 11000, 2: 19500}})


def get_intervals(x, window_step):
    out = [
        [i * window_step, 
        (i   1) * window_step] 
     for i in range(
         x["Start"] // window_step, 
        (x["End"] - 1) // window_step   1)]
    return out

And we assign intervals with an apply

df["intervals"] = df.apply(
    lambda x: get_intervals(x, window_step), axis=1)

which return

  Sample  Start    End                     intervals
0      A   2500   5000                  [[0, 10000]]
1      A   9000  11000  [[0, 10000], [10000, 20000]]
2      A  18000  19500              [[10000, 20000]]

From now on you can follow the other answer.

  • Related