Home > front end >  Extract subsequences from main dataframe based on the locations in another dataframe
Extract subsequences from main dataframe based on the locations in another dataframe

Time:09-26

I want to extract the subsequences indicated by the first and last locations in data frame 'B'. The algorithm that I came up with is:

  1. Identify the rows of B that fall in the locations of A
  2. Find the relative position of the locations (i.e. shift the locations to make them start from 0)
  3. Start a for loop using the relative position as a range to extract the subsequences.

The issue with the above algorithm is runtime. I require an alternative approach to compile the code faster than the existing one.

Desired output:

first last sequences
  3   5  ACA
  8  12  CGGAG
105 111  ACCCCAA
115 117  TGT

Used data frames:

import pandas as pd
A = pd.DataFrame({'first.sequence': ['AAACACCCGGAG','ACCACACCCCAAATGTGT'
                   ],'first':[1,100], 'last':[12,117]})

B = pd.DataFrame({'first': [3,8,105,115], 'last':[5,12,111,117]})

CodePudding user response:

One solution could be as follows:

out = pd.merge_asof(B, A, on=['last'], direction='forward',
                    suffixes=('','_y'))

out.loc[:,['first','last']] = \
    out.loc[:,['first','last']].sub(out.first_y, axis=0)

out = out.assign(sequences=out.apply(lambda row: 
          row['first.sequence'][row['first']:row['last'] 1], 
          axis=1)).drop(['first.sequence','first_y'], axis=1)

out.update(B)
print(out)

   first  last sequences
0      3     5       ACA
1      8    12     CGGAG
2    105   111   ACCCCAA
3    115   117       TGT

Explanation

  • First, use df.merge_asof to match first values from B with first values from A. I.e. 3, 8 will match with 1, and 105, 115 will match with 100. Now we know which string (sequence) needs splitting and we also know where the string starts, e.g. at index 1 or 100 instead of a normal 0.
  • We use this last bit of information to find out where the string slice should start and end. So, we do out.loc[:,['first','last']].sub(out.first_y, axis=0). E.g. we "reset" 3 to 2 (minus 1) and 105 to 5 (minus 100).
  • Now, we can use df.apply to get the string slices for each sequence, essentially looping over each row. (if your slices would have started and ended at the same indices, we could have used Series.str.slice instead.
  • Finally, we assign the result to out (as col sequences), drop the cols we no longer need, and we use df.update to "reset" the columns first and last.
  • Related