Pandas random n samples of consecutive rows / pairs

I have a pandas DataFrame indexed by ID and sorted by value. I want to draw a sample of n=20000 pairs, i.e. 40000 rows in total, where the 2 rows of each pair are consecutive in the DataFrame. I then want to perform additional calculations on each of these pairs of consecutive rows.

E.g. with a sample size of n=2, I want to randomly pick 2 pairs and find the difference in distance within each pair, as in the following picks.

Additional condition: the value difference within a pair can't exceed 4000.

index       value   distance
cg13869341  15865   1.635450
cg14008030  18827   4.161332

Then the difference in distance of the following pair, etc.

cg20826792  29425   0.657369
cg33045430  29407   1.708055

Sample original dataframe

index       value   distance
cg13869341  15865   1.635450
cg14008030  18827   4.161332
cg12045430  29407   0.708055
cg20826792  29425   0.657369
cg33045430  69407   1.708055
cg40826792  59425   0.857369
cg47454306  88407   0.708055
cg60826792  96425   2.857369

I tried using df_sample = df.sample(n=20000), but then I got a bit lost trying to figure out how to get the next row for each row in df_sample.

The original shape is (480136, 14).

CodePudding user response：

If it doesn't matter that the pairs are always (odd, even) rows, counting rows from 1 (which decreases randomness a bit), you can select N odd rows and take the next (even) row for each:

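N = 20000
# get the indices of N random ODD rows (1st, 3rd, ...: positions 0, 2, 4, ...)
idx = df.loc[::2].sample(n=N).index

# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)

# select those OR the next ones
# (fill_value=False keeps the shifted mask boolean, with no NaN in the first row)
df_sample = df.loc[m | m.shift(fill_value=False)]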

Example output on the toy DataFrame (N=3):

        index  value  distance
2  cg12045430  29407  0.708055
3  cg20826792  29425  0.657369
4  cg33045430  69407  1.708055
5  cg40826792  59425  0.857369
6  cg47454306  88407  0.708055
7  cg60826792  96425  2.857369
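To get from the selected rows to one record per pair, with the difference in distance and the 4000 limit on the value difference from the question, something like the following could work. This is only a sketch on the toy data: `pair_table` and its `max_value_diff` parameter are hypothetical names, and it assumes the pairs are the fixed (odd, even) rows used above and that filtering happens before sampling, so that the sample size stays exact.

```python
import pandas as pd

# Toy data copied from the question (default integer index).
df = pd.DataFrame({
    "index": ["cg13869341", "cg14008030", "cg12045430", "cg20826792",
              "cg33045430", "cg40826792", "cg47454306", "cg60826792"],
    "value": [15865, 18827, 29407, 29425, 69407, 59425, 88407, 96425],
    "distance": [1.635450, 4.161332, 0.708055, 0.657369,
                 1.708055, 0.857369, 0.708055, 2.857369],
})


def pair_table(df, max_value_diff=4000):
    """Hypothetical helper: one row per (position 2k, position 2k+1) pair."""
    first = df.iloc[0::2].reset_index(drop=True)
    second = df.iloc[1::2].reset_index(drop=True)
    n = min(len(first), len(second))  # ignore an unpaired trailing row
    first, second = first.iloc[:n], second.iloc[:n]
    pairs = pd.DataFrame({
        "first": first["index"],
        "second": second["index"],
        "value_diff": (second["value"] - first["value"]).abs(),
        "distance_diff": second["distance"] - first["distance"],
    })
    # Keep only the pairs that satisfy the constraint from the question.
    return pairs[pairs["value_diff"] <= max_value_diff]


pairs = pair_table(df)
# Sample whole pairs; random_state is only here to make the run repeatable.
sampled = pairs.sample(n=2, random_state=0)
print(sampled)
```

Sampling from the table of pairs rather than from the rows means every sampled record already satisfies the constraint.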

Increasing randomness

The drawback of the above approach is that the pairs are always (odd, even). To overcome this, we can first remove a random fraction of the rows before picking every other row. The fraction should be small enough to still leave plenty of rows to pick from, but large enough to shift the pairing from (odd, even) to (even, odd) at many locations. The right fraction depends on the initial size and the sample size and should be tested. I used 20-30% here:

N = 20000
frac = 0.2

idx = (df
   .drop(df.sample(frac=frac).index)
   .loc[::2].sample(n=N)
   .index
 )

m = df.index.to_series().isin(idx)
# fill_value=False keeps the shifted mask boolean, with no NaN in the first row
df_sample = df.loc[m | m.shift(fill_value=False)]

# check:
# len(df_sample)
# 40000
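The check above can be tried end to end on synthetic data. The sketch below is an assumption-laden stand-in, not the real dataset: the 1000 random rows, the seeds and the small N are made up for illustration, and the last row is excluded from the candidates so that every pick has a following row. It also shows one way to reshape the result into per-pair differences; the 4000 constraint is not applied here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 1000 rows sorted by value.
df = pd.DataFrame({
    "value": np.sort(rng.integers(0, 1_000_000, size=1000)),
    "distance": rng.random(1000),
})

N = 100
frac = 0.2

# Candidates exclude the last row, which has no following row.
candidates = df.iloc[:-1]
kept = candidates.drop(candidates.sample(frac=frac, random_state=0).index)

# Every other remaining row: two picks are never adjacent in df,
# because a kept row always sits between them.
idx = kept.iloc[::2].sample(n=N, random_state=1).index

m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift(fill_value=False)]

# Rows come out as (first, next, first, next, ...), so reshape into pairs.
starts = df_sample.index.to_numpy()[0::2]
dist = df_sample["distance"].to_numpy().reshape(-1, 2)
distance_diff = dist[:, 1] - dist[:, 0]

print(len(df_sample))               # number of selected rows
print((starts % 2).mean())          # share of pairs starting at an odd position
```

The printed share gives a rough idea of how well a given `frac` mixes the two parities; it may be worth trying a few values on the real data.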

CodePudding user response：

Here's my first attempt. I only just noticed your additional constraint, and I'm not sure whether you need the precise number of samples; if you do, you'll have to do some fudging after the line c = c[mask] below.

import random

import numpy as np

# Temporarily reset index so we can have something that we can add one to.
df = df.reset_index(level=0)

# Choose the first index of each pair (4 pairs here; use your own sample size).
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
c = np.array(random.sample(sorted(df.index[:-1]), 4))

# Filter out those indices where the diff with the next row down is large,
# i.e. keep only the pairs whose value difference is at most 4000.
mask = np.array([abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000 for i in c])
c = c[mask]

# Interleave this array with the same numbers, plus 1.
pair_indices = np.empty((c.size * 2,), dtype=c.dtype)
pair_indices[0::2] = c
pair_indices[1::2] = c + 1

# Filter
df_sample = df[df.index.isin(pair_indices)]

# Restore original index if required.
df = df.set_index("index")

Hope that helps. Regarding the bit where I use a mask to filter c, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
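As one possible faster alternative, the constraint can be checked for all rows at once with np.diff, and the first rows of the pairs can then be sampled only from the positions that pass. This is a sketch on the toy data from the question, assuming a default integer index; the names `ok` and `valid_starts` are made up for illustration, and overlapping pairs are still allowed.

```python
import numpy as np
import pandas as pd

# Toy data copied from the question.
df = pd.DataFrame({
    "value": [15865, 18827, 29407, 29425, 69407, 59425, 88407, 96425],
    "distance": [1.635450, 4.161332, 0.708055, 0.657369,
                 1.708055, 0.857369, 0.708055, 2.857369],
})

values = df["value"].to_numpy()
distance = df["distance"].to_numpy()

# ok[i] is True when rows i and i+1 differ by at most 4000 in value.
ok = np.abs(np.diff(values)) <= 4000
valid_starts = np.flatnonzero(ok)

# Sample the first row of each pair from the valid positions only,
# so the requested number of pairs is met exactly (if enough exist).
rng = np.random.default_rng(0)
n_pairs = 2
first = rng.choice(valid_starts, size=n_pairs, replace=False)

pairs = pd.DataFrame({
    "first_pos": first,
    "second_pos": first + 1,
    "distance_diff": distance[first + 1] - distance[first],
})
print(pairs)
```

Because the filter runs before sampling, no fudging of the sample size is needed afterwards.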