How to find a group midpoint of a pandas or python array?


There is an array that looks like this (it's actually a column in a pandas DataFrame, but any suggestion for how to do it in plain Python would also work):

[0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0]

For each subsequence of 1s I need to find a midpoint position: the index of the point in the middle of that subsequence, or the one closest to it. So for the example above, these would be 6 for the first subsequence, 18 for the second, etc.

It can easily be done with naive looping, but I wonder if there is a more efficient way (maybe a built-in pandas function?).
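
For reference, a minimal sketch of what such a naive loop might look like (the function name is only illustrative, not part of the question):

def run_midpoints(data):
    # Scan once, remembering where each run of 1s starts, and emit its midpoint.
    midpoints = []
    run_start = None
    for i, value in enumerate(data):
        if value == 1 and run_start is None:
            run_start = i                                   # a new run of 1s begins
        elif value != 1 and run_start is not None:
            midpoints.append((run_start + i - 1) // 2)      # midpoint of the run that just ended
            run_start = None
    if run_start is not None:                               # a run that reaches the end of the data
        midpoints.append((run_start + len(data) - 1) // 2)
    return midpoints

print(run_midpoints([0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0]))
# [6, 18, 25]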

CodePudding user response:

For a pure Python solution, you can use itertools.groupby, which by default groups consecutive equal elements together.

import itertools

data = [0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0]

start = 0
midpoints = []
for key, group in itertools.groupby(data):
    group = list(group)
    if key == 1:
        midpoints.append(start + (len(group) // 2))
    start += len(group)

print(midpoints)
[6, 18, 25]

For a pandas solution, we first filter the data, then use a groupby trick to form groups of consecutive values, akin to itertools.groupby, and finally get the size and start position of each group. From there, we simply add half of the size to the start position of the group to get the approximate midpoint.

import pandas as pd

s = pd.Series([0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0])

midpoints = (
    s.loc[lambda s: s.eq(1)]
    .groupby(s.ne(s.shift()).cumsum())
    .agg(['idxmin', 'size'])
    .eval('size // 2 + idxmin')
)

print(midpoints)
2     6
4    18
6    25
dtype: int64
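
The group labels 2, 4 and 6 in that output come from the run-id key s.ne(s.shift()).cumsum(), which starts a new number every time the value changes. A small side illustration of just that key (a sketch, not part of the answer above):

import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, 0, 1])
run_id = s.ne(s.shift()).cumsum()   # increments at every change of value
print(pd.DataFrame({'value': s, 'run_id': run_id}))
#    value  run_id
# 0      0       1
# 1      0       1
# 2      1       2
# 3      1       2
# 4      1       2
# 5      0       3
# 6      1       4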

CodePudding user response:

Try with groupby:

  1. Use the series (i.e. column) index to group consecutive runs of 0s and 1s with srs.ne(srs.shift()).cumsum()
  2. Get the average of the first and last indices for each sequence
  3. Keep only the unique values where the original column value is 1

import pandas as pd

srs = pd.Series([0,0,0,0,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,0,0,0,1,1,1,0])
g = srs.index.to_series().groupby(srs.ne(srs.shift()).cumsum())

>>> g.transform("first").add(g.transform("last")).floordiv(2).where(srs.eq(1)).dropna().unique()
array([ 6., 18., 25.])
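
Because of the where/dropna step the result comes back as floats; if integer positions are needed, the same expression can be cast back at the end (a small follow-up, not part of the answer above):

>>> g.transform("first").add(g.transform("last")).floordiv(2).where(srs.eq(1)).dropna().unique().astype(int)
array([ 6, 18, 25])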