Pandas: Most computationally efficient way to combine consecutive rows with conditions-CodePudding

Say I have dataframe like this

df = pd.DataFrame({
'position': ['head', 'tail', 'head', 'head', 'head', 'tail', 'tail', 'head'], 
        'start': [2, 13, 54, 320, 654, 677, 3430, 9000],
        'end': [4, 15, 564, 390, 674, 679, 6000, 9010],
    })        #s.  e.   k    k.   s.        e.    k 
df.head(10)


  position  start   end
0     head      2     4
1     tail     13    15
2     head     54   564
3     head    320   390
4     head    654   674
5     tail    677   679
6     tail   3430  6000
7     head   9000  9010

I want to combine rows such that if the position label if the row is 'head' and then consecutive position is 'tail', then those rows should be combined in such a way that the 'start' value from 'head' is used, and 'end' value of 'tail' is used. And there are multiple consecutive 'tail' rows that follow a 'head' row, then the middle 'tail' rows will be skipped.

It's tricky to explain, but here's an example dataframe of what the desired result should look like

    position    start   end
0   tail            2    15
1   head           54   564
2   head          320   390
3   tail          654   6000
4   head         9000   9010

I came up with this solution using iterrows

previous = None
list_dicts = []
for idx, row in df.iterrows():
    if row['position'] == 'head':
        if previous:
            package = {'position': previous, 'start':previous_start, 'end':previous_end}
            list_dicts.append(package)
        previous = 'head'
        previous_start = row['start']
        previous_end = row['end']
    elif row['position'] == 'tail':
        previous = 'tail'
        previous_start = previous_start
        previous_end = row['end']
if row['position'] == 'head':
    package = {'position': row['position'], 'start':row['start'], 'end':row['end']}
elif row['position'] == 'tail':
    package = {'position': row['position'], 'start':previous_start, 'end':row['end']}
list_dicts.append(package)

pd.DataFrame(list_dicts).head(10)

But I have read that iterrows should be avoided because it's not the most computationally efficient way to manipulate dataframes. And in this case, I am resorting to creating a brand new dataframe. But in the case of using conditionals based on consecutive rows, it's the only solution I can think of.

CodePudding user response：

One way using pandas.groupby:

m = df["position"].eq("head").cumsum()
new_df = df.groupby(m, as_index=False).agg({"position": "last", 
                                            "start": "first", 
                                            "end": "last"})
print(new_df)

Output:

  position  start   end
0     tail      2    15
1     head     54   564
2     head    320   390
3     tail    654  6000
4     head   9000  9010