I have a pandas DataFrame containing a time series. Every full second there is a row whose ID is the name of a point; the next 4 rows contain random point ids. I want to rename those 4 values after each name row to the name from that row, with a numeric suffix added:
time ID
12:00:00,00 pointname1
12:00:00,20 12345
12:00:00,40 45645
12:00:00,60 78963
12:00:00,80 23432
12:00:01,00 pointname2
12:00:01,20 53454
12:00:01,40 24324
12:00:01,60 24324
12:00:01,80 42435
I want to transform this into:
time ID
12:00:00,00 pointname1
12:00:00,20 pointname1_1
12:00:00,40 pointname1_2
12:00:00,60 pointname1_3
12:00:00,80 pointname1_4
12:00:01,00 pointname2
12:00:01,20 pointname2_1
12:00:01,40 pointname2_2
12:00:01,60 pointname2_3
12:00:01,80 pointname2_4
I have a working solution that iterates over the entire DataFrame, detects the 'pointname' rows and renames the 4 rows after each one. However, that takes a very long time with the 1.3 million rows the data contains. Is there a cleverer, more efficient way of doing this?
CodePudding user response:
Use Series.str.startswith
with Series.where
to set the non-matching values to missing, then forward fill them. Last, build a counter by GroupBy.cumcount
and append it to every value except the first of each group:
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0', '')
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
CodePudding user response:
You can use to_numeric
(or str.startswith
if your identifier is literal; the only important point is to get True
for the rows to use as reference) to identify the name rows, then for all other rows use ffill
and groupby.cumcount
to build the new identifier:
# find rows with string identifier (could use other methods)
m = pd.to_numeric(df['ID'], errors='coerce').isna()
# or if "pointname" is literal
# m = df['ID'].str.startswith('pointname')
# for non matching rows, use previous value
# and add group counter
df.loc[~m, 'ID'] = (df['ID'].where(m).ffill()
                    + '_'
                    + df.groupby(m.cumsum()).cumcount().astype(str)
                    )
output:
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
CodePudding user response:
You can groupby
the whole-second part of the time
column and transform
the ID
column so that every row in a group takes the group's first value plus a positional suffix.
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
            ['ID'].transform(lambda col: [f'{col.iloc[0]}_{s}' for s in [''] + list(range(1, len(col)))])
            .str.rstrip('_'))
# or
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
            ['ID'].transform(lambda col: [col.iloc[0]] + [f'{col.iloc[0]}_{s}' for s in range(1, len(col))]))
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4