I have a pandas DataFrame containing a time series. Every full second there is a row whose ID is the name of a point; the next 4 rows contain random point ids. I want to rename those 4 values after each name row to the name from that row, with a numeric suffix added:
time ID
12:00:00,00 pointname1
12:00:00,20 12345
12:00:00,40 45645
12:00:00,60 78963
12:00:00,80 23432
12:00:01,00 pointname2
12:00:01,20 53454
12:00:01,40 24324
12:00:01,60 24324
12:00:01,80 42435
I want to transform this into:
time ID
12:00:00,00 pointname1
12:00:00,20 pointname1_1
12:00:00,40 pointname1_2
12:00:00,60 pointname1_3
12:00:00,80 pointname1_4
12:00:01,00 pointname2
12:00:01,20 pointname2_1
12:00:01,40 pointname2_2
12:00:01,60 pointname2_3
12:00:01,80 pointname2_4
I have a working solution that iterates over the entire DataFrame, detects the 'pointname' rows and renames the 4 rows after each one. However, that takes a very long time with the 1.3 million rows the data contains. Is there a cleverer, more efficient way of doing this?
CodePudding user response:
Use Series.str.startswith
with Series.where
to set the non-matching values to missing, then forward fill them. Last, build a counter by GroupBy.cumcount
and append it to every value except the first of each group:
df['ID'] = df['ID'].where(df['ID'].str.startswith('pointname')).ffill()
df['ID'] += df.groupby('ID').cumcount().astype(str).radd('_').replace('_0', '')
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
CodePudding user response:
You can use to_numeric
(or str.startswith
if your identifier is literal; the only important point is to get True
for the rows to use as reference) to identify the name rows, then for all other rows use ffill
and groupby.cumcount
to build the new identifier:
# find rows with string identifier (could use other methods)
m = pd.to_numeric(df['ID'], errors='coerce').isna()
# or if "pointname" is literal
# m = df['ID'].str.startswith('pointname')
# for non matching rows, use previous value
# and add group counter
df.loc[~m, 'ID'] = (df['ID'].where(m).ffill()
                    + '_'
                    + df.groupby(m.cumsum()).cumcount().astype(str)
                    )
output:
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4
CodePudding user response:
You can groupby
the whole-second part of the time
column and transform
the ID
column so that every row in a group takes the group's first value plus a positional suffix.
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
            ['ID'].transform(lambda col: [f'{col.iloc[0]}_{s}' for s in [''] + list(range(1, len(col)))])
            .str.rstrip('_'))
# or
df['ID'] = (df.groupby(pd.to_datetime(df['time']).dt.strftime('%H:%M:%S'))
            ['ID'].transform(lambda col: [col.iloc[0]] + [f'{col.iloc[0]}_{s}' for s in range(1, len(col))]))
print(df)
time ID
0 12:00:00,00 pointname1
1 12:00:00,20 pointname1_1
2 12:00:00,40 pointname1_2
3 12:00:00,60 pointname1_3
4 12:00:00,80 pointname1_4
5 12:00:01,00 pointname2
6 12:00:01,20 pointname2_1
7 12:00:01,40 pointname2_2
8 12:00:01,60 pointname2_3
9 12:00:01,80 pointname2_4