I have a DataFrame that looks like this:
import pandas as pd
df = pd.DataFrame({
    'Timestamp': ['1642847484', '1642847484', '1642847484', '1642847484',
                  '1642847487', '1642847487', '1642847487', '1642847487',
                  '1642847487', '1642847487', '1642847487', '1642847487',
                  '1642847489', '1642847489', '1642847489'],
    'value': [11, 10, 14, 20, 3, 2, 9, 48, 5, 20, 12, 20, 56, 12, 8]
})
I need to do some operations on each group of values that share the same timestamp, so I use groupby as follows:
df_grouped = df.groupby('Timestamp')
Then I iterate over the rows of each group and append the results row by row to a new DataFrame:
df_out = pd.DataFrame(columns=('Timestamp', 'value'))
for group_name, df_group in df_grouped:
    i = 0
    for row_index, row in df_group.iterrows():
        row['Timestamp'] = row['Timestamp'] * 1000 + i * 30
        df_out = df_out.append(row)
        i = i + 1
print(df_out.tail())
But my approach takes a very long time (7M rows), and I was wondering if there is a more efficient way to do this. Thank you
CodePudding user response:
I think iterrows here is not necessary; you can use:
def f(x):
    x['Timestamp'] = ...
    ....
    return x
df1 = df.groupby('Timestamp').apply(f)
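Filling in that sketch, here is a minimal runnable version; the body of f is my own assumption, mirroring the transformation from the question (seconds to milliseconds plus a 30 ms offset per row within each group):

import numpy as np
import pandas as pd

def f(x):
    # per group: convert the string seconds to int, scale to ms,
    # then offset each row by 30 ms based on its position in the group
    x = x.copy()
    x['Timestamp'] = x['Timestamp'].astype(np.int64) * 1000 + 30 * np.arange(len(x))
    return x

df1 = df.groupby('Timestamp', group_keys=False).apply(f)

Note that apply still calls f once per group in Python, so for 7M rows the fully vectorized cumcount version below is the better fit.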
EDIT: Create a counter Series with GroupBy.cumcount, then multiply it and add it to Timestamp:
import numpy as np

#if necessary, convert the string timestamps to integers
df['Timestamp'] = df['Timestamp'].astype(np.int64)
#scale seconds to milliseconds and add 30 ms per row within each group
df['Timestamp'] = df['Timestamp'] * 1000 + df.groupby('Timestamp').cumcount() * 30
print(df)
        Timestamp  value
0   1642847484000     11
1   1642847484030     10
2   1642847484060     14
3   1642847484090     20
4   1642847487000      3
5   1642847487030      2
6   1642847487060      9
7   1642847487090     48
8   1642847487120      5
9   1642847487150     20
10  1642847487180     12
11  1642847487210     20
12  1642847489000     56
13  1642847489030     12
14  1642847489060      8
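If the resulting integers are ultimately meant to be used as datetimes (an assumption on my part; the question only asks for the numbers), pd.to_datetime can decode them, since they are milliseconds since the Unix epoch:

# assumption: Timestamp now holds milliseconds since the Unix epoch
df['Datetime'] = pd.to_datetime(df['Timestamp'], unit='ms')
# e.g. 1642847484000 -> 2022-01-22 10:31:24 (UTC)

Either way, the cumcount approach replaces millions of Python-level appends with a handful of vectorized operations, which is why it scales to 7M rows.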