I have a pandas dataframe that looks like this:
id | year | month | day | hour | timestamp | location_statindex |
---|---|---|---|---|---|---|
TP6045276120589011 | 2022 | 09 | 01 | 02 | 2022-09-01 02:01:23 | 1 |
TP6031280073133032 | 2022 | 12 | 16 | 01 | 2022-12-16 01:48:42 | 1 |
with around 300.000 rows.
With dytpes
id object
year object
month object
day object
hour object
timestamp datetime64[ns]
location_statindex object
dtype: object
Now I create a new column which contains elements from each row, like:
1\2022\09\01\02
with this code:
df['folder_path'] = df[['location_statindex', 'year', 'month', 'day', 'hour']].agg('\\'.join, axis=1)
The question is now if it is possible to change the aggregation, so that the value from location_statindex is a different string depending on an if condition.
Like this:
location = '2'
if location in ['1','2','3','4']:
location = f'Platz_optimiert_{location}'
elif location in ['5','6']:
location = f'KSPlatz_{location}'
else:
location = f'Platz_optimiert_TEF_{location}'
Expected output is a new column in the same dataframe with a joined string from the element of the corresponing row. For example the following string for the first row.
'Platz_optimiert_1\2022\09\01\02'
I already have a solution that uses the apply function from pandas with a self written function which contains the if block. But thinking about the size of the dataframe I want to try to reduce the necessary time. Is aggregation even faster than apply?
CodePudding user response:
With single assignment, by means of numpy.where
routine.
Sample dataframe (note different location_statindex
values for representative test):
In [473]: df
Out[473]:
id year month day hour timestamp \
0 TP6045276120589011 2022 9 1 2 2022-09-01 02:01:23
1 TP6031280073133032 2022 12 16 1 2022-12-16 01:48:42
2 TP6031280073133032 2022 12 16 1 2022-12-16 01:48:42
location_statindex
0 1
1 5
2 7
Assign folder_path
columns with concatenated values:
loc_stat = df.location_statindex
df['folder_path'] = np.where(loc_stat.isin(['1','2','3','4']),
'Platz_optimiert_' loc_stat,
np.where(loc_stat.isin(['5', '6']), 'KSPlatz_' loc_stat, 'Platz_optimiert_TEF_' loc_stat)) \
'\\' df[['year', 'month', 'day', 'hour']].agg('\\'.join, axis=1)
In [488]: df
Out[488]:
id year month day hour timestamp \
0 TP6045276120589011 2022 9 1 2 2022-09-01 02:01:23
1 TP6031280073133032 2022 12 16 1 2022-12-16 01:48:42
2 TP6031280073133032 2022 12 16 1 2022-12-16 01:48:42
location_statindex folder_path
0 1 Platz_optimiert_1\2022\9\1\2
1 5 KSPlatz_5\2022\12\16\1
2 7 Platz_optimiert_TEF_7\2022\12\16\1