Home > Mobile >  join aggregation with if condition on one column pandas
join aggregation with if condition on one column pandas

Time:01-19

I have a pandas dataframe that looks like this:

id year month day hour timestamp location_statindex
TP6045276120589011 2022 09 01 02 2022-09-01 02:01:23 1
TP6031280073133032 2022 12 16 01 2022-12-16 01:48:42 1

with around 300.000 rows.
With dytpes

id                             object
year                           object
month                          object
day                            object
hour                           object
timestamp                      datetime64[ns]
location_statindex             object
dtype: object

Now I create a new column which contains elements from each row, like:
1\2022\09\01\02
with this code:
df['folder_path'] = df[['location_statindex', 'year', 'month', 'day', 'hour']].agg('\\'.join, axis=1)
The question is now if it is possible to change the aggregation, so that the value from location_statindex is a different string depending on an if condition.
Like this:

location = '2'
if location in ['1','2','3','4']:
    location = f'Platz_optimiert_{location}'
elif location in ['5','6']:
    location = f'KSPlatz_{location}'
else:
    location = f'Platz_optimiert_TEF_{location}'

Expected output is a new column in the same dataframe with a joined string from the element of the corresponing row. For example the following string for the first row. 'Platz_optimiert_1\2022\09\01\02'

I already have a solution that uses the apply function from pandas with a self written function which contains the if block. But thinking about the size of the dataframe I want to try to reduce the necessary time. Is aggregation even faster than apply?

CodePudding user response:

With single assignment, by means of numpy.where routine.

Sample dataframe (note different location_statindex values for representative test):

In [473]: df
Out[473]: 
                    id  year month day hour             timestamp  \
0  TP6045276120589011   2022     9   1    2  2022-09-01 02:01:23    
1  TP6031280073133032   2022    12  16    1  2022-12-16 01:48:42    
2  TP6031280073133032   2022    12  16    1  2022-12-16 01:48:42    

  location_statindex  
0                  1  
1                  5  
2                  7

Assign folder_path columns with concatenated values:

loc_stat = df.location_statindex
df['folder_path'] = np.where(loc_stat.isin(['1','2','3','4']), 
                             'Platz_optimiert_'   loc_stat, 
                             np.where(loc_stat.isin(['5', '6']), 'KSPlatz_'   loc_stat, 'Platz_optimiert_TEF_'   loc_stat)) \
                      '\\'   df[['year', 'month', 'day', 'hour']].agg('\\'.join, axis=1)

In [488]: df
Out[488]: 
                    id  year month day hour             timestamp  \
0  TP6045276120589011   2022     9   1    2  2022-09-01 02:01:23    
1  TP6031280073133032   2022    12  16    1  2022-12-16 01:48:42    
2  TP6031280073133032   2022    12  16    1  2022-12-16 01:48:42    

  location_statindex                         folder_path  
0                  1        Platz_optimiert_1\2022\9\1\2  
1                  5              KSPlatz_5\2022\12\16\1  
2                  7  Platz_optimiert_TEF_7\2022\12\16\1
  • Related