Home > Software engineering >  Loop through dataframe to identify subgroup and label with unique identifier
Loop through dataframe to identify subgroup and label with unique identifier

Time:09-17

I'm trying to complete the Sessions column with a unique integer per session for further processing.

A session is defined by one day or a period from 9:30-16:00

        Symbol                Time   Open   High    Low  Close  Volume  LOD  Sessions
2724312   AEHR 2019-09-23 09:31:00   1.42   1.42   1.42   1.42     200  NaN       NaN
2724313   AEHR 2019-09-23 09:43:00   1.35   1.35   1.34   1.34    6062  NaN       NaN
2724314   AEHR 2019-09-23 09:58:00   1.35   1.35   1.29   1.30    8665  NaN       NaN
2724315   AEHR 2019-09-23 09:59:00   1.32   1.32   1.32   1.32     100  NaN       NaN
2724316   AEHR 2019-09-23 10:00:00   1.35   1.35   1.35   1.35     400  NaN       NaN
...        ...                 ...    ...    ...    ...    ...     ...  ...       ...

I've tried everything using loop but I keep getting every KeyError and SettingWithCopyWarning in the book.


Edit: Error & Code added

SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy self._setitem_single_block(indexer, value, name)

df = df
# Columns ['Symbol', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume', 'LOD', 'Sessions']

# Add Date column to loop through
df['Date'] = pd.to_datetime(df['Time']).dt.date

previous_session = df['Date'].iloc[0]
prev_sesh_count = 1

for i, row in df.iterrows():
    current_session = df['Date'].iloc[i]
    if previous_session == current_session:
        df['Sessions'].iloc[i] = prev_sesh_count
    else:
        df['Sessions'].iloc[i] = prev_sesh_count   1
        prev_sesh_count = prev_sesh_count   1

CodePudding user response:

Assuming the dataframe is sorted on Date, we can use duplicated along with cumsum to assign the unqiue sessions numbers

df['Sessions'] = (~df.duplicated(['Symbol', 'Date'])).cumsum()

print(df)

        Symbol                 Time  Open  High   Low  Close  Volume  LOD  Sessions        Date
2724312   AEHR  2019-09-23 09:31:00  1.42  1.42  1.42   1.42     200  NaN         1  2019-09-23
2724313   AEHR  2019-09-23 09:43:00  1.35  1.35  1.34   1.34    6062  NaN         1  2019-09-23
2724314   AEHR  2019-09-23 09:58:00  1.35  1.35  1.29   1.30    8665  NaN         1  2019-09-23
2724315   AEHR  2019-09-23 09:59:00  1.32  1.32  1.32   1.32     100  NaN         1  2019-09-23
2724316   AEHR  2019-09-23 10:00:00  1.35  1.35  1.35   1.35     400  NaN         1  2019-09-23
  • Related