Create a loop to subsets a Dataframe based a boolean value in Python-CodePudding

I have a problem coding a loop to subset a Dataframe in Python.

This is my first post on stack overflow and I have started to code fews months ago so I am sorry if I am doing something wrong ..! I have looked over the web for days now but couldn't find an answer (my keywords might have been poorly chosen..)

To give some context, here is how I obtained my df from a csv file:

#Library

import pandas as pd
import numpy as np

#Assisgn spreadsheets filenames and read files into a Dataframe

file_20 = '/Users/cortana/Desktop/Projet stage/DAT/dat_clean/donnees_assemblees_20.csv'
df_20_initial = pd.read_csv(file_20, sep=';', usecols=[0, 2, 3])

#Create dictionary with tables names as keys

tables_names_20 = pd.DataFrame.dropna(df_20_initial.iloc[:,[0]])
tables_names_20 = tables_names_20.set_index('20').T.to_dict()

#Slice the global dataframe and store the subsets into the dictionary as values

df_20_initial['separators'] = df_20_initial['time'].isna() #add a new column that check for missing values (separators)

print(df_20_initial)

Here is what my df looks like:

       20      time  velocity  separators
0    P1S1  6.158655  0.136731       False
1     NaN  6.179028  0.244889       False
2     NaN  6.199253  0.386443       False
3     NaN  6.219323  0.571861       False
4     NaN  6.239505  0.777680       False
..    ...       ...       ...         ...
520   NaN  7.008377  1.423408       False
521   NaN  7.028759  1.180113       False
522   NaN  7.048932  0.929300       False
523   NaN  7.068993  0.673909       False
524   NaN  7.089557  0.413527       False

[525 rows x 4 columns]

Based on the boolean value present in the "separators" column, I would like to create a new Dataframe containing the values of the "time" and "velocity" column, sliced when the "separators" value is True.

To do so, I have unsuccessfully tried to code the following loop:

for lab, row in df_20_initial.iterrows() :
    if df_20_initial.iloc[:,3] == False :
        P1S1 = df_20_intermediate[['time', 'velocity']]
    else :
      break

... and got this error message from Python:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Any advices is welcome, and thank you all in advance for your time!

CodePudding user response：

Pandas is really good at Boolean slices. If I understand your question correctly, I think all you need is:

new_df = df_20_initial[df_20_initial['seperators']]

If you want to remove the 'seperators' column from the output, you can just select the remaining columns like so:

new_df = df_20_initial[df_20_initial['seperators']][['time', 'velocity']]

CodePudding user response：

For my experiments I used your DataFrame with separators set to True in some rows:

     20      time  velocity  separators
0  P1S1  6.158655  0.136731       False
1   NaN  6.179028  0.244889       False
2   NaN  6.199253  0.386443       False
3   NaN  6.219323  0.571861        True
4   NaN  6.239505  0.777680       False
5   NaN  7.008377  1.423408       False
6   NaN  7.028759  1.180113       False
7   NaN  7.048932  0.929300        True
8   NaN  7.068993  0.673909       False
9   NaN  7.089557  0.413527       False

I assumed that separators column is of bool type.

To generate a list of chunks you can use e.g. the following list comprehension:

dfList = [ chunk[['time', 'velocity']] for _, chunk in
    df_20_initial.groupby(df_20_initial.separators.cumsum()) ]

Now when you e.g. print dfList[1] you will get:

       time  velocity
3  6.219323  0.571861
4  6.239505  0.777680
5  7.008377  1.423408
6  7.028759  1.180113

But if you want to drop separator rows, run:

dfList2 = [ chunk[~chunk.separators][['time', 'velocity']] for _, chunk in
    df_20_initial.groupby(df_20_initial.separators.cumsum()) ]

(from each chunk leave only rows with separators == False).