I have a problem coding a loop to subset a Dataframe in Python.
This is my first post on stack overflow and I have started to code fews months ago so I am sorry if I am doing something wrong ..! I have looked over the web for days now but couldn't find an answer (my keywords might have been poorly chosen..)
To give some context, here is how I obtained my df from a csv file:
#Library
import pandas as pd
import numpy as np
#Assisgn spreadsheets filenames and read files into a Dataframe
file_20 = '/Users/cortana/Desktop/Projet stage/DAT/dat_clean/donnees_assemblees_20.csv'
df_20_initial = pd.read_csv(file_20, sep=';', usecols=[0, 2, 3])
#Create dictionary with tables names as keys
tables_names_20 = pd.DataFrame.dropna(df_20_initial.iloc[:,[0]])
tables_names_20 = tables_names_20.set_index('20').T.to_dict()
#Slice the global dataframe and store the subsets into the dictionary as values
df_20_initial['separators'] = df_20_initial['time'].isna() #add a new column that check for missing values (separators)
print(df_20_initial)
Here is what my df looks like:
20 time velocity separators
0 P1S1 6.158655 0.136731 False
1 NaN 6.179028 0.244889 False
2 NaN 6.199253 0.386443 False
3 NaN 6.219323 0.571861 False
4 NaN 6.239505 0.777680 False
.. ... ... ... ...
520 NaN 7.008377 1.423408 False
521 NaN 7.028759 1.180113 False
522 NaN 7.048932 0.929300 False
523 NaN 7.068993 0.673909 False
524 NaN 7.089557 0.413527 False
[525 rows x 4 columns]
Based on the boolean value present in the "separators" column, I would like to create a new Dataframe containing the values of the "time" and "velocity" column, sliced when the "separators" value is True.
To do so, I have unsuccessfully tried to code the following loop:
for lab, row in df_20_initial.iterrows() :
if df_20_initial.iloc[:,3] == False :
P1S1 = df_20_intermediate[['time', 'velocity']]
else :
break
... and got this error message from Python:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any advices is welcome, and thank you all in advance for your time!
CodePudding user response:
Pandas is really good at Boolean slices. If I understand your question correctly, I think all you need is:
new_df = df_20_initial[df_20_initial['seperators']]
If you want to remove the 'seperators' column from the output, you can just select the remaining columns like so:
new_df = df_20_initial[df_20_initial['seperators']][['time', 'velocity']]
CodePudding user response:
For my experiments I used your DataFrame with separators set to True in some rows:
20 time velocity separators
0 P1S1 6.158655 0.136731 False
1 NaN 6.179028 0.244889 False
2 NaN 6.199253 0.386443 False
3 NaN 6.219323 0.571861 True
4 NaN 6.239505 0.777680 False
5 NaN 7.008377 1.423408 False
6 NaN 7.028759 1.180113 False
7 NaN 7.048932 0.929300 True
8 NaN 7.068993 0.673909 False
9 NaN 7.089557 0.413527 False
I assumed that separators column is of bool type.
To generate a list of chunks you can use e.g. the following list comprehension:
dfList = [ chunk[['time', 'velocity']] for _, chunk in
df_20_initial.groupby(df_20_initial.separators.cumsum()) ]
Now when you e.g. print dfList[1] you will get:
time velocity
3 6.219323 0.571861
4 6.239505 0.777680
5 7.008377 1.423408
6 7.028759 1.180113
But if you want to drop separator rows, run:
dfList2 = [ chunk[~chunk.separators][['time', 'velocity']] for _, chunk in
df_20_initial.groupby(df_20_initial.separators.cumsum()) ]
(from each chunk leave only rows with separators == False).