I have a pd.DataFrame with 87 columns, which fall into 5 groups of columns with names like:
1)first_stage.output.quantity_1
2)first_stage.input.quantity_2
......
18)first_stage.output.recovery_rate
19)first_stage.input_quantity_1
20)first_stage.input_quantity_2
....
49)first_stage.output_concentration
50)second_stage.quantity_1
.....
72)second_stage.output.recovery_rate
73)initial.quantity_concentrate_sub_1
...
87)initital.output_conctntration
The first word in each column name is the name of a physical process. I need to show only the columns for a particular process (first_stage, second_stage, initial, final_stage). What can be done? Regular expressions could probably be used, but I failed to implement this.
CodePudding user response:
Use filter:
Suppose the dataframe below:
>>> df
first_stage.output.quantity_1 first_stage.input.quantity_2 first_stage.output.recovery_rate ... second_stage.output.recovery_rate initial.quantity_concentrate_sub_1 initital.output_conctntration
0 4 7 8 ... 5 9 6
filtered_df = df.filter(regex=r'^(?:first|second|final)_stage')
# Output:
first_stage.output.quantity_1 first_stage.input.quantity_2 first_stage.output.recovery_rate ... first_stage.output_concentration second_stage.quantity_1 second_stage.output.recovery_rate
0 4 7 8 ... 3 4 5
[1 rows x 8 columns]
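Detail of regex:
^ : match the start of the column name
(?:...) : create a non-capturing group
first|second|final : match one of the words
_stage : followed by the literal text '_stage'
Note that this pattern does not match the columns that start with 'initial'.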
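If the initial columns are wanted too, the alternation can be widened. The following is a minimal sketch, assuming a toy one-row frame: the column names are borrowed from the question, while the values and the final_stage column are made up for illustration. The trailing `\.` is an extra assumption that the process name is always followed by a dot.

```python
import pandas as pd

# Toy frame: column names come from the question, values are invented.
df = pd.DataFrame(
    [[4, 7, 5, 9, 6]],
    columns=[
        "first_stage.output.quantity_1",
        "first_stage.input.quantity_2",
        "second_stage.output.recovery_rate",
        "initial.quantity_concentrate_sub_1",
        "final_stage.output.quantity_1",  # hypothetical column name
    ],
)

# Only the *_stage processes, as in the answer above.
stage_df = df.filter(regex=r"^(?:first|second|final)_stage\.")

# Widened alternation: the *_stage processes or 'initial'.
all_df = df.filter(regex=r"^(?:(?:first|second|final)_stage|initial)\.")

print(list(stage_df.columns))
print(list(all_df.columns))
```

To select a single process, the pattern can be reduced to one prefix, for example `df.filter(regex=r"^second_stage\.")`.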
CodePudding user response:
A simple solution is to check the prefix of each column name.
df = pd.DataFrame(...) #your dataframe
first_stage = [col for col in df.columns if col.startswith('first_stage')]
display(df[first_stage])
Then, you can just repeat this for the other types and you're done.
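Rather than repeating the list comprehension by hand, the same check can be run once per process name. This is a rough sketch, assuming a toy frame with invented values and assuming every process name is followed by a dot; `by_process` is a hypothetical name introduced here.

```python
import pandas as pd

# Toy frame: column names come from the question, values are invented.
df = pd.DataFrame(
    [[4, 7, 5, 9]],
    columns=[
        "first_stage.output.quantity_1",
        "first_stage.input.quantity_2",
        "second_stage.output.recovery_rate",
        "initial.quantity_concentrate_sub_1",
    ],
)

processes = ["first_stage", "second_stage", "initial", "final_stage"]

# One sub-frame per process, keyed by the process name.
# The "." is appended so that a prefix only matches a whole first word.
by_process = {
    name: df[[col for col in df.columns if col.startswith(name + ".")]]
    for name in processes
}

print(by_process["first_stage"])
```

A process with no matching columns (final_stage in this toy frame) simply yields a frame with zero columns.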
CodePudding user response:
A step-by-step solution if you do not want to use regular expressions:
# this gives you the names of the columns of your dataframe in a list
column_name_list = list(initial_data)
# this list contains the names you want to filter
column_names_to_filter = ["first_stage", "second_stage", "initial", "final_stage"]
# empty dataframe to store only the data you want
filtered_df = pd.DataFrame()
# looping through each column name of your initial dataframe
for actual_column in column_name_list:
    # if the column name contains any of the names you want to catch (from the list declared before),
    # store that column's data in your new dataframe
    if any(check_column in actual_column for check_column in column_names_to_filter):
        filtered_df[actual_column] = initial_data[actual_column]
print(filtered_df)