How to show columns in pd.DataFrame, containing special phrase in column name? [duplicate]-CodePudding

I have pd.DataFrame consisting of 87 columns, which has 5 group of columns with names:

1)first_stage.output.quantity_1
2)first_stage.input.quantity_2
......
18)first_stage.output.recovery_rate
19)first_stage.input_quantity_1
20)first_stage.input_quantity_2
....
49)first_stage.output_concentration
50)second_stage.quantity_1
.....
72)second_stage.output.recovery_rate
73)initial.quantity_concentrate_sub_1
...
87)initital.output_conctntration

First word in column name means the name of physical process. I need to show only columns with particular process (first_stage, second_stage,initial, final_stage). What can be done? Probably some regular expressions should be used, but I failed to implement it.

CodePudding user response：

Use filter:

Suppose the dataframe below:

>>> df
   first_stage.output.quantity_1  first_stage.input.quantity_2  first_stage.output.recovery_rate  ...  second_stage.output.recovery_rate  initial.quantity_concentrate_sub_1  initital.output_conctntration
0                              4                             7                                 8  ...                                  5                                   9                              6

filtered_df = df.filter(regex=r'^(?:first|second|final)_stage')

# Output:
   first_stage.output.quantity_1  first_stage.input.quantity_2  first_stage.output.recovery_rate  ...  first_stage.output_concentration  second_stage.quantity_1  second_stage.output.recovery_rate
0                              4                             7                                 8  ...                                 3                        4                                  5

[1 rows x 8 columns]

Detail of regex:

^ match the start of the line
(?:...) create a non-capturing group
first|second|initial match one of the words
_stage followed by the word '_stage'

CodePudding user response：

A simple solution to this would be to just simply check for each column ID.

df = pd.DataFrame(...) #your dataframe
first_stage = [col for col in df.columns if col.startswith('first_stage')]
display(df[first_stage])

Then, you can just repeat this for the other types and you're done.

CodePudding user response：

A step-by-step solution if you do not want to use regular expressions:

# this gives you the names of the columns of your dataframe in a list
column_name_list= list(initial_data)

# this list contains the names you want to filter
column_names_to_filter = ["first_stage", "second_stage", "initial", "final_stage"]

# empty dataframe to store only the data you want
filtered_df = pd.DataFrame()

# looping through each of the column_name of your initial dataframe
for actual_column in column_name_list:

    # if the column_name matches with any of the names you want to catch (from the list declared before)
    # store that column's data to your new dataframe
    if any(check_column in actual_column for check_column in column_names_to_filter):
        filtered_df[actual_column] = data[actual_column]

print(filtered_df)