How to return columns with with similar names in a dataframe using pandas-CodePudding

Say I have a table that looks something like:

 ---------------------------------- ------------------------------------- ---------------------------------- 
| ExperienceModifier|ApplicationId | ExperienceModifier|RatingModifierId | ExperienceModifier|ActionResults |
 ---------------------------------- ------------------------------------- ---------------------------------- 
|                                  |                                     |                                  |
 ---------------------------------- ------------------------------------- ----------------------------------

I would like to grab all of the columns that lead with 'ExperienceModifier' and stuff the results of that into its own dataframe. How would I accomplish this with pandas?

CodePudding user response：

You can try pandas.DataFrame.filter

df.filter(like='ExperienceModifier')

If you want to get columns only contains ExperienceModifier at the beginning.

df.filter(regex='^ExperienceModifier')

CodePudding user response：

Ynjxsjmh's answer will get all columns that contain "ExperienceModifier". If you literally want columns that start with that string, rather merely contain it, you can do new_df = df[[col for col in df.columns if col[:18] == 'ExperienceModifier']]. If all of the desired columns have | after "ExperienceModifier", you could also do new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']]. All of these will create a view of the dataframe. If you want a completely separate dataframe, you should copy it, like this: new_df = df[[col for col in df.columns if col.split('|')[0] == 'ExperienceModifier']].copy(). You also might want to create a multi-index by splitting the column names on | rather than creating a separate dataframe.

CodePudding user response：

The accepted answer does easly the job but I still attach my "hand made version" that works:

import pandas as pd
import numpy as np
import re

lst = [[1, 2, 3, 4],[1, 2, 3, 4],[1, 2, 3, 4]]
column_names = [['ExperienceModifier|ApplicationId', 'ExperienceModifier|RatingModifierId', 'ExperienceModifier|ActionResults','OtherName|ActionResults']] 

data = pd.DataFrame(lst, columns = column_names) 
data


old_and_dropped_dataframes = []
new_dataframes=[]    
for i in np.arange(0,len(column_names[0])):
    column_names[0][i].split("|")
    splits=re.findall(r"[\w'] ", column_names[0][i])
    if "ExperienceModifier" in splits:
        new_dataframe = data.iloc[:,[i]]
        new_dataframes.append(new_dataframe)
    else:
        old_and_dropped_dataframe =  data.iloc[:,[i]]
        old_and_dropped_dataframes.append(old_and_dropped_dataframe)



ExperienceModifier_dataframe = pd.concat(new_dataframes,axis=1)
ExperienceModifier_dataframe

OtherNames_dataframe = pd.concat(old_and_dropped_dataframes,axis=1)
OtherNames_dataframe

This script creates two new dataframes starting from the initial dataframe: one that contains the columns whose names start with ExperienceModifier and an other one that contains the columns that do not start with ExperienceModifier.