Can pandas findall() return a str instead of a list?

I have a pandas dataframe containing a lot of variables:

df.columns
Out[0]: 
Index(['COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE',
       'COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE',
       'COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE',
       'COUNADU_SOIL_P_SAUDPC_150_DA_B_V6_lesion_saudpc_PRATZE',
       'CONTRO_SOIL_P_pUNCK_150_DA_B_V6_lesion_p_control_PRATZE',
       'COUNJUV_SOIL_P_p_0_100_16_DA_B_V6_lesion_incidence_PRATZE',
       'COUNADU_SOIL_P_p_0_100_50_DA_B_VT_lesion_incidence_PRATZE',
       'COUNEGG_SOIL_P_p_0_100_128_DA_B_VT_lesion_incidence_PRATZE',
       'COUNEGG_SOIL_P_NUMBER_50_DA_B_V6_count_nr_spiral_HELYSP',
       'COUNJUV_SOIL_P_NUMBER_128_DA_B_V10_count_nr_spiral_HELYSP', # and so on

I would like to keep only the number followed by DA, so that the first column becomes 16_DA. I have been using the pandas function findall():

df.columns.str.findall(r'[0-9]*\_DA')
Out[595]: 
Index([ ['16_DA'],  ['50_DA'], ['128_DA'], ['150_DA'], ['150_DA'],
        ['16_DA'],  ['50_DA'], ['128_DA'],  ['50_DA'], ['128_DA'], ['150_DA'],
        ['150_DA'],  ['50_DA'], ['128_DA'],

But this returns a list for each column, which I would like to avoid, so that I end up with a column index looking like this:

df.columns
Out[595]: 
Index('16_DA',  '50_DA', '128_DA', '150_DA', '150_DA',
      '16_DA',  '50_DA', '128_DA',  '50_DA', '128_DA', '150_DA',

Is there a smoother way to do this?

CodePudding user response：

You can use .str.join(", ") to join all found matches with a comma and space:

df.columns.str.findall(r'\d+_DA').str.join(", ")
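
As a minimal sketch of what the join does, assuming the pattern `\d+_DA` (one or more digits followed by `_DA`) and three made-up column names that are not from the question: a name with two matches yields both, comma-separated, and a name with no match yields an empty string.

```python
import pandas as pd

# Hypothetical names: the second contains two matches, the third none.
cols = pd.Index(["A_16_DA_B", "A_16_DA_B_50_DA", "A_B"])

# findall returns a list per name; str.join collapses each list to one string.
joined = cols.str.findall(r"\d+_DA").str.join(", ")
print(list(joined))
```

If every name is expected to hold exactly one match, the join simply unwraps the single-element list.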

Or, just use str.extract to get the first match:

df.columns.str.extract(r'(\d+_DA)', expand=False)
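
A minimal sketch of the difference between the two calls, assuming an empty frame built from three of the column names in the question and the pattern `\d+_DA`:

```python
import pandas as pd

# Small illustrative frame using three column names from the question.
df = pd.DataFrame(columns=[
    "COUNADU_SOIL_P_NUMBER_16_DA_B_VE_count_nr_lesion_PRATZE",
    "COUNEGG_SOIL_P_NUMBER_50_DA_B_VT_count_nr_lesion_PRATZE",
    "COUNJUV_SOIL_P_NUMBER_128_DA_B_V6_count_nr_lesion_PRATZE",
])

# findall yields one list of matches per column name.
as_lists = df.columns.str.findall(r"\d+_DA")

# extract with a single capture group and expand=False yields plain strings.
as_strings = df.columns.str.extract(r"(\d+_DA)", expand=False)

# Assign the extracted labels back as the new column index.
df.columns = as_strings
print(list(df.columns))
```

Assigning the result back to df.columns renames the columns in place; names without a match would come back as NaN, so it may be worth checking for missing labels first.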

CodePudding user response：

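from itertools import chain
from typing import List

# \d+ requires at least one digit; [0-9]* would also match a bare "_DA".
pattern = r'\d+_DA'
# findall gives one list per column name; chain flattens them into a single list.
flattened: List[str] = list(chain.from_iterable(df.columns.str.findall(pattern)))
# Note: this builds one comma-separated string, not one label per column.
output: str = ",".join(flattened)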

CodePudding user response：

Yet another approach:

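def check_name(col: str) -> bool:
    # Expects a name of the form "<digits>_DA", e.g. "16_DA".
    parts = col.split("_")
    return len(parts) > 1 and parts[0].isdigit() and parts[1] == "DA"


# Pass the function itself; `lambda col: check_name` never calls it and keeps every column.
list(filter(check_name, df.columns))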