Find list of column names with month-year string-CodePudding

I have a dataframe with column names such as follow

import pandas as pd 

data = [['30-06-2021', 3.4, 43578, '31-01-2022', 5000, '28-02-2022', 78564, '31-03-2022', 52353, '30-04-2022'], 
        ['14-06-2021', 8.9, 4475, '14-01-2022', 2546, '05-02-2022', 5757, '28-03-2022', 2352, '01-04-2022']]
ds = pd.DataFrame(data, columns = ['Start', 'Rate', 'Jan-22Total', 'Jan-22', 'Feb-22Total', 'Feb-22', 'Mar-22Total', 'Mar-22',
                                   'Apr-22Total', 'Apr-22'])

    Start       Rate Jan-22Total  Jan-22   Feb-22Total  Feb-22  Mar-22Total Mar-22   Apr-22Total  Apr-2
0   30-06-2021  3.4    43578    31-01-2022   5000     28-02-2022  78564   31-03-2022    52353   30-04-2022
1   14-06-2021  8.9    4475     14-01-2022   2546     05-02-2022   5757   28-03-2022    2352    01-04-2022

I want to find a list that contains a certain string such as Jan-22Total, Jan-22, Feb-22Total, Feb-22, Mar-22Total, Mar-22 and so on. Basically a %b-%y format.

I've tried a few code but still no clue. The list still didn't come out.

Here's what I've tried so far, but the result shows [] only.

total_col = [col for col in ds.columns if '%b-%y' in col]
print(list(ds.columns))
print(total_col)

How should I list the column names?

CodePudding user response：

simple approach

A crude but maybe sufficient way if you do not need to ensure valid dates could be to use a short regex:

m = ds.columns.str.match(r'\w -\d ')
cols = ds.columns[m]

Output: ['Jan-22Total', 'Jan-22', 'Feb-22Total', 'Feb-22', 'Mar-22Total', 'Mar-22', 'Apr-22Total', 'Apr-22']

A slightly more discriminant regex could be r'[A-Z][a-z]{2}-\d ' (3 letters capitalized month abbreviation)

robust approach

A more robust approach would be to validate the dates with pandas.to_datetime:

m = (pd.to_datetime(ds.columns.str.extract(r'^(\w -\d )',
                    expand=False), format='%b-%y', errors='coerce')
        .notna()
     )
cols = ds.columns[m]