I have a dataframe with column names such as follow
import pandas as pd
data = [['30-06-2021', 3.4, 43578, '31-01-2022', 5000, '28-02-2022', 78564, '31-03-2022', 52353, '30-04-2022'],
['14-06-2021', 8.9, 4475, '14-01-2022', 2546, '05-02-2022', 5757, '28-03-2022', 2352, '01-04-2022']]
ds = pd.DataFrame(data, columns = ['Start', 'Rate', 'Jan-22Total', 'Jan-22', 'Feb-22Total', 'Feb-22', 'Mar-22Total', 'Mar-22',
'Apr-22Total', 'Apr-22'])
Start Rate Jan-22Total Jan-22 Feb-22Total Feb-22 Mar-22Total Mar-22 Apr-22Total Apr-2
0 30-06-2021 3.4 43578 31-01-2022 5000 28-02-2022 78564 31-03-2022 52353 30-04-2022
1 14-06-2021 8.9 4475 14-01-2022 2546 05-02-2022 5757 28-03-2022 2352 01-04-2022
I want to find a list that contains a certain string such as Jan-22Total
, Jan-22
, Feb-22Total
, Feb-22
, Mar-22Total
, Mar-22
and so on. Basically a %b-%y
format.
I've tried a few code but still no clue. The list still didn't come out.
Here's what I've tried so far, but the result shows []
only.
total_col = [col for col in ds.columns if '%b-%y' in col]
print(list(ds.columns))
print(total_col)
How should I list the column names?
CodePudding user response:
simple approach
A crude but maybe sufficient way if you do not need to ensure valid dates could be to use a short regex:
m = ds.columns.str.match(r'\w -\d ')
cols = ds.columns[m]
Output: ['Jan-22Total', 'Jan-22', 'Feb-22Total', 'Feb-22', 'Mar-22Total', 'Mar-22', 'Apr-22Total', 'Apr-22']
A slightly more discriminant regex could be r'[A-Z][a-z]{2}-\d '
(3 letters capitalized month abbreviation)
robust approach
A more robust approach would be to validate the dates with pandas.to_datetime
:
m = (pd.to_datetime(ds.columns.str.extract(r'^(\w -\d )',
expand=False), format='%b-%y', errors='coerce')
.notna()
)
cols = ds.columns[m]