Split columns by space or dash - python-CodePudding

I have a pandas df with mixed formatting for a specific column. It contains the qtr and year. I'm hoping to split this column into separate columns. But the formatting contains a space or a second dash between qtr and year.

I'm hoping to include a function that splits the column by a blank space or a second dash.

df = pd.DataFrame({              
'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'],                           
})

out:

            Qtr
0  APR-JUN 2019 # blank
1  JAN-MAR 2019 # blank
2  JAN-MAR 2015 # blank
3  JUL-SEP-2020 # second dash
4  OCT-DEC 2014 # blank
5  JUL-SEP-2015 # second dash

split by blank

df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', 1, expand=True)

split by second dash

df[['Qtr', 'Year']] = df['Qtr'].str.split('-', 1, expand=True)

intended output:

            Qtr  Year
0       APR-JUN  2019
1       JAN-MAR  2019
2       JAN-MAR  2015
3       JUL-SEP  2020
4       OCT-DEC  2014
5       JUL-SEP  2015

CodePudding user response：

You can use a regular expression with the extract function of the string accessor.

df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).{1}(\d{4})')
print(df)

Result

       Qtr  Year
0  APR-JUN  2019
1  JAN-MAR  2019
2  JAN-MAR  2015
3  JUL-SEP  2020
4  OCT-DEC  2014
5  JUL-SEP  2015

CodePudding user response：

You can split using regex using positive lookahead and non capturing group (?:..), then filter out the empty values, and apply a pandas Series on the values:

>>> (df.Qtr.str.split('\s|(. (?<=-). )(?:-)')
    .apply(lambda x: [i for i in x if i])
    .apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
    )

       Qtr  Year
0  APR-JUN  2019
1  JAN-MAR  2019
2  JAN-MAR  2015
3  JUL-SEP  2020
4  OCT-DEC  2014
5  JUL-SEP  2015

CodePudding user response：

If, and only if, the data is in the posted format you could use list slicing.

import pandas as pd

df = pd.DataFrame(
    {
        "Qtr": [
            "APR-JUN 2019",
            "JAN-MAR 2019",
            "JAN-MAR 2015",
            "JUL-SEP-2020",
            "OCT-DEC 2014",
            "JUL-SEP-2015",
        ],
    }
)

df[['Qtr', 'Year']] = [(x[:7], x[8:12])  for x in df['Qtr']]

print(df)

       Qtr  Year
0  APR-JUN  2019
1  JAN-MAR  2019
2  JAN-MAR  2015
3  JUL-SEP  2020
4  OCT-DEC  2014
5  JUL-SEP  2015