I have a pandas df with mixed formatting for a specific column. It contains the qtr and year. I'm hoping to split this column into separate columns. But the formatting contains a space or a second dash between qtr and year.
I'm hoping to include a function that splits the column by a blank space or a second dash.
df = pd.DataFrame({
'Qtr' : ['APR-JUN 2019','JAN-MAR 2019','JAN-MAR 2015','JUL-SEP-2020','OCT-DEC 2014','JUL-SEP-2015'],
})
out:
Qtr
0 APR-JUN 2019 # blank
1 JAN-MAR 2019 # blank
2 JAN-MAR 2015 # blank
3 JUL-SEP-2020 # second dash
4 OCT-DEC 2014 # blank
5 JUL-SEP-2015 # second dash
split by blank
df[['Qtr', 'Year']] = df['Qtr'].str.split(' ', 1, expand=True)
split by second dash
df[['Qtr', 'Year']] = df['Qtr'].str.split('-', 1, expand=True)
intended output:
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
CodePudding user response:
You can use a regular expression with the extract
function of the string accessor.
df[['Qtr', 'Year']] = df['Qtr'].str.extract(r'(\w{3}-\w{3}).{1}(\d{4})')
print(df)
Result
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
CodePudding user response:
You can split using regex using positive lookahead and non capturing group (?:..)
, then filter out the empty values, and apply a pandas Series on the values:
>>> (df.Qtr.str.split('\s|(. (?<=-). )(?:-)')
.apply(lambda x: [i for i in x if i])
.apply(lambda x: pd.Series(x, index=['Qtr', 'Year']))
)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015
CodePudding user response:
If, and only if, the data is in the posted format you could use list slicing.
import pandas as pd
df = pd.DataFrame(
{
"Qtr": [
"APR-JUN 2019",
"JAN-MAR 2019",
"JAN-MAR 2015",
"JUL-SEP-2020",
"OCT-DEC 2014",
"JUL-SEP-2015",
],
}
)
df[['Qtr', 'Year']] = [(x[:7], x[8:12]) for x in df['Qtr']]
print(df)
Qtr Year
0 APR-JUN 2019
1 JAN-MAR 2019
2 JAN-MAR 2015
3 JUL-SEP 2020
4 OCT-DEC 2014
5 JUL-SEP 2015