I have a data frame column in the below format:
header |
---|
THIS IS an example |
ΕΛΕΥΘΕΡΙΟ Άγιος Γεώργιος |
ALSO this |
ONE LAST |
J. one more |
I want to split it into two columns:
header1 | header2 |
---|---|
THIS IS | an example |
ALSO | this |
ONE LAST | null |
null | J. one more |
I have tried extracting the information like this:
df1['header'].str.split('[A-Z]', expand=True)
but my regular expressions are not up to par. Any help is much appreciated!
CodePudding user response:
Using str.extract
we can try:
df["header1"] = df["header"].str.extract(r'^([A-Z] (?: [A-Z] )?)')
df["header2"] = df["header"].str.extract(r'\b([a-z] (?: [a-z] )?)')
CodePudding user response:
Greek Letter Notice
To only match Greek uppercase letters, replace [A-Z]
in below patterns with [\u0391-\u03A1\u03A3-\u03A9]
. To match both ASCII and Greek uppercase letters use [a-zA-Z\u0391-\u03A1\u03A3-\u03A9]
.
I.e.
rx = r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9] \b(?!\.)(?:\s [\u0391-\u03A1\u03A3-\u03A9] )*\b)?)(?:\s (?P<header2>.*))?'
new_df = df['header'].str.extract(rx, expand=True)
See the regex demo.
You can use
df[['header1', 'header2']] = df['header'].str.extract(r'^\s*((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(.*)', expand=True)
Output:
>>> df
header header1 header2
0 THIS IS an example THIS IS an example
1 ALSO this ALSO this
2 ONE LAST ONE LAST
3 J. one more J. one more
See the regex demo.
Details:
^
- start of string\s*
- zero or more whitespaces((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)
- Group 1 (header1
): an optional sequence of one or more uppercase ASCII letters (not followed with a.
char) and then zero or more sequences of one or more whitespaces and one or more uppercase ASCII letters\s*
- zero or more whitespaces(.*)
- Group 2 (header2
): any zero or more chars other than line break chars, as many as possible.
You may extract to a new dataframe using named capturing groups:
>>> new_df = df['header'].str.extract(r'^\s*(?P<header1>(?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(?P<header2>.*)', expand=True)
>>> new_df
header1 header2
0 THIS IS an example
1 ALSO this
2 ONE LAST
3 J. one more
CodePudding user response:
You can also use 2 named capture groups, and join the columns.
^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$
(or use [a-z].*
instead of .
if it must start with a lowercase char)
^
Start of string(?P<header1>[A-Z]
Capture group header1, match 1 chars A-Z(?:[^\S\n] [A-Z] )*)?
Optionally match spaces and 1 chars A-Z(?:
Non capture group(?:^|[^\S\n] )
Either assert the start of the string or match 1 spaces(?P<header2>. )
Named group header2 match 1 chars
)?
Close group and make it optional$
End of string
See a regex demo and a Python demo.
Example
import pandas as pd
strings = [
"THIS IS an example",
"ALSO this",
"ONE LAST",
"J. one more"
]
df1 = pd.DataFrame(strings, columns=["header"])
df1 = df1.join(
df1['header'].str.extract(
'^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$',
expand=True
)
.fillna('')
)
print(df1)
Output
header header1 header2
0 THIS IS an example THIS IS an example
1 ALSO this ALSO this
2 ONE LAST ONE LAST
3 J. one more J. one more