Splitting a pandas Data frame column depending if word is capital or not-CodePudding

I have a data frame column in the below format:

header
THIS IS an example
ALSO this
ONE LAST
J. one more

I want to split it into two columns:

header1	header2
THIS IS	an example
ALSO	this
ONE LAST	null
null	J. one more

I have tried extracting the information like this:

df1['header'].str.split('[A-Z]', expand=True)

but my regular expressions are not up to par. Any help is much appreciated!

CodePudding user response：

Using str.extract we can try:

df["header1"] = df["header"].str.extract(r'^([A-Z] (?: [A-Z] )?)')
df["header2"] = df["header"].str.extract(r'\b([a-z] (?: [a-z] )?)')

CodePudding user response：

Greek Letter Notice

To only match Greek uppercase letters, replace [A-Z] in below patterns with [\u0391-\u03A1\u03A3-\u03A9]. To match both ASCII and Greek uppercase letters use [a-zA-Z\u0391-\u03A1\u03A3-\u03A9].

I.e.

rx = r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9] \b(?!\.)(?:\s [\u0391-\u03A1\u03A3-\u03A9] )*\b)?)(?:\s (?P<header2>.*))?'
new_df = df['header'].str.extract(rx, expand=True)

See the regex demo.

You can use

df[['header1', 'header2']] = df['header'].str.extract(r'^\s*((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(.*)', expand=True)

Output:

>>> df
               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more

See the regex demo.

Details:

^ - start of string
\s* - zero or more whitespaces
((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?) - Group 1 (header1): an optional sequence of one or more uppercase ASCII letters (not followed with a . char) and then zero or more sequences of one or more whitespaces and one or more uppercase ASCII letters
\s* - zero or more whitespaces
(.*) - Group 2 (header2): any zero or more chars other than line break chars, as many as possible.

You may extract to a new dataframe using named capturing groups:

>>> new_df = df['header'].str.extract(r'^\s*(?P<header1>(?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(?P<header2>.*)', expand=True)
>>> new_df
    header1      header2
0   THIS IS   an example
1      ALSO         this
2  ONE LAST             
3            J. one more

CodePudding user response：

You can also use 2 named capture groups, and join the columns.

^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$

(or use [a-z].* instead of . if it must start with a lowercase char)

^ Start of string
(?P<header1>[A-Z] Capture group header1, match 1 chars A-Z
(?:[^\S\n] [A-Z] )*)? Optionally match spaces and 1 chars A-Z
(?: Non capture group
- (?:^|[^\S\n] ) Either assert the start of the string or match 1 spaces
- (?P<header2>. ) Named group header2 match 1 chars
)? Close group and make it optional
$ End of string

See a regex demo and a Python demo.

Example

import pandas as pd

strings = [
    "THIS IS an example",
    "ALSO this",
    "ONE LAST",
    "J. one more"
]
df1 = pd.DataFrame(strings, columns=["header"])
df1 = df1.join(
    df1['header'].str.extract(
        '^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$',
        expand=True
    )
    .fillna('')
)

print(df1)

Output

               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more