Home > Software design >  Splitting a pandas Data frame column depending if word is capital or not with Greek support
Splitting a pandas Data frame column depending if word is capital or not with Greek support

Time:12-16

I have a data frame column in the below format:

header
THIS IS an example
ΕΛΕΥΘΕΡΙΟ Άγιος Γεώργιος
ALSO this
ONE LAST
J. one more

I want to split it into two columns:

header1 header2
THIS IS an example
ALSO this
ONE LAST null
null J. one more

I have tried extracting the information like this:

df1['header'].str.split('[A-Z]', expand=True)

but my regular expressions are not up to par. Any help is much appreciated!

CodePudding user response:

Using str.extract we can try:

df["header1"] = df["header"].str.extract(r'^([A-Z] (?: [A-Z] )?)')
df["header2"] = df["header"].str.extract(r'\b([a-z] (?: [a-z] )?)')

CodePudding user response:

Greek Letter Notice

To only match Greek uppercase letters, replace [A-Z] in below patterns with [\u0391-\u03A1\u03A3-\u03A9]. To match both ASCII and Greek uppercase letters use [a-zA-Z\u0391-\u03A1\u03A3-\u03A9].

I.e.

rx = r'^\s*(?P<header1>(?:[\u0391-\u03A1\u03A3-\u03A9] \b(?!\.)(?:\s [\u0391-\u03A1\u03A3-\u03A9] )*\b)?)(?:\s (?P<header2>.*))?'
new_df = df['header'].str.extract(rx, expand=True)

See the regex demo.

You can use

df[['header1', 'header2']] = df['header'].str.extract(r'^\s*((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(.*)', expand=True)

Output:

>>> df
               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more           

See the regex demo.

Details:

  • ^ - start of string
  • \s* - zero or more whitespaces
  • ((?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?) - Group 1 (header1): an optional sequence of one or more uppercase ASCII letters (not followed with a . char) and then zero or more sequences of one or more whitespaces and one or more uppercase ASCII letters
  • \s* - zero or more whitespaces
  • (.*) - Group 2 (header2): any zero or more chars other than line break chars, as many as possible.

You may extract to a new dataframe using named capturing groups:

>>> new_df = df['header'].str.extract(r'^\s*(?P<header1>(?:[A-Z] \b(?!\.)(?:\s [A-Z] )*)?)\s*(?P<header2>.*)', expand=True)
>>> new_df
    header1      header2
0   THIS IS   an example
1      ALSO         this
2  ONE LAST             
3            J. one more

CodePudding user response:

You can also use 2 named capture groups, and join the columns.

^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$

(or use [a-z].* instead of . if it must start with a lowercase char)

  • ^ Start of string
  • (?P<header1>[A-Z] Capture group header1, match 1 chars A-Z
  • (?:[^\S\n] [A-Z] )*)? Optionally match spaces and 1 chars A-Z
  • (?: Non capture group
    • (?:^|[^\S\n] ) Either assert the start of the string or match 1 spaces
    • (?P<header2>. ) Named group header2 match 1 chars
  • )? Close group and make it optional
  • $ End of string

See a regex demo and a Python demo.

Example

import pandas as pd

strings = [
    "THIS IS an example",
    "ALSO this",
    "ONE LAST",
    "J. one more"
]
df1 = pd.DataFrame(strings, columns=["header"])
df1 = df1.join(
    df1['header'].str.extract(
        '^(?P<header1>[A-Z] (?:[^\S\n] [A-Z] )*)?(?:(?:^|[^\S\n] )(?P<header2>. ))?$',
        expand=True
    )
    .fillna('')
)

print(df1)

Output

               header   header1      header2
0  THIS IS an example   THIS IS   an example
1           ALSO this      ALSO         this
2            ONE LAST  ONE LAST             
3         J. one more            J. one more

       
  • Related