How to select only numbers/digits from a given string and skip text using Python regex?

Given Strings:

57 years, 67 daysApr 30, 1789

61 years, 125 daysMar 4, 1797

57 years, 325 daysMar 4, 1801

57 years, 353 daysMar 4, 1809

58 years, 310 daysMar 4, 1817

In regex101:

Pattern = (?P<Years>[\d]{1,2}) years, (?P<Days>[\d]{1,3}) days(?P<Month>[\w]{3} [\d]{1,2}), (?P<Year>[\d]{4})

Output: (screenshot of the regex pattern's matches)

In Python (IDE: Jupyter Notebook): (screenshot of the Python output)

Here the DataFrame shows only NaN values. How can I solve this?

CodePudding user response:

Use:

import pandas as pd

# Preparing data: one input string per DataFrame row
string = """57 years, 67 daysApr 30, 1789
61 years, 125 daysMar 4, 1797
57 years, 325 daysMar 4, 1801
57 years, 353 daysMar 4, 1809
58 years, 310 daysMar 4, 1817"""
df = pd.DataFrame(string.split('\n'))

# Solution: a raw string keeps the backslashes in \d and \w intact
temp = df[0].str.extractall(r'(?P<Years>[\d]{1,2}) years, (?P<Days>[\d]{1,3}) days(?P<Month>[\w]{3} [\d]{1,2}), (?P<Year>[\d]{4})')

Output:

        Years Days   Month  Year
  match
0 0        57   67  Apr 30  1789
1 0        61  125   Mar 4  1797
2 0        57  325   Mar 4  1801
3 0        57  353   Mar 4  1809
4 0        58  310   Mar 4  1817
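
Since the question asks for the numbers themselves, it may help to note that the captured groups come back as strings and that extractall adds a "match" index level. The following is a minimal sketch, assuming every row contains exactly one match; the names lines, all_matches and flat are illustrative and not from the original post.

```python
import pandas as pd

# Two of the sample strings from the question
lines = [
    "57 years, 67 daysApr 30, 1789",
    "61 years, 125 daysMar 4, 1797",
]
df = pd.DataFrame(lines)

pattern = (
    r"(?P<Years>\d{1,2}) years, (?P<Days>\d{1,3}) days"
    r"(?P<Month>\w{3} \d{1,2}), (?P<Year>\d{4})"
)

# extractall returns every match per row, so the index gains a 'match' level
all_matches = df[0].str.extractall(pattern)

# Assumption: one match per row, so dropping 'match' restores the original index
flat = all_matches.droplevel("match")

# Captured groups are strings; cast the purely numeric columns to integers
flat = flat.astype({"Years": int, "Days": int, "Year": int})

print(flat.dtypes)
```

If a row could contain several matches, dropping the "match" level would produce duplicate index labels, so in that case it is probably safer to keep the MultiIndex.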

CodePudding user response:

FYI, your code ran perfectly for me; maybe you have some whitespace issues in your DataFrame:

import pandas as pd
import numpy as np

from io import StringIO

st = StringIO("""57 years, 67 daysApr 30, 1789

61 years, 125 daysMar 4, 1797

57 years, 325 daysMar 4, 1801

57 years, 353 daysMar 4, 1809

58 years, 310 daysMar 4, 1817""")

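# The separator never occurs in the data, so each line is read as a single column
df = pd.read_csv(st, sep=r'\s\s\s ', header=None, engine='python')

# Raw string so the backslashes in \d and \w reach the regex engine unchanged
Pattern = r'(?P<Years>[\d]{1,2}) years, (?P<Days>[\d]{1,3}) days(?P<Month>[\w]{3} [\d]{1,2}), (?P<Year>[\d]{4})'

df[0].str.extract(Pattern)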

Output:

  Years Days   Month  Year
0    57   67  Apr 30  1789
1    61  125   Mar 4  1797
2    57  325   Mar 4  1801
3    57  353   Mar 4  1809
4    58  310   Mar 4  1817
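
One way to check the whitespace theory is to print the raw values with repr() and normalize them before extracting. The sketch below is only an illustration under an assumed cause: the sample rows with a non-breaking space (\xa0) and stray padding are hypothetical, since the asker's actual data is not shown.

```python
import pandas as pd

# Hypothetical input: row 0 uses non-breaking spaces (\xa0), row 1 has stray padding
raw = pd.DataFrame([
    "57\xa0years, 67\xa0daysApr 30, 1789",
    "  61 years, 125 daysMar 4, 1797 ",
])

pattern = (
    r"(?P<Years>\d{1,2}) years, (?P<Days>\d{1,3}) days"
    r"(?P<Month>\w{3} \d{1,2}), (?P<Year>\d{4})"
)

# repr() makes invisible characters such as \xa0 visible
print(raw[0].map(repr).tolist())

# The literal spaces in the pattern do not match \xa0, so row 0 yields NaN
before = raw[0].str.extract(pattern)

# Collapse any run of Unicode whitespace into one plain space, then trim the ends
cleaned = raw[0].str.replace(r"\s+", " ", regex=True).str.strip()
after = cleaned.str.extract(pattern)

print(before)
print(after)
```

If every row still comes back as NaN after normalizing, the cause is likely something other than whitespace, for example a column that does not hold the expected strings.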