Python: create a new column with regex-extracted text up to \n from a dataframe

I have a data frame like the one below:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}

I would like to create:

a regex that extracts everything after Thrown: lib: up to the first \n. I will call this "group 01". That would give me:

data = {'c3':['this is problem type 01', 
               'this is problem type 01', 
               'this is problem type 02', 
               'this is problem type 04']}

Then I want a regex that extracts everything after "group 01" (the previous match), skipping the \t and \n between the sentences and continuing up to the next \n. That would give me:

data = {'c4':['Error executing the statement: error statement 1', 
            'Error executing the statement: error statement 3', 
            'Error executing the statement: error statement2', 
            'Error executing the statement: error statement1']}

In the end I want my dataframe to be like this:

data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
        'c3':['this is problem type 01', 
              'this is problem type 01', 
              'this is problem type 02', 
              'this is problem type 04'],
        'c4':['Error executing the statement: error statement 1', 
              'Error executing the statement: error statement 3', 
              'Error executing the statement: error statement2', 
              'Error executing the statement: error statement1'],
        'c2':["one", "two", "three", "four"]}

This is what I have so far. I was trying to extract from "Thrown: lib:" up to the first \n, but it does not work.

df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)

Any help is appreciated, thank you :)

CodePudding user response：

This could probably be done as a one-liner, but something like this works:

import re
import pandas as pd


data = {'c1':['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n', 
              'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n', 
              'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2':["one", "two", "three", "four"]}



df = pd.DataFrame(data)

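# Capture the rest of the "Thrown: lib:" line (everything up to the first newline).
pattern1 = r'Thrown: lib: ([^\n]*)'
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()

# '.' does not match a newline, so the capture stops at the end of the "Error" line.
pattern2 = r'(Error.*)\n'
df['c4'] = df['c1'].str.extract(pattern2, expand=False).str.strip()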

Output:

print(df.to_string())
                                                                                                                           c1     c2                       c3                                                c4
0  Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n    one  this is problem type 01  Error executing the statement: error statement 1
1  Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n    two  this is problem type 01  Error executing the statement: error statement 3
2   Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n  three  this is problem type 02   Error executing the statement: error statement2
3   Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n   four  this is problem type 04   Error executing the statement: error statement1
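
For reference, a possible one-liner version. This is only a minimal sketch, assuming the problem text always sits on the same line as "Thrown: lib:" and the error message is the next non-blank line. The group names c3 and c4 are simply chosen to match the desired column names, and join is used to attach the two extracted columns to the original frame.

```python
import pandas as pd

data = {'c1': ['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
               'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
               'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
               'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2': ["one", "two", "three", "four"]}

df = pd.DataFrame(data)

# One pattern, two named groups:
#   c3 = the rest of the "Thrown: lib:" line (up to the first newline)
#   c4 = the next line with content; \s* skips the blank/tab-only line in between
pattern = r'Thrown: lib: (?P<c3>[^\n]*)\n\s*(?P<c4>[^\n]*)'

# With named groups, str.extract returns a DataFrame whose columns are c3 and c4.
df = df.join(df['c1'].str.extract(pattern))

print(df[['c2', 'c3', 'c4']].to_string())
```

Rows where the pattern does not match would get NaN in both new columns.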

CodePudding user response：

I would use the re package:

import re

# findall returns a list of captures, so take the first (and only) one.
data['c3'] = [re.findall(r"Thrown: lib: ([^\n]+)", x)[0] for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]

The first pattern extracts everything between Thrown: lib: and the first newline.
The second line assumes that the relevant message is always the 4th token when the string is split by \n, which seems to be the case here.
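
If the position of the message cannot be relied on, a slightly more defensive variant might look like the sketch below. It assumes the error message is the first non-blank line after the "Thrown: lib:" line. The helper names thrown_text and line_after_thrown are made up for this example, and the third input row is a hypothetical malformed entry added only to show the no-match case.

```python
import re

data = {'c1': ['Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
               'Level:     LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
               'Level: LOGGING_ONLY\n no exception here\n']}  # hypothetical row without "Thrown: lib:"

THROWN = re.compile(r"Thrown: lib: ([^\n]+)")


def thrown_text(text):
    """Return the text after 'Thrown: lib:' up to the newline, or None if absent."""
    match = THROWN.search(text)
    return match.group(1).strip() if match else None


def line_after_thrown(text):
    """Return the first non-blank line following the 'Thrown: lib:' line, or None."""
    # Drop blank and whitespace-only lines, then look for the marker line.
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    for i, line in enumerate(lines):
        if line.startswith("Thrown: lib:"):
            return lines[i + 1] if i + 1 < len(lines) else None
    return None


data['c3'] = [thrown_text(x) for x in data['c1']]
data['c4'] = [line_after_thrown(x) for x in data['c1']]

print(data['c3'])
print(data['c4'])
```

Unlike indexing with [3], this does not raise an IndexError on short entries; it returns None instead.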