I have a DataFrame like the one below:
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}
I would like to create a regex that extracts everything after "Thrown: lib: " up to the first \n. I will call this "group 01". It should give me this:
data = {'c3':['this is problem type 01', 'this is problem type 01', 'this is problem type 02', 'this is problem type 04']}
Then I want a second regex that extracts everything after "group 01" (the match of the previous regex), skipping the \t and \n between the two sentences and continuing up to the next \n. It should give me this:
data = {'c4':['Error executing the statement: error statement 1', 'Error executing the statement: error statement 3', 'Error executing the statement: error statement2', 'Error executing the statement: error statement1']}
In the end I want my dataframe to be like this:
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1'],
'c3':['this is problem type 01',
'this is problem type 01',
'this is problem type 02',
'this is problem type 04'],
'c4':['Error executing the statement: error statement 1',
'Error executing the statement: error statement 3',
'Error executing the statement: error statement2',
'Error executing the statement: error statement1'],
'c2':["one", "two", "three", "four"]}
This is what I have so far. I was trying to extract from "Thrown: lib: " up to the first \n, but it does not work:
df = pd.DataFrame(data)
df['exception'] = df['c1'].str.extract(r'Thrown: lib: (.*(?:\r?\n.*)*)', expand=False)
Any help is appreciated, thank you :)
CodePudding user response:
This could probably be done as a one-liner, but something like this works:
import re
import pandas as pd
data = {'c1':['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
'c2':["one", "two", "three", "four"]}
df = pd.DataFrame(data)
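# Raw strings keep the regex escapes intact; [^\n]* stops at the first newline
pattern1 = r'Thrown: lib: ([^\n]*)\n'
df['c3'] = df['c1'].str.extract(pattern1, expand=False).str.strip()
# Capture from "Error" to the end of that line ("." does not match a newline)
pattern2 = r'(Error.*)\n'
df['c4'] = df['c1'].str.extract(pattern2, expand=False).str.strip()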
Output:
print(df.to_string())
c1 c2 c3 c4
0 Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n one this is problem type 01 Error executing the statement: error statement 1
1 Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n two this is problem type 01 Error executing the statement: error statement 3
2 Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n three this is problem type 02 Error executing the statement: error statement2
3 Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n four this is problem type 04 Error executing the statement: error statement1
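For the one-liner mentioned above, here is a minimal sketch that assumes the error line always follows the "Thrown: lib:" line with only whitespace in between. The combined pattern and the named groups c3 and c4 are my own choice, not something from the question:

```python
import pandas as pd

data = {'c1': ['Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 1\n',
               'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 01\n \t\n \tError executing the statement: error statement 3\n',
               'Level: LOGGING_ONLY\n Thrown: lib: this is problem type 02\n \t\n \tError executing the statement: error statement2\n',
               'Level: NOT_LOGGING_ONLY\n Thrown: lib: this is problem type 04\n \t\n \tError executing the statement: error statement1\n'],
        'c2': ["one", "two", "three", "four"]}
df = pd.DataFrame(data)

# Group c3: rest of the "Thrown: lib:" line.
# \n\s* then skips the newline plus the blank/tab-only line in between.
# Group c4: the next line that has content, up to its newline.
pattern = r'Thrown: lib: (?P<c3>[^\n]*)\n\s*(?P<c4>[^\n]*)'

# With named groups, str.extract returns a DataFrame whose columns are c3 and c4.
df = df.join(df['c1'].str.extract(pattern))
print(df[['c2', 'c3', 'c4']].to_string())
```

Rows where the pattern does not match would get NaN in both new columns.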
CodePudding user response:
I would use the re package:
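import re
# re.search(...).group(1) gives one string per row (re.findall would give a list per row)
data['c3'] = [re.search(r"Thrown: lib: ([^\n]+)", x).group(1) for x in data['c1']]
data['c4'] = [re.split("\n", x)[3].strip() for x in data['c1']]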
- The first pattern extracts everything between "Thrown: lib: " and the first newline.
- The second line assumes that the relevant message is always the 4th token when the string is split by \n, which seems to be the case here.
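If some rows might not contain a "Thrown: lib:" line at all, a variant that does not rely on the token position may be safer. This is only a sketch: extract_fields and PATTERN are hypothetical names I made up, and it assumes the error message is the next non-blank line after the "Thrown: lib:" line.

```python
import re

# Hypothetical helper, not part of the answer above.
# Group 1: rest of the "Thrown: lib:" line; group 2: next line with content.
PATTERN = re.compile(r"Thrown: lib: ([^\n]+)\n\s*([^\n]+)")

def extract_fields(text):
    """Return (problem, error) for one log entry, or (None, None) if not found."""
    match = PATTERN.search(text)
    if match is None:
        return None, None
    return match.group(1).strip(), match.group(2).strip()

sample = ('Level: LOGGING_ONLY\n Thrown: lib: this is problem type 01\n'
          ' \t\n \tError executing the statement: error statement 1\n')
print(extract_fields(sample))
print(extract_fields('Level: LOGGING_ONLY\n nothing thrown here\n'))
```

Applied to the question's data, the two columns could then be filled with something like c3, c4 = zip(*(extract_fields(x) for x in data['c1'])).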