df[1]
output:
index id | Raw Text |
---|---|
ROW1 | STARTWORD: MULTILINE TEXT TO COL1 ENDWORD(=STARTWORD2) MULTILINE TEXT TO COL2 ENDWORD2(=STARTWORD3) MULTILINE TEXT TO COL3 ENDWORD3 |
- the start and end words are the same for all rows, but their positions differ.
- the final result should look like this:
Final_df.head()
output:
TextID | col1 | col2 | col3 | col4 |
---|---|---|---|---|
ROW1 | extracted text1 | extracted text2 | extracted text3 | extracted text4 |
CodePudding user response:
There are many ways to do the splitting. One easy way that keeps the number of split words variable:
Define the splitting of a text into substrings as a function, so it is easy to apply to the DataFrame:
import re

def split_by_words(text_to_split, split_words, end_word):
    # work on a copy so the caller's list is not emptied when the function
    # is applied to several rows
    split_words = list(split_words)
    # remove the first start word and the final end word from the text,
    # as we don't need them for splitting
    text_to_split = text_to_split[len(split_words.pop(0)):-len(end_word)]
    # move through split_words from left to right, split by that word and
    # append the left part to cols; each used split word is popped off the list
    cols = []
    while len(split_words) > 1:
        col, text_to_split = re.split(split_words.pop(0), text_to_split, maxsplit=1)
        cols.append(col)
    # when only one split word (= two cols) is left: split the remaining text
    # into these two cols and append them
    second_last_col, last_col = re.split(split_words.pop(0), text_to_split, maxsplit=1)
    cols.append(second_last_col)
    cols.append(last_col)
    # return the split cols as a tuple
    return tuple(cols)
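Called directly on a single string (made-up markers here, just to show the return value), the helper returns one tuple element per column:
# illustrative markers, not from the question's data
example = split_by_words(
    "START1 text for col1 START2 text for col2 START3 text for col3 END",
    ["START1", "START2", "START3"],
    "END",
)
print(example)  # (' text for col1 ', ' text for col2 ', ' text for col3 ')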
Apply it to a new DataFrame to create three columns from the returned tuple for each row:
import pandas

text = "STARTWORD1 multiline text to col1 STARTWORD2 multiline text to col2 STARTWORD3 multiline text to col3 ENDWORD3"
df = pandas.DataFrame([text], columns=["Raw Text"])
split_words = ["STARTWORD1", "STARTWORD2", "STARTWORD3"]
end_word = "ENDWORD3"

new_df = pandas.DataFrame()
new_df[['col1', 'col2', 'col3']] = df.apply(
    lambda row: pandas.Series(split_by_words(str(row["Raw Text"]), split_words, end_word)),
    axis=1,
)
print(new_df)
>                      col1                     col2                     col3
> 0  multiline text to col1  multiline text to col2  multiline text to col3
EDIT: made the number of split words dynamic
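For example, the same helper produces four columns if you pass four split words (illustrative markers and text, not from the question's data):
text4 = "START1 text for col1 START2 text for col2 START3 text for col3 START4 text for col4 END"
df4 = pandas.DataFrame([text4], columns=["Raw Text"])
four_col_df = df4.apply(
    lambda row: pandas.Series(split_by_words(str(row["Raw Text"]),
                                             ["START1", "START2", "START3", "START4"],
                                             "END")),
    axis=1,
)
four_col_df.columns = ["col1", "col2", "col3", "col4"]
print(four_col_df)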
CodePudding user response:
Even though it's not entirely clear what you want to achieve, I think you're looking for pandas' str.extract.
Make some test data
import pandas as pd
import re # for the re.DOTALL flag
data = {"index_id": ["ROW1"],
        "raw_text": ["STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULTILINE\nTEXT TO COL2 ENDWORD2 MULTILINE TEXT TO COL3 ENDWORD3"]}
df = pd.DataFrame(data).set_index("index_id")
df
looks like this:
                                                   raw_text
index_id
ROW1      STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULT...
Extract the columns
The following matches everything in between the split words, as long as the order in the list matches the order of their occurrence in the raw string. (You need the re.DOTALL flag so that the dot . also matches newlines.)
split_words = ["STARTWORD:", "ENDWORD", "ENDWORD2", "ENDWORD3"]
new_df = df.raw_text.str.extract("(.+)".join(split_words), flags=re.DOTALL)
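For reference, the joined pattern that extract receives is simply:
print("(.+)".join(split_words))
# STARTWORD:(.+)ENDWORD(.+)ENDWORD2(.+)ENDWORD3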
Result
                               0                        1                       2
index_id
ROW1      MULTILINE TEXT TO COL1  MULTILINE\nTEXT TO COL2  MULTILINE TEXT TO COL3
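If you also want the columns named col1, col2 and col3 right away (as in the question's Final_df), one option (a variant not shown above) is to spell the same pattern out with named groups; str.extract then uses the group names as column headers:
pattern = ("STARTWORD:(?P<col1>.+)"
           "ENDWORD(?P<col2>.+)"
           "ENDWORD2(?P<col3>.+)"
           "ENDWORD3")
named_df = df.raw_text.str.extract(pattern, flags=re.DOTALL)
print(named_df)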