df[1]
output:
index id | Raw Text |
---|---|
ROW1 | STARTWORD: MULTILINE TEXT TO COL1 ENDWORD(=STARTWORD2) MULTILINE TEXT TO COL2 ENDWORD2(=STARTWORD3) MULTILINE TEXT TO COL3 ENDWORD3 |
- the start and end words are the same for all rows, but their positions differ.
- the final result should look like this:
Final_df.head()
output:
TextID | col1 | col2 | col3 | col4 |
---|---|---|---|---|
ROW1 | extracted text1 | extracted text2 | extracted text3 | extracted text4 |
CodePudding user response:
There are many ways to do the splitting. One easy way that keeps the number of split words variable:
Define the splitting of a text into substrings as a function, so it is easy to apply to the DataFrame:
import re

def split_by_words(text_to_split, split_words, end_word):
    # work on a copy so the caller's list is not emptied when the function
    # is applied to several rows
    split_words = list(split_words)
    # remove the first start word and the final end word from the text,
    # as we don't need them for splitting
    text_to_split = text_to_split[len(split_words.pop(0)):-len(end_word)]
    # move through split_words from left to right, split by that word and
    # append the left part to cols; each used split word is popped off the list
    cols = []
    while len(split_words) > 1:
        col, text_to_split = re.split(split_words.pop(0), text_to_split, maxsplit=1)
        cols.append(col)
    # when only one split word (= two cols) is left: split the remaining text
    # into these two cols and append them
    second_last_col, last_col = re.split(split_words.pop(0), text_to_split, maxsplit=1)
    cols.append(second_last_col)
    cols.append(last_col)
    # return the split cols as a tuple
    return tuple(cols)
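Called directly on a single string (made-up markers here, just to show the return value), the helper returns one tuple element per column:
# illustrative markers, not from the question's data
example = split_by_words(
    "START1 text for col1 START2 text for col2 START3 text for col3 END",
    ["START1", "START2", "START3"],
    "END",
)
print(example)  # (' text for col1 ', ' text for col2 ', ' text for col3 ')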
Apply it to a new DataFrame to create three columns from the returned tuple for each row:
import pandas

text = "STARTWORD1 multiline text to col1 STARTWORD2 multiline text to col2 STARTWORD3 multiline text to col3 ENDWORD3"
df = pandas.DataFrame([text], columns=["Raw Text"])
split_words = ["STARTWORD1", "STARTWORD2", "STARTWORD3"]
end_word = "ENDWORD3"

new_df = pandas.DataFrame()
new_df[['col1', 'col2', 'col3']] = df.apply(
    lambda row: pandas.Series(split_by_words(str(row["Raw Text"]), split_words, end_word)),
    axis=1,
)
print(new_df)
>                      col1                     col2                     col3
> 0  multiline text to col1  multiline text to col2  multiline text to col3
EDIT: made the number of split words dynamic
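For example, the same helper produces four columns if you pass four split words (illustrative markers and text, not from the question's data):
text4 = "START1 text for col1 START2 text for col2 START3 text for col3 START4 text for col4 END"
df4 = pandas.DataFrame([text4], columns=["Raw Text"])
four_col_df = df4.apply(
    lambda row: pandas.Series(split_by_words(str(row["Raw Text"]),
                                             ["START1", "START2", "START3", "START4"],
                                             "END")),
    axis=1,
)
four_col_df.columns = ["col1", "col2", "col3", "col4"]
print(four_col_df)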
CodePudding user response:
Even though it's not entirely clear what you want to achieve, I think you're looking for pandas' str.extract.
Make some test data
import pandas as pd
import re # for the re.DOTALL flag
data = {"index_id": ["ROW1"],
        "raw_text": ["STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULTILINE\nTEXT TO COL2 ENDWORD2 MULTILINE TEXT TO COL3 ENDWORD3"]}
df = pd.DataFrame(data).set_index("index_id")
df
looks like this:
                                                   raw_text
index_id
ROW1      STARTWORD: MULTILINE TEXT TO COL1 ENDWORD MULT...
Extract the columns
The following matches everything in between the split words, as long as the order in the list matches the order of their occurrence in the raw string. (You need the re.DOTALL flag so that the dot . also matches newlines.)
split_words = ["STARTWORD:", "ENDWORD", "ENDWORD2", "ENDWORD3"]
new_df = df.raw_text.str.extract("(.+)".join(split_words), flags=re.DOTALL)
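For reference, the joined pattern that extract receives is simply:
print("(.+)".join(split_words))
# STARTWORD:(.+)ENDWORD(.+)ENDWORD2(.+)ENDWORD3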
Result
                               0                        1                       2
index_id
ROW1      MULTILINE TEXT TO COL1  MULTILINE\nTEXT TO COL2  MULTILINE TEXT TO COL3
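If you also want the columns named col1, col2 and col3 right away (as in the question's Final_df), one option (a variant not shown above) is to spell the same pattern out with named groups; str.extract then uses the group names as column headers:
pattern = ("STARTWORD:(?P<col1>.+)"
           "ENDWORD(?P<col2>.+)"
           "ENDWORD2(?P<col3>.+)"
           "ENDWORD3")
named_df = df.raw_text.str.extract(pattern, flags=re.DOTALL)
print(named_df)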