Split Text Column Rows if length excess certain number-CodePudding

I have a dataframe that looks like below. I want to split the "text" column rows if the length is more than 5000 (Has more than 5000 characters).


         doc    name         text           len
    0   doc_1   Texas   I have a dream....  6221
    1   doc_2   Georgia I love eating....   13300
    2   doc_2   Idaho   Yesterday was....   5000
    :     :       :          :               :
    :     :       :          :               :
    n   doc_n    NY     Big Apple is....    2407

I want to create a function to check/loop through the whole dataframe to see if the length of the "text" column (number of characters) is more than 5000, then it will split the rows till all the rows are less than 5000.

The output that I want will look something like this below


         doc    name         text           len
    0   doc_1   Texas   I have a dream....  3110
    1   doc_1   Texas   This dream is a.... 3111
    2   doc_2   Georgia I love eating....   3325
    3   doc_2   Georgia Chick fil A is....  3325
    4   doc_2   Georgia Worl of coke....    3325
    5   doc_2   Georgia Atlanta Falcons.... 3325
    6   doc_2   Idaho   Yesterday was....   2500
    7   doc_2   Idaho   The potatoes....    2500
    :     :       :          :               :
    :     :       :          :               :
    n   doc_n    NY     Big Apple is....    2407

I have been trying to figure out how to do that but still cannot find a good way to solve it. It will be great if any of you can give me some advice or suggestions on it. Thanks!

CodePudding user response：

You can do this with some list slicing magic. It looks like you want to split the text at the end of a sentence, so the logic below takes that into account.

You'll need to implement some logic to insert the resulting list of strings into your dataframe.
I also ignored the even size distribution you seem to have in your resulting dataframe.

for input, I generated a simple lorem ipsum text from one of the many online generators.

def split_text_at_end_of_sentence(text: str, length=5000, trim=True) -> list:
    result = []
    while len(text) > length:
        # if needed, trim leading space (which you will have when splitting text on EOL characters)
        if trim:
            text = text.lstrip()
        # find end of line character
        end_of_line = text[:length].rfind('.')
        # if not found, set eol to current length
        if end_of_line == -1:
            end_of_line = length
        # append current piece of text to result
        result.append(text[:end_of_line 1])
        # remove appended piece of text from working text
        text = text[end_of_line 1:]
    # at end of loop, add remaining text to result
    if len(text) > 0:
        if trim:
            text = text.lstrip()
        result.append(text)
    return result



with open('lorem_ipsum.txt') as infile:
    lorem_ipsum = infile.read()


result = split_text_at_end_of_sentence(lorem_ipsum)


for line in result:
    print(line)
    print('-------------------------------------------------------')


resulting_size = 0
for line in result:
    resulting_size  = len(line)
    print(len(line))


print(f"Original size:\n  {len(lorem_ipsum)}\nResulting size:\n  {resulting_size}\nSize difference:  {len(lorem_ipsum)-resulting_size}")

I've added some print statements so you can see what is actually being done. Note the difference in the resulting text size if you allow it to trim the leading spaces. You may need to set trim to False depending on your requirements...

EDIT AFTER COMMENTS

The input text is horrible :-/
There's no line with length > 5000...

That being said, the logic needed some updating to deal with it. The code below loads the input from sample.csv and splits it on several possible characters, recounts the length of the text column, and appends it to the outputdf. I've changed the split to 1000 characters, as there was no row where the text column had more than 5000 characters.

NOTE: You may want to update the splitting logic to just split on the last space character in a text field (eg normal wordwrapping) as the text seems to use quotes quite a bit, and splitting on punctuation might look odd when it happens on a quote....

import pandas as pd


def split_text_at_end_of_sentence(text: str, length=5000, trim=True) -> list:
    result = []
    # characters on which to split
    splitchars = ".,?!"
    while len(text) > length:
        # if needed, trim leading space (which you will have when splitting text on EOL characters)
        if trim:
            text = text.lstrip()
        # find end of line character
        end_of_line = -1
        for c in splitchars:
            end_of_line = max(end_of_line, text[:length].rfind(c))
        # if not found, try finding last space character
        if end_of_line == -1:
            end_of_line = text[:length].rfind(" ")
        # if still not found, set eol to current length
        if end_of_line == -1:
            end_of_line = length
        # append current piece of text to result
        result.append(text[: end_of_line   1])
        # remove appended piece of text from working text
        text = text[end_of_line   1 :]
    # at end of loop, add remaining text to result
    if len(text) > 0:
        if trim:
            text = text.lstrip()
        result.append(text)
    return result


inputdf = pd.read_csv("sample.csv")
outputdf = pd.DataFrame(columns=["doc_name", "mention", "text", "length"])
for index, row in inputdf.iterrows():
    for split_text in split_text_at_end_of_sentence(row["text"], length=1000):
        outputdf = outputdf.append(
            [
                {
                    "doc_name": row["doc_name"],
                    "mention": row["mention"],
                    "text": split_text,
                    "length": len(split_text),
                }
            ]
        )
outputdf.to_csv("output.csv", index=False)

CodePudding user response：

Here you have a possible solution using slicing (meaning, each text longer than 5000 characters will be cut at said length a the remaining text will continue in the next row):

df = pd.DataFrame({
    "doc": ["doc_1", "doc_2"],
    "name": ["Texas", "Georgia"],
    "text": ["This is a random text", "This is another text that I want to split in multiple rows"],
})
# Use a shorter size for simplicity
MAX_TEXT_LENGTH = 25

We'll split the text in heads and tails. Head contains the first MAX_TEXT_LENGTH characters and tails the remainder of a slicing operation. Then, use a loop to save the heads and keep spliting the tails, and stop when there are no elements in the tail left (when there are no pieces of text longer than MAX_TEXT_LENGTH):

# Save head parts: string[:25]
head_part = df["text"].str.slice(0, MAX_TEXT_LENGTH)
# Save tail parts: string[25:]
tail_part = df[df["text"].str.len() > MAX_TEXT_LENGTH]["text"].str.slice(MAX_TEXT_LENGTH)

while len(tail_part):
    head_part = pd.concat([head_part, tail_part.str.slice(0, MAX_TEXT_LENGTH)])
    tail_part = tail_part[tail_part.str.len() > MAX_TEXT_LENGTH].str.slice(MAX_TEXT_LENGTH)

# Connect the text pieces with your original data 
df = df[["doc", "name"]].merge(pd.concat([head_part, tail_part]), left_index=True, right_index=True)
# Add the len column if you want
df["len"] = df["text"].str.len()

This should result in something like this:

     doc     name                       text  len
0  doc_1    Texas      This is a random text   21
1  doc_2  Georgia  This is another text that   25
1  doc_2  Georgia   I want to split in multi   25
1  doc_2  Georgia                   ple rows    8