Split text rows in dataframe into paragraphs and keep document id

I have a pandas DataFrame whose rows contain documents and their respective ids. I would like to split each document into paragraphs while keeping the respective id:

ORIGINAL DATAFRAME

import pandas as pd

data = {'id': ['1', '2', '3'], 'doc': ['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.']}

df = pd.DataFrame(data)

DESIRED OUTPUT (one paragraph per row with respective id)


id    doc
1     Paragraph 1.
1     Paragraph 2.
1     Paragraph 3.
2     Paragraph 1.
2     Paragraph 2.
3     Paragraph 1.
3     Paragraph 2.
3     Paragraph 3.
3     Paragraph 4.

I have the following function:

def split_into_paragraphs(dataframe): 
    
    docs = dataframe.to_json(orient="records")
    paragraphs: List[str] = []
    doc_indices: List[str] = []

    for doc in docs:         
        for para in range(len(doc)):
            paragraphs.append(str(doc[para].split("\n")))
            doc_indices = doc_indices + [doc["id"]]

    return (doc_indices, paragraphs)

I am getting this error:

TypeError                                 Traceback (most recent call last)
c:path.ipynb Cellule 32 in <cell line: 1>()
----> 1 df2 = split_into_paragraphs(df)

c:path.ipynb in split_into_paragraphs(dataframe)
      8     for para in range(len(doc)):
      9         paragraphs.append(str(doc[para].split("\n")))
---> 10         doc_indices = doc_indices + [doc["id"]]
     12 return (doc_indices, paragraphs)

TypeError: string indices must be integers

Since I am new to Python, I am having trouble figuring out what to do. I am not sure whether the problem is in the function itself or in how I am calling it:

df2 = df.apply(split_into_paragraphs)

A final note that might be important: in my real dataframe, ids can also be a combination of numbers and words (instead of numbers only).

I would appreciate your help figuring out this problem!

Answer:

Thank you for providing code to reproduce your df. You could use apply to call a lambda function that does what you need, for instance:

df = df.apply(lambda x: x.str.split('\n').explode())

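The lambda is applied to each column in turn: it splits every string wherever it finds \n and explodes the resulting lists into new rows. The id values contain no \n, so they stay as they are, and pandas aligns them with the exploded doc rows on the original index, so each paragraph keeps its document id.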
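If you would rather touch only the doc column, or want to keep something close to your original loop, here is a minimal sketch of both. It assumes the columns are named id and doc as in your sample; split_with_loop is just a name I made up for illustration. The ids are never parsed, so mixed ids such as "doc12a" should pass through unchanged.

```python
import pandas as pd

data = {
    "id": ["1", "2", "3"],
    "doc": [
        "Paragraph 1.\nParagraph 2.\nParagraph 3.",
        "Paragraph 1.\nParagraph 2.",
        "Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.",
    ],
}
df = pd.DataFrame(data)


def split_into_paragraphs(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return one row per paragraph, repeating the id of the source document."""
    return (
        # Turn each document string into a list of paragraphs.
        dataframe.assign(doc=dataframe["doc"].str.split("\n"))
        # One row per list element; the other columns (id) are repeated.
        .explode("doc")
        .reset_index(drop=True)
    )


def split_with_loop(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Loop-based variant (hypothetical name), closer to the function in the question."""
    rows = []
    # to_dict("records") yields one dict per row, so record["id"] works.
    for record in dataframe.to_dict("records"):
        for para in record["doc"].split("\n"):
            rows.append({"id": record["id"], "doc": para})
    return pd.DataFrame(rows, columns=["id", "doc"])


df2 = split_into_paragraphs(df)
print(df2)
```

As for the TypeError: to_json returns a single JSON string, so "for doc in docs" walks over that string one character at a time, and doc["id"] then tries to index a one-character string with a string key. Call the function directly as split_into_paragraphs(df); df.apply(split_into_paragraphs) would pass each column to it as a Series instead of the whole dataframe.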