I have a pandas DataFrame in which each row contains a document and its id. I would like to split each document into paragraphs while keeping the corresponding id on every paragraph:
ORIGINAL DATAFRAME
import pandas as pd
data = {'id': ['1', '2', '3'], 'doc': ['Paragraph 1.\nParagraph 2.\nParagraph 3.', 'Paragraph 1.\nParagraph 2.', 'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.']}
df = pd.DataFrame(data)
DESIRED OUTPUT (one paragraph per row with respective id)
id doc
1 Paragraph 1.
1 Paragraph 2.
1 Paragraph 3.
2 Paragraph 1.
2 Paragraph 2.
3 Paragraph 1.
3 Paragraph 2.
3 Paragraph 3.
3 Paragraph 4.
I have the following function:
def split_into_paragraphs(dataframe):
    docs = dataframe.to_json(orient="records")
    paragraphs: List[str] = []
    doc_indices: List[str] = []
    for doc in docs:
        for para in range(len(doc)):
            paragraphs.append(str(doc[para].split("\n")))
        doc_indices = doc_indices [doc["id"]]

    return (doc_indices, paragraphs)
I am getting this error:
TypeError Traceback (most recent call last)
c:path.ipynb Cellule 32 in <cell line: 1>()
----> 1 df2 = split_into_paragraphs(df)
c:path.ipynb in split_into_paragraphs(dataframe)
8 for para in range(len(doc)):
9 paragraphs.append(str(doc[para].split("\n")))
---> 10 doc_indices = doc_indices [doc["id"]]
12 return (doc_indices, paragraphs)
TypeError: string indices must be integers
Since I am new to Python, I am having trouble figuring out what to do. I am not sure whether the problem is in the function itself or in how I am calling it:
df2 = df.apply(split_into_paragraphs)
A final note that might be important: in my real dataframe, ids can also be a combination of numbers and words (instead of numbers only).
I would appreciate your help figuring out this problem!
CodePudding user response:
Thank you for providing code to reproduce your df. You could use the apply function to call a lambda function that does what you need, for instance:
df = df.apply(lambda x: x.str.split('\n').explode())
This lambda function splits the strings wherever it finds \n and then explodes the resulting lists into new rows.
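As a variant, here is a minimal sketch that splits only the doc column and lets DataFrame.explode repeat the id for each paragraph. It assumes the column names id and doc from the question and simply reuses the question's function name; ignore_index=True assumes pandas 1.1 or later. Since the id column is never parsed, ids that mix numbers and words should be carried over unchanged.

```python
import pandas as pd

data = {
    "id": ["1", "2", "3"],
    "doc": [
        "Paragraph 1.\nParagraph 2.\nParagraph 3.",
        "Paragraph 1.\nParagraph 2.",
        "Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.",
    ],
}
df = pd.DataFrame(data)


def split_into_paragraphs(dataframe: pd.DataFrame) -> pd.DataFrame:
    """Return one row per paragraph, repeating the id of its source document."""
    # Turn each document string into a list of paragraphs.
    as_lists = dataframe.assign(doc=dataframe["doc"].str.split("\n"))
    # explode() gives every list element its own row; the other columns
    # (here: id) are repeated. ignore_index=True renumbers the rows from 0.
    return as_lists.explode("doc", ignore_index=True)


# Call the function on the whole DataFrame rather than through df.apply,
# which would pass each column to the function separately.
df2 = split_into_paragraphs(df)
print(df2)
```

As for the TypeError in the question: to_json returns a single string, so the loop iterates over its characters, and indexing a character with "id" fails with "string indices must be integers".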