How to convert a conversational text file into columns and save it as CSV in Python

This is my text file. I want to convert it into two columns, speaker and comments, and save it as a CSV. The full file is very long, so doing this programmatically would be helpful.

>bernardo11_5
Have you had quiet guard?

>francisco11_5
Not a mouse stirring.

>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.

>francisco11_6
I think I hear them.--Stand, ho! Who is there?

>horatio11_1
Friends to this ground.

>marcellus11_1
And liegemen to the Dane.

CodePudding user response:

One way is to parse the file line by line and load the result into a pandas DataFrame.

read the file

import pandas as pd

with open("test.txt") as fp:
    data = fp.readlines()

remove empty lines

# drop blank lines, including lines that contain only whitespace
data = [x for x in data if x.strip()]

separate into speaker and comments

speaker = []
comments = []

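speaker_text = ""
for value in data:
    # a speaker line starts with ">"; checking only the start avoids
    # misreading a comment that happens to contain ">" somewhere
    if value.startswith(">"):
        speaker_text = value
    else:
        # every other line is a comment by the most recent speaker
        speaker.append(speaker_text)
        comments.append(value)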

convert to dataframe

df = pd.DataFrame({
    "speaker": speaker,
    "comments": comments
})

save as csv

df.to_csv("result.csv", index=False)

output

            speaker                                          comments
0   >bernardo11_5\n                       Have you had quiet guard?\n
1  >francisco11_5\n                           Not a mouse stirring.\n
2   >bernardo11_6\n                               Well, good night.\n
3   >bernardo11_6\n           If you do meet Horatio and Marcellus,\n
4   >bernardo11_6\n    The rivals of my watch, bid them make haste.\n
5  >francisco11_6\n  I think I hear them.--Stand, ho! Who is there?\n
6    >horatio11_1\n                         Friends to this ground.\n
7  >marcellus11_1\n                       And liegemen to the Dane.\n
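The output above still carries the leading ">" and the trailing "\n" in every cell. A minimal sketch of the same line-by-line approach that strips both is shown below; `parse_lines` and the inline `sample` list are hypothetical names used only for illustration, and the sketch uses plain Python so it runs without pandas.

```python
def parse_lines(lines):
    """Return (speaker, comment) pairs, one pair per comment line."""
    rows = []
    speaker = None
    for raw in lines:
        line = raw.strip()          # removes the trailing "\n"
        if not line:
            continue                # skip blank separator lines
        if line.startswith(">"):
            speaker = line[1:]      # drop the leading ">" marker
        elif speaker is not None:
            rows.append((speaker, line))
    return rows


# Hypothetical sample in the same shape as fp.readlines() would return
sample = [
    ">bernardo11_6\n",
    "Well, good night.\n",
    "If you do meet Horatio and Marcellus,\n",
    "\n",
    ">francisco11_6\n",
    "I think I hear them.--Stand, ho! Who is there?\n",
]
rows = parse_lines(sample)
```

The resulting list of tuples can be passed to pd.DataFrame(rows, columns=["speaker", "comments"]) and saved with to_csv() as before.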

CodePudding user response:

Something like this?

import re
from pathlib import Path

import pandas as pd

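# named "text" rather than "input" to avoid shadowing the built-in input()
text = Path('input.txt').read_text()

# a speaker line starts with ">"; everything up to the next speaker line is one comment
speaker = re.findall(r"^>(.*)$", text, flags=re.MULTILINE)
comments = re.split(r"^>.*$", text, flags=re.MULTILINE)
comments = [c.strip() for c in comments if c.strip()]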

df = pd.DataFrame({'speaker': speaker, 'comments': comments})

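This gives you one row per speaker turn, with each multi-line comment kept whole, including its newline characters. For saving, you can:
a) escape '\n' as a literal backslash-n before calling to_csv()

df['comments'] = df['comments'].str.replace('\n', '\\n', regex=False)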

b) save to a more suitable format, e.g., to_parquet()
c) split each multi-line comment into one row per line

df['comments'] = df['comments'].str.split('\n')
df = df.explode('comments')  # explode() returns a new DataFrame, so assign it
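One caveat with building the two lists separately: a speaker line with no comment under it would leave the lists with different lengths. A hedged sketch of a variant that keeps each speaker paired with its own comment is shown below. It relies on re.split() returning captured groups alongside the pieces, and it uses the standard csv module on an in-memory buffer to show that a multi-line comment survives a CSV round trip because the field is quoted. The sample `text` and the variable names are assumptions for illustration.

```python
import csv
import io
import re

# Hypothetical sample in the same shape as the input file
text = """>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,

>francisco11_6
I think I hear them.--Stand, ho! Who is there?
"""

# With a capturing group, re.split() returns
# [before_first_speaker, speaker1, body1, speaker2, body2, ...]
parts = re.split(r"^>(.*)$", text, flags=re.MULTILINE)
rows = [(s.strip(), c.strip()) for s, c in zip(parts[1::2], parts[2::2])]

# Write to an in-memory buffer; fields containing "\n" are quoted by the writer
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["speaker", "comments"])
writer.writerows(rows)

# Read it back to confirm the embedded newline is preserved
buffer.seek(0)
restored = list(csv.reader(buffer))
```

To write a real file instead of a buffer, open it with open("result.csv", "w", newline="") and pass that file object to csv.writer().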