This is my text file. I want to convert it into two columns, speaker and comments, and save it as a CSV. The list is huge, so doing it programmatically would be helpful.
>bernardo11_5
Have you had quiet guard?
>francisco11_5
Not a mouse stirring.
>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,
The rivals of my watch, bid them make haste.
>francisco11_6
I think I hear them.--Stand, ho! Who is there?
>horatio11_1
Friends to this ground.
>marcellus11_1
And liegemen to the Dane.
CodePudding user response:
One way is to parse the file and load it into a DataFrame:

    import pandas as pd

    # read the file
    with open("test.txt") as fp:
        data = fp.readlines()

    # remove empty lines
    data = [x for x in data if x != "\n"]

    # separate into speaker and comments
    speaker = []
    comments = []
    speaker_text = ""
    for value in data:
        # speaker lines start with ">"; a ">" elsewhere in a comment is ignored
        if value.startswith(">"):
            speaker_text = value
        else:
            speaker.append(speaker_text)
            comments.append(value)

    # convert to dataframe
    df = pd.DataFrame({
        "speaker": speaker,
        "comments": comments
    })

    # save as csv
    df.to_csv("result.csv", index=False)
Output:

       speaker           comments
    0  >bernardo11_5\n   Have you had quiet guard?\n
    1  >francisco11_5\n  Not a mouse stirring.\n
    2  >bernardo11_6\n   Well, good night.\n
    3  >bernardo11_6\n   If you do meet Horatio and Marcellus,\n
    4  >bernardo11_6\n   The rivals of my watch, bid them make haste.\n
    5  >francisco11_6\n  I think I hear them.--Stand, ho! Who is there?\n
    6  >horatio11_1\n    Friends to this ground.\n
    7  >marcellus11_1\n  And liegemen to the Dane.\n
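The output above still carries the leading ">" and the trailing "\n" in every value. Here is a minimal sketch of the same loop with both stripped, using only the standard library. The function name parse_transcript, the sample lines and the in-memory buffer are illustrative assumptions, not part of the original answer.

```python
import csv
import io


def parse_transcript(lines):
    """Return (speaker, comment) pairs, one pair per comment line."""
    rows = []
    speaker = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            speaker = line[1:]  # drop the ">" marker
        else:
            rows.append((speaker, line))
    return rows


# Hypothetical sample, shaped like the lines readlines() would return.
sample = [
    ">bernardo11_6\n",
    "Well, good night.\n",
    "If you do meet Horatio and Marcellus,\n",
    ">francisco11_6\n",
    "I think I hear them.--Stand, ho! Who is there?\n",
]
rows = parse_transcript(sample)

# Write to an in-memory buffer; pass an open file instead to save to disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["speaker", "comments"])
writer.writerows(rows)
csv_text = buffer.getvalue()
```

When writing to a real file with the csv module, open it with newline="" so that line endings are not translated twice.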
CodePudding user response:
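Something like this?

    import re
    from pathlib import Path

    import pandas as pd

    # read the whole file into one string ("text" avoids shadowing the built-in input)
    text = Path("input.txt").read_text()

    # speaker names: everything after ">" on lines that start with it
    speaker = re.findall(r"^>(.*)", text, flags=re.MULTILINE)

    # comments: the text between speaker lines, with empty pieces dropped
    comments = re.split(r"^>.*", text, flags=re.MULTILINE)
    comments = [c.strip() for c in comments if c.strip()]

    df = pd.DataFrame({"speaker": speaker, "comments": comments})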
This will give you full comments including newline characters.
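To make that concrete, here is a small sketch of the findall/split pairing on a hard-coded sample, without pandas. The sample string and the names speakers, chunks and pairs are assumptions for illustration. The patterns are anchored to line starts with re.MULTILINE, which is a choice of this sketch.

```python
import re

# Hypothetical sample in the same layout as the question's file.
text = """>bernardo11_6
Well, good night.
If you do meet Horatio and Marcellus,
>francisco11_6
I think I hear them.--Stand, ho! Who is there?
"""

# One entry per line that starts with ">"
speakers = re.findall(r"^>(.*)", text, flags=re.MULTILINE)

# Text between speaker lines; the piece before the first speaker is empty
chunks = re.split(r"^>.*", text, flags=re.MULTILINE)
comments = [c.strip() for c in chunks if c.strip()]

# zip() silently truncates, so compare lengths first: a speaker line with
# no comment under it would otherwise shift every later pairing.
if len(speakers) != len(comments):
    raise ValueError("speaker and comment counts differ")
pairs = list(zip(speakers, comments))
```

The length check matters because a speaker line with nothing under it produces a speaker but no comment, and the two lists no longer line up.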
For saving:
a) replace '\n' before calling to_csv()

    df.comments = df.comments.str.replace('\n', '\\n')

b) save to a more suitable format, e.g., to_parquet()
c) split each comment into multiple rows, one per line (explode() returns a new DataFrame, so assign the result)

    df.comments = df.comments.str.split('\n')
    df = df.explode('comments')
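As an aside on saving: CSV can hold a multi-line value as long as the field is quoted, and Python's csv module quotes such fields by default. Whether the program that later reads the file copes with them is a separate question, so the options above may still be preferable. Here is a small round-trip sketch using only the standard library; the rows and variable names are made up for illustration.

```python
import csv
import io

# Hypothetical rows; the first comment spans three lines.
rows = [
    ("bernardo11_6",
     "Well, good night.\nIf you do meet Horatio and Marcellus,\n"
     "The rivals of my watch, bid them make haste."),
    ("francisco11_6", "I think I hear them.--Stand, ho! Who is there?"),
]

# newline="" leaves line-ending handling to the csv module
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)

# Read the same data back and compare with what was written
buffer.seek(0)
restored = [tuple(row) for row in csv.reader(buffer)]
```

Here restored should equal rows, embedded newlines included.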