I have a text file of about 7,000 sentences, one sentence per line. A sample of the data is shown below. I want to change the format and clean the data using Python.
(input.txt)
I\PP.sg.n.n am\VM.3.fut.sim.dcl.fin.n.n.n going\VER.0.gen.n.n to\NC.0.0.n.n school\JQ.n.n.crd .\PU
When\PPR.pl.1.0.n.n.n.n I\PP.0.y go\VM.0.0.0.0.nfn.n.n.n outside\NC.0.0.n.n ,\PU I\NST.0.n.n saw\NN.loc.n.n something\DAB.sg.y .\PU
I\PP.0.y eat\JQ.n.n.nnm rice\NC.0.loc.n.n .\PU
I want to convert the data above into the following format and save it as CSV.
(input.csv)
Sentences | Tags |
---|---|
I am going to school . | PP VM VER NC JQ PU |
When I go outside , I saw something . | PPR PP VM NC PU NST NN DAB PU |
I eat rice . | PP JQ NC PU |
I have tried several approaches, but none of them produces the desired format. Any help would be appreciated. Thanks in advance.
CodePudding user response:
Python Code:
txt = r"""
I\PP.sg.n.n am\VM.3.fut.sim.dcl.fin.n.n.n going\VER.0.gen.n.n to\NC.0.0.n.n school\JQ.n.n.crd .\PU
When\PPR.pl.1.0.n.n.n.n I\PP.0.y go\VM.0.0.0.0.nfn.n.n.n outside\NC.0.0.n.n ,\PU I\NST.0.n.n saw\NN.loc.n.n something\DAB.sg.y .\PU
I\PP.0.y eat\JQ.n.n.nnm rice\NC.0.loc.n.n .\PU
"""
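for line in txt.strip().split('\n'):
    words, tags = [], []
    for wordtag in line.strip().split():
        # Each token looks like word\TAG.feature.feature...
        splits = wordtag.split('\\', 1)
        words.append(splits[0])
        # Keep only the part of the tag before the first dot
        tags.append(splits[1].split('.')[0])
    # Print one quoted "sentence","tags" row per input line
    print(f"\"{' '.join(words)}\",\"{' '.join(tags)}\"")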
Output:
"I am going to school .","PP VM VER NC JQ PU"
"When I go outside , I saw something .","PPR PP VM NC PU NST NN DAB PU"
"I eat rice .","PP JQ NC PU"