I have a corpus of newspaper articles in a .txt file, and I'm trying to split it into sentences and write them to a .csv in order to annotate each sentence. I was told to use NLTK for this purpose, and I found the following code for sentence splitting:
import nltk
from nltk.tokenize import sent_tokenize
sent_tokenize("Here is my first sentence. And that's a second one.")
However, I'm wondering:
- How does one use a .txt file as input for the tokenizer (so that I don't have to just copy and paste everything), and
- How does one output a .csv file instead of just printing the sentences in my terminal?
CodePudding user response:
Reading a .txt file & tokenizing its sentences

Assuming the .txt file is located in the same folder as your Python script, you can read a .txt file and tokenize its sentences using NLTK as shown below:
from nltk import sent_tokenize

# sent_tokenize relies on NLTK's "punkt" model; if it isn't installed yet,
# run this once: import nltk; nltk.download("punkt")

with open("myfile.txt") as file:
    textFile = file.read()

tokenTextList = sent_tokenize(textFile)
print(tokenTextList)
# Output: ['Here is my first sentence.', "And that's a second one."]
Writing a list of sentence tokens to a .csv file

There are a number of options for writing a .csv file. Pick whichever is most convenient (e.g. if you already have pandas loaded, use the pandas option).
To write a .csv file using the pandas module:
import pandas as pd
df = pd.DataFrame(tokenTextList)
df.to_csv("myCSVfile.csv", index=False, header=False)
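One advantage of the pandas route is that fields containing commas are quoted automatically, so each sentence stays in a single cell. A quick sketch of the round trip (the sample sentences here are just placeholders):

```python
import pandas as pd

# Placeholder sentences; the first contains a comma on purpose,
# to show that to_csv quotes the field rather than splitting it
tokenTextList = ["First, a sentence with a comma.", "And a second one."]

df = pd.DataFrame(tokenTextList)
df.to_csv("myCSVfile.csv", index=False, header=False)

# Reading the file back yields one sentence per row, comma intact
print(pd.read_csv("myCSVfile.csv", header=None)[0].tolist())
# Output: ['First, a sentence with a comma.', 'And a second one.']
```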
To write a .csv file using the numpy module:
import numpy as np

# Note: savetxt does not quote fields, so a sentence that itself contains
# a comma will be split across columns; prefer pandas or csv in that case
np.savetxt("myCSVfile.csv", tokenTextList, delimiter=",", fmt="%s")
To write a .csv file using the csv module:
import csv

with open('myCSVfile.csv', 'w', newline='') as file:
    write = csv.writer(file, lineterminator='\n')
    # write.writerows([tokenTextList])  # would write one row with one sentence per column
    write.writerows([[token] for token in tokenTextList])  # one sentence per row (pandas-style output)
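Since the end goal is to annotate each sentence, it may also help to write an extra empty column next to each one, ready to be filled in later. A minimal sketch using only the csv module (the filename, header names, and sample sentences are assumptions, not from the original):

```python
import csv

# Placeholder for the output of sent_tokenize above
tokenTextList = ["Here is my first sentence.", "And that's a second one."]

with open("annotated.csv", "w", newline="") as file:
    writer = csv.writer(file)
    writer.writerow(["sentence", "annotation"])  # header row
    for token in tokenTextList:
        writer.writerow([token, ""])  # empty annotation cell to fill in later
```

Opening annotated.csv in a spreadsheet then gives one sentence per row with a blank annotation column beside it.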