Python regex.split: export each split word as a .txt file, named after the word, to a specified folder
Hello community! I'm learning Python and I'm trying to perform several actions on a text:
- Split the text with NLTK's regex.split
- Make regex.split skip empty results ('') and solo '-', but keep hyphenated words like 'word-hyphen'
- Export each split word as a .txt file, named after the word, to a specified folder (skipping empty results and solo '-' so that no empty files are created)
Step 1 is done :
# coding: utf-8
import nltk
import regex

s = "This sentence is in first place. This second sentence isn't in first place."
regex.split(r"[\s\.\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', '', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place', '']
Steps 2 and 3 are what I'm trying to do:
2. Skip empty results ('') and solo '-', but keep hyphenated words like 'word-hyphen'
Here is my attempt at step 2:
# coding: utf-8
import nltk
import regex

s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
regex.split(r"[\s\.;!?…»,«\,]", s)
['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', '-', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place', '']
3. Export each split word as a .txt file, named after the word, to a specified folder
Does anyone know how to do something like that?
Thanks for your help!
CodePudding user response:
You're not using nltk's regular expression engine. Maybe you want RegexpTokenizer?
Because you're not using variables and rely on this automatic printing, I'm guessing you're working in the command line or IDLE. You'll have to use variables for step 3, and at some point you'll have to use .py files too. Let's begin now; and if I'm wrong, sorry.
Since step 2 asks you not to have empty results, it indicates that you have a problem in step 1. Let's try with RegexpTokenizer then:
from nltk.tokenize import RegexpTokenizer

s = "This sentence is in first place. This second sentence isn't in first place."
tokenizer = RegexpTokenizer(r"[\s\.\,]", gaps=True)
split = tokenizer.tokenize(s)
print(split)
Output:
['This', 'sentence', 'is', 'in', 'first', 'place', 'This', 'second', 'sentence', "isn't", 'in', 'first', 'place']
No empty results here, we're good.
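For comparison, the standard library reaches the same result if you filter the empty strings out of a plain re.split by hand; a minimal sketch:

```python
import re

s = "This sentence is in first place. This second sentence isn't in first place."
# re.split leaves '' entries where two separators are adjacent (e.g. ". "),
# so a list comprehension drops them afterwards.
tokens = [t for t in re.split(r"[\s.,]", s) if t]
print(tokens)
```

This prints the same word list as the RegexpTokenizer call above; gaps=True simply does that filtering for you.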
For step 2, I don't understand your regex: just take the one from step 1, r"[\s\.\,]", and add the em dash: r"[\s\.\,—]":
from nltk.tokenize import RegexpTokenizer

s = "This sentence is in first place and contain a word-hyphen — Hello I am the second sentence and I'm in second place."
tokenizer = RegexpTokenizer(r"[\s\.\,—]", gaps=True)
split = tokenizer.tokenize(s)
print(split)
Output:
['This', 'sentence', 'is', 'in', 'first', 'place', 'and', 'contain', 'a', 'word-hyphen', 'Hello', 'I', 'am', 'the', 'second', 'sentence', 'and', "I'm", 'in', 'second', 'place']
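Note that the character class above only removes the em dash '—'; a plain hyphen-minus '-' standing alone (as in the output shown in the question) would still survive. If that case can occur in your texts, a small post-filter, sketched here on a hand-made token list, drops empty strings and bare dashes while keeping hyphenated words:

```python
# Example token list containing the problem cases: '' and bare dashes.
tokens = ['This', '', 'word-hyphen', '-', '—', "I'm", 'place']
# Keep only non-empty tokens that aren't a lone dash of either kind.
words = [t for t in tokens if t and t not in ('-', '—')]
print(words)
```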
For step 3, the simplest way should be this:
import os.path

path_to_files = 'C:\\Users\\username\\Desktop\\Split txt export'
for word in split:
    filename = word + '.txt'
    fullpath = os.path.join(path_to_files, filename)
    with open(fullpath, 'w') as f:
        f.write(word)