I am working to clean up a messy .txt file (text ID and raw text) for NLP analyses.
Currently it looks like:
@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ....
I would like to get it into a clean .txt or .csv format with each text on its own line and the ID separated from the text by a delimiter.
ID | text
0001 | words 83 words, 90, words, 8989!
0002 | words, 98 words; words.
0003 | words 30 words ....
The following code creates a .txt file where each text is on its own line:
with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    for line in file:
        for word in line.split('@@'):
            file2.write(word + '\n')
e.g.,
0001 words 83 words, 90, words, 8989!
0002 words, 98 words; words.
0003 words 30 words ....
However, I can't figure out how to add the delimiter since I can't match for either a specific series of integers or an integer length (e.g., 4 digits). Currently, I am trying to first add the delimiters through a regular expression and then split lines, but I am running into regex and file writing issues.
import re
with open('/filedirectory/file.txt', 'r') as file, open('/filedirectory/file_cleaned.txt', 'w') as file2:
    text = file.readlines()
    for line in text:
        text.re.split('^@\d{4,7}')
        for word in line.split('@@'):
            file2.write(word + '\n')
I get the error:
AttributeError: 'list' object has no attribute 're'
Any thoughts would be much appreciated. Thanks!
CodePudding user response:
Right, a list object has no attribute re.
You can use
import re
with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    file2.write(re.sub(r'@@\d+', r'\n\g<0> | ', file.read()).lstrip())
The regex matches @@ followed by one or more digits, and replaces each match with a line feed character, the whole match value, and a | character enclosed in single spaces.
See the Python demo:
import re
s = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ...."
print(re.sub(r'(@@\d+)', r'\n\1 | ', s).lstrip())
Output:
@@0001 | words 83 words, 90, words, 8989!
@@0002 | words, 98 words; words.
@@0003 | words 30 words ....
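If the @@ prefix should be dropped, as in the desired output in the question, one option is to capture only the digits. This is a minimal sketch, assuming every ID is a run of digits directly after @@ and that the text itself never contains @@ followed by digits; to_delimited is a hypothetical helper name, not something from the question.

```python
import re

def to_delimited(raw: str) -> str:
    """Turn '@@0001 text @@0002 text' into one 'ID | text' line per record."""
    # Capture only the digits after '@@' and swallow the whitespace on both
    # sides, so no stray spaces end up around the delimiter or at line ends.
    out = re.sub(r'\s*@@(\d+)\s*', r'\n\1 | ', raw)
    # The first record also gets a leading newline; strip it.
    return out.strip()

s = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ...."
print(to_delimited(s))
```

The same function can be applied to file.read() before writing to the output file. If the texts may themselves contain a | character, a different delimiter such as a tab is probably safer.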
CodePudding user response:
Regexes and strings are two different types. For reference, the Python documentation lists every method of the string type and of regex pattern objects.
You get the error because you are trying to access re, which is a module, as an attribute of an object of type list.
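To make the distinction concrete, here is a small sketch that reproduces the error and then calls re.split the way the module expects, with the string passed as an argument. The sample line is made up for illustration.

```python
import re

# Hypothetical sample data standing in for file.readlines()
lines = ["@@0001 first text @@0002 second text"]

# `re` is a module: its functions take the string as an argument.
# It is not an attribute of list (or str) objects, so this fails.
try:
    lines.re.split(r'@@\d+')
except AttributeError as exc:
    message = str(exc)
    print(message)

# Calling the module function on each string in the list works.
for line in lines:
    pieces = re.split(r'@@\d+', line)
    print(pieces)
```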
For your purposes, though:
- A string can be split on a literal separator with str.split (it does not interpret regex syntax),
- Alternatively, you can use the re module to split on a regex pattern.
However, what you are trying to do is combine both.
You can either split on the literal separator:
splitlines = line.split('@@')
Or, you can use regex:
import re
splitlines = re.compile(r'@@\d{4,7}').split(line)
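With re.split, putting a capturing group around the digits keeps the IDs in the result list, which makes it straightforward to pair each ID with its text. The following is a sketch under the assumption that IDs are 4 to 7 digits after @@; parse_records is a hypothetical helper, and the csv module is used here only as one possible way to write the delimited output.

```python
import csv
import io
import re

def parse_records(raw):
    """Return (id, text) pairs from '@@0001 text @@0002 text ...'."""
    # With a capturing group, re.split keeps the captured IDs in the result:
    # ['', '0001', ' text ', '0002', ' text ', ...]
    parts = re.split(r'@@(\d{4,7})', raw)
    ids, texts = parts[1::2], parts[2::2]
    return [(i, t.strip()) for i, t in zip(ids, texts)]

raw = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words."
records = parse_records(raw)

# Written to an in-memory buffer for illustration; a real file opened with
# newline='' can be passed to csv.writer in the same way.
buf = io.StringIO()
writer = csv.writer(buf, delimiter='|')
writer.writerow(['ID', 'text'])
writer.writerows(records)
print(buf.getvalue())
```

The csv writer quotes a field automatically if it happens to contain the delimiter, which plain string concatenation would not do.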