Split Rows and add Delimiter to text file using python

I am working to clean up a messy .txt file (text ID and raw text) for NLP analyses.

Currently it looks like:

@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ....

I would like to get it into a clean .txt or .csv format with each text on its own line and the ID separated from the text by a delimiter.

ID   | text 
0001 | words 83 words, 90, words, 8989!
0002 | words, 98 words; words. 
0003 | words 30 words ....

The following code creates a .txt file where each text is on its own line:

with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    for line in file:
        for word in line.split('@@'):
            file2.write(word + '\n')

e.g.,

0001 words 83 words, 90, words, 8989!
0002 words, 98 words; words. 
0003 words 30 words ....

However, I can't figure out how to add the delimiter since I can't match for either a specific series of integers or an integer length (e.g., 4 digits). Currently, I am trying to first add the delimiters through a regular expression and then split lines, but I am running into regex and file writing issues.

import re
with open('/filedirectory/file.txt', 'r') as file, open('/filedirectory/file_cleaned.txt', 'w') as file2:
    text = file1.readlines()
    for line in text:
        text.re.split('^@\d{4,7}')
        for word in line.split('@@'):
            file2.write(word + '\n')

I get the error:

AttributeError: 'list' object has no attribute 're'

Any thoughts would be much appreciated. Thanks!

CodePudding user response：

Right, readlines() returns a list of lines, and a list object has no attribute re. re is a module, so its functions are called with the string passed as an argument.

You can use

import re
with open('/file_directory/file.txt', 'r') as file, open('/file_directory/file_cleaned.txt', 'w') as file2:
    file2.write(re.sub(r'@@\d+', r'\n\g<0> | ', file.read()).lstrip())

The regex matches @@ and one or more digits, and replaces the matches with a line feed char, the whole match value, and a | char enclosed with single spaces.

See the Python demo:

import re
s = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ...."
print(re.sub(r'(@@\d+)', r'\n\1 | ', s).lstrip())

Output:

@@0001 |  words 83 words, 90, words, 8989! 
@@0002 |  words, 98 words; words. 
@@0003 |  words 30 words ....
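
If the goal is the exact layout from the question (no @@ prefix, one "ID | text" record per line), a capture group around the digits can drop the marker. This is only a minimal sketch: it assumes every ID is a run of digits directly after @@ and that the text itself never contains @@ followed by a digit. The function name to_delimited and the whitespace normalization are illustrative choices, not part of the original answer.

```python
import re

def to_delimited(raw, sep=" | "):
    """Turn '@@<id> <text> @@<id> <text> ...' into one 'id<sep>text' line per record."""
    # Each record: '@@', digits (captured as the ID), then text up to the
    # next '@@<digit>' marker or the end of the input.
    records = re.findall(r'@@(\d+)\s*(.*?)\s*(?=@@\d|$)', raw, flags=re.DOTALL)
    # Assumption: collapse internal whitespace (including newlines) so that
    # every record stays on a single output line.
    return "\n".join(f"{rec_id}{sep}{' '.join(text.split())}" for rec_id, text in records)

sample = "@@0001 words 83 words, 90, words, 8989! @@0002 words, 98 words; words. @@0003 words 30 words ...."
print(to_delimited(sample))
# 0001 | words 83 words, 90, words, 8989!
# 0002 | words, 98 words; words.
# 0003 | words 30 words ....
```

The result can be written to the cleaned file with file2.write(to_delimited(file.read())). If the text may contain a | character, a tab or another separator that does not occur in the data is probably a safer choice for sep.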

CodePudding user response：

Regex and Strings are two different types. For reference, see the Python documentation for the methods of the string type and of regex pattern objects.

You get the error because you are trying to call a regex method on an object of type list: text is the list returned by readlines(), not a string, and lists have no re attribute.
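
The error can be reproduced in isolation. The sketch below uses a hypothetical one-line list in place of the real readlines() result and shows that re functions take the string as an argument rather than being reached through the object:

```python
import re

# Stand-in for what readlines() returns: a list of strings, one per line.
lines = ["@@0001 a @@0002 b\n"]

try:
    lines.re.split(r'@@\d{4,7}')   # attribute lookup on the list fails here
except AttributeError as exc:
    message = str(exc)

# Working form: call the module function and pass one string from the list.
pieces = re.split(r'@@\d{4,7}', lines[0])

print(message)
print(pieces)
```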

For your purposes, though:

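A string can be split on a literal separator with str.split,
Alternatively, you can use the re module to split on a pattern.

However, what you are trying to do is combine both.

You can either split on the literal marker (str.split does not interpret regex syntax):

splitlines = line.split('@@')

Or, you can use regex, here splitting on @@ only when 4 to 7 digits follow so that the IDs stay in the pieces:

import re
splitlines = re.compile(r'@@(?=\d{4,7})').split(line)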
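
To see how the two calls differ in practice, here is a small sketch on a made-up line. str.split treats its argument as literal text, whereas re.split interprets it as a pattern, and a capture group in that pattern keeps the matched IDs in the result. The variable names are illustrative only.

```python
import re

line = "@@0001 first text @@0002 second text"

# str.split looks for the literal characters of its argument, so a regex
# pattern is never found and the line comes back unsplit.
literal = line.split(r'@@\d{4,7}')

# re.split interprets the pattern; the capture group keeps each ID.
parts = re.split(r'@@(\d{4,7})\s*', line)

# parts alternates ID, text after the leading empty string, so pair them up.
pairs = [(rec_id, text.strip()) for rec_id, text in zip(parts[1::2], parts[2::2])]

print(literal)  # ['@@0001 first text @@0002 second text']
print(parts)    # ['', '0001', 'first text ', '0002', 'second text']
print(pairs)    # [('0001', 'first text'), ('0002', 'second text')]
```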