I am writing code in which the user must define the separator used to split columns in a large text file (too large to hold in memory). The separator is stored in a text file that I read with the pandas module.
Consider the following test code:
import pandas
table_file = 'test.txt'
input_file = 'test_input.txt'
inputs = pandas.read_csv('test_input.txt', sep = '\n', dtype = str)
inputs = inputs['Inputs']
sepa = inputs[0]
print(sepa)
with open(table_file) as f:
    next(f)
    for line in f:
        line_stripped = line.rstrip('\n')
        line_list_1 = line_stripped.split(sepa)
        line_list_2 = line_stripped.split('\t')
        print(line_list_1)
        print(line_list_2)
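The file "test.txt" contains the following (columns are tab-separated, and the fifth and sixth fields of the data row are empty):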
A B C D E F G H I J
1 2 3 4 7 8 9 10
The file "test_input.txt" contains:
Inputs
\t
The following lines are printed:
\t
['1\t2\t3\t4\t\t\t7\t8\t9\t10']
['1', '2', '3', '4', '', '', '7', '8', '9', '10']
Why is "line_list_2" working correctly while "line_list_1" is not? How can I fix this? Note that the separator must be read in via an input file; I cannot simply define it in the code or read it in via console input. The separator might also be, for example, a comma or a space.
CodePudding user response:
Because one is an escape sequence and the other is a literal string. When you write '\t' in source code, Python interprets it as a single tab character, which is the separator actually used in your file. A value read from a text file that contains "\t" is not interpreted this way: it stays as two characters, a backslash followed by the letter t, and that pair never occurs in your data. Please have a look at this post.
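One way to bridge the two cases is to decode backslash escapes in the value read from the file before splitting. The following is a minimal sketch, not the asker's code: `decode_separator` is a hypothetical helper name, and it assumes the input file spells the separator the way a Python string literal would (`\t` for tab, a plain comma or space otherwise).

```python
def decode_separator(raw):
    # Interpret backslash escapes in text that came from a file.
    # backslashreplace keeps characters outside Latin-1 as \uXXXX escapes,
    # which the unicode_escape decoder then turns back into the originals.
    return raw.encode("latin-1", "backslashreplace").decode("unicode_escape")


from_file = "\\t"   # what reading the text file yields: backslash, then 't'
in_source = "\t"    # what the Python parser produces: one tab character

line = "1\t2\t3\t4\t\t\t7\t8\t9\t10"
print(len(from_file), len(in_source))           # lengths differ: 2 versus 1
print(line.split(from_file))                    # no split, the pair never occurs
print(line.split(decode_separator(from_file)))  # split on real tab characters
```

Plain separators such as a comma or a space pass through unchanged. A separator that is itself a single backslash would not decode this way and would need special handling.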
CodePudding user response:
As the answer above states, the text file supplies the literal characters "\t" and not the tab escape character. I tried to reproduce your error on Colab and found that, at least on that system, the code ran correctly.
# Create the two test files. The with blocks close the files on exit,
# so no explicit close() call is needed.
with open('test.txt', 'w') as f:
    f.write('A B C D E F G H I J \n'
            '1 2 3 4 7 8 9 10 ')
with open('test_input.txt', 'w') as g:
    g.write('Inputs\n'
            '\\t ')
import pandas
table_file = 'test.txt'
input_file = 'test_input.txt'
inputs = pandas.read_csv('test_input.txt', sep = '\n', dtype = str)
inputs = inputs['Inputs']
sepa = inputs[0]
print(sepa)
with open(table_file) as f:
    next(f)
    for line in f:
        line_stripped = line.rstrip('\n')
        line_list_1 = line_stripped.split(sepa)
        line_list_2 = line_stripped.split('\t')
        print(line_list_1)
        print(line_list_2)
Prints:
\t
['1 2 3 4 7 8 9 10 ']
['1 2 3 4 7 8 9 10 ']
Could you clarify the details of the system you are running the code on?
CodePudding user response:
So far this is the ugly workaround I've come up with. The content of "test_input.txt" was changed to:
Inputs
tab
Meanwhile, the code was changed to:
import pandas
table_file = 'test.txt'
input_file = 'test_input.txt'
inputs = pandas.read_csv('test_input.txt', sep = '\n', dtype = str)
inputs = inputs['Inputs']
sepa = inputs[0]
with open(table_file) as f:
    next(f)
    for line in f:
        line_stripped = line.rstrip('\n')
        if sepa == 'tab':
            line_list_1 = line_stripped.split('\t')
        else:
            line_list_1 = line_stripped.split(sepa)
        line_list_2 = line_stripped.split('\t')
        print(line_list_1)
        print(line_list_2)
The correct output is printed:
['1', '2', '3', '4', '', '', '7', '8', '9', '10']
['1', '2', '3', '4', '', '', '7', '8', '9', '10']
This works because tab is the only separator with an escape character that I am reasonably likely to run into in the context of what I'm doing, but it's not very satisfying.
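The keyword approach can be generalized with a lookup table, so the branch no longer has to live inside the loop. This is only a sketch of that idea: `NAMED_SEPARATORS` and `resolve_separator` are hypothetical names, and the set of keywords is an assumption rather than anything the code above defines.

```python
# Hypothetical keyword table; extend it with whatever names the input file may use.
NAMED_SEPARATORS = {
    "tab": "\t",
    "space": " ",
    "comma": ",",
}


def resolve_separator(value):
    # Return the character for a known keyword, otherwise the value itself,
    # so a literal "," or " " in the input file still works.
    return NAMED_SEPARATORS.get(value.strip().lower(), value)


sepa = resolve_separator("tab")  # resolve once, before looping over the file
print("1\t2\t3\t4\t\t\t7\t8\t9\t10".split(sepa))
```

Resolving the separator once up front keeps the per-line work to a single `split` call, which matters for a file too large to hold in memory.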