Identifying and removing extra tabs in a tab-delimited text file

I have text files with 120 columns and thousands of rows, where the delimiter is a tab. In some rows there is an extra tab, which makes that row appear to have 121 columns. The location of this extra tab is not necessarily the same in every text file.

I am wondering if anyone has any thoughts on efficiently locating the extra tab and removing it programmatically.

CodePudding user response：

You can use a regex as separator in read_csv.

Use '\t+' (one or more tabs), which treats a run of consecutive tabs as a single delimiter:

import pandas as pd

df = pd.read_csv('your_file.csv', sep=r'\t+', engine='python')
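
To see what a one-or-more-tabs separator does to a single row, here is a minimal sketch using only the standard re module. The sample rows are made up for illustration and are not from the original file. It also shows a caveat: an intentionally empty field is written as two adjacent tabs as well, so it gets collapsed too and the row ends up one column short.

```python
import re

# Hypothetical sample rows, not taken from the original file.
row_with_extra_tab = "a\tb\t\tc"    # stray tab after "b"; three real values
row_with_empty_field = "a\t\tc"     # middle field intentionally left empty

# Splitting on one or more tabs treats each run of tabs as one delimiter.
print(re.split(r"\t+", row_with_extra_tab))    # ['a', 'b', 'c']
print(re.split(r"\t+", row_with_empty_field))  # ['a', 'c'] (the empty field is lost)
```

So this approach is probably only safe if the files never contain empty fields.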

CodePudding user response：

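import re

# Read every line, collapsing each run of consecutive tabs into a single tab.
with open("file.csv", "r") as file:
    new_content = ""
    for line in file:
        # Remove only the newline; strip() would also delete leading/trailing tabs.
        line = line.rstrip("\n")
        # re.sub returns a new string, so the result has to be assigned.
        line = re.sub(r"\t+", "\t", line)
        new_content = new_content + line + "\n"

# Overwrite the original file with the cleaned content.
with open("file.csv", "w") as new_file:
    new_file.write(new_content)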

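You can read your CSV file as above, loop over each line, replace any doubled tabs with a single tab, and write the new content back to the same file. Note that this only removes an extra tab that sits directly next to another tab, and it will also collapse any field that was intentionally left empty.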
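
If you would rather locate the bad rows first and touch only those, something like the following might work. It is a rough sketch, not a definitive fix: the names find_bad_rows and repair_row are made up here, and the repair step assumes the stray tab sits directly next to a real delimiter. A row is repaired only when it has exactly one field too many and exactly one run of doubled tabs. Anything more ambiguous is left unchanged for manual review.

```python
import re

EXPECTED_COLUMNS = 120  # column count stated in the question


def find_bad_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (line_number, field_count) for rows with an unexpected field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        count = line.rstrip("\n").count("\t") + 1
        if count != expected:
            bad.append((number, count))
    return bad


def repair_row(line, expected=EXPECTED_COLUMNS):
    """Drop one tab from a row that has exactly one extra field.

    Assumption: the stray tab is adjacent to a real delimiter, so it shows
    up as a run of two or more tabs. Rows with several such runs are
    ambiguous and are returned unchanged.
    """
    body = line.rstrip("\n")
    ending = line[len(body):]  # keep the original line ending, if any
    has_one_extra = body.count("\t") + 1 == expected + 1
    runs = re.findall(r"\t{2,}", body)
    if has_one_extra and len(runs) == 1:
        body = body.replace("\t\t", "\t", 1)
    return body + ending


# Small demonstration with three columns instead of 120.
sample = ["a\tb\tc\n", "a\tb\t\tc\n", "a\t\tc\n"]
print(find_bad_rows(sample, expected=3))           # [(2, 4)]
print([repair_row(row, expected=3) for row in sample])
```

In the demonstration the third row keeps its empty middle field, because it already has the expected number of columns. Rows that find_bad_rows still reports after the repair need a closer look by hand.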