I'm working with GWAS data that has about 2 million columns and 522 rows. I need to replace every "00" value with "N/A" throughout the data. Because the file is huge, I'm reading it line by line with open(). Can anyone please help?
Note: the first row and the first column need to be skipped (left unchanged).
Sample data:
ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,00
200,00,TG,00,GT
300,AA,00,CG,AA
400,GG,CC,AA,TA
Desired Output:
ID,kgp11270025,kgp570033,rs707,kgp7500
1,CT,GT,CA,N/A
200,N/A,TG,N/A,GT
300,AA,N/A,CG,AA
400,GG,CC,AA,TA
The code I wrote:
import re

input_file = "test.csv"
output_file = "testresult.csv"
# print("Processing data from", input_file)
with open(input_file) as f:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if lineno == 1:
            # need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip())
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line)
# print("Processed {} lines".format(lineno))
I tried this, but it is not working. Please help!
CodePudding user response:
Since you say that "when I use print(line), it's showing fine", just use the file keyword argument of print, as follows:
import re

input_file = "test.csv"
output_file = "testresult.csv"
# print("Processing data from", input_file)
with open(input_file) as f, open(output_file, "w") as g:
    lineno = 0
    for line in f:
        lineno = lineno + 1
        if lineno == 1:
            # need to skip first line
            # print("Skipping line 1 which is a header")
            print(line.rstrip(), file=g)
        else:
            # print("Processing line {}".format(lineno))
            line = re.sub(r',00', ',N/A', line.rstrip())
            print(line, file=g)
# print("Processed {} lines".format(lineno))
Note that for the input file the name alone is sufficient, because the default mode is read-text, but the write mode ("w") must be specified explicitly for the output file.
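The pattern ',00' matches any field that merely starts with "00", and it relies on the ID column never following a comma. As an alternative, here is a minimal sketch that uses the standard csv module to compare whole cells and to leave the header row and the first column untouched. The function name replace_missing and its parameters are illustrative assumptions, not part of the code above.

```python
import csv


def replace_missing(input_path, output_path, missing="00", replacement="N/A"):
    """Stream a CSV row by row, replacing cells equal to `missing`.

    The header row and the first (ID) column are copied through unchanged.
    Names and defaults here are hypothetical, chosen for illustration.
    """
    with open(input_path, newline="") as src, \
            open(output_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst, lineterminator="\n")

        header = next(reader, None)
        if header is None:
            return  # empty input: leave the output file empty
        writer.writerow(header)

        for row in reader:
            if not row:
                continue  # skip blank lines
            # Keep the first cell as-is; replace only exact matches elsewhere.
            cleaned = row[:1] + [
                replacement if cell == missing else cell for cell in row[1:]
            ]
            writer.writerow(cleaned)
```

Calling replace_missing("test.csv", "testresult.csv") on the sample data from the question should give the desired output. Only one row is held in memory at a time, which keeps memory use modest even with a very wide file.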
CodePudding user response:
You could use pandas to do this easily:
import pandas as pd

df = pd.read_csv('test.csv', dtype=str)  # read every column as text
df = df.replace('00', 'N/A')             # replaces only cells equal to '00'
df.to_csv('test-result.csv', index=False)
For very large CSV files, you could do this:
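header = True
# your_chunk_size is a placeholder: set it to the number of rows to read per chunk
for chunk in pd.read_csv('test.csv', chunksize=your_chunk_size, dtype=str):
    chunk = chunk.replace('00', 'N/A')
    # mode='a' appends, so remove any existing test-result.csv before running
    chunk.to_csv('test-result.csv', index=False, header=header, mode='a')
    header = False  # write the column names only with the first chunk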