I have a genomic dataset consisting of more than 3500 rows. I need to remove rows based on the values in two columns ("Length" and "Protein Name"). How do I specify the condition for this?
import csv #importing the csv module or method
#opening a new csv file
file = open('C:\\Users\\Admin\\Downloads\\csv.csv', 'r')
type(file)
#reading the csv file
csvreader = csv.reader(file)
header = []
header = next(csvreader)
print(header)
#extracting rows from the csv file
rows = []
for row in csvreader:
    rows.append(row)
print(rows)
I am a beginner in Python bioinformatics data analysis and I haven't tried any extensive methods. I have opened and read the CSV file, and I have also extracted the column headers, but I don't know how to proceed from here. Please help.
CodePudding user response:
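Try this. Note that it only works on a pandas DataFrame (here called df, for example one loaded with pd.read_csv), not on a csv.reader object:
df = df[df["columnName"].str.contains("string to delete") == False]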
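A minimal sketch of how such a str.contains mask behaves on a pandas DataFrame, using a small made-up table in place of the real file. The column values and the search string "hypothetical" are assumptions for illustration only. str.contains returns a boolean Series and ~ inverts it, so only non-matching rows are kept; na=False treats missing names as non-matching, and regex=False makes the match a plain substring test.

```python
import pandas as pd

# Made-up sample data standing in for the real CSV (assumption).
df = pd.DataFrame({
    "Protein Name": ["hypothetical protein", "DNA polymerase", None, "kinase"],
    "Length": [120, 900, 45, 300],
})

# True where the name contains the substring; missing values count as False.
mask = df["Protein Name"].str.contains("hypothetical", na=False, regex=False)

# Keep only the rows that do NOT match.
filtered = df[~mask]
print(filtered)
```

Whether rows with a missing name should be kept or dropped depends on your data; change na=False to na=True to drop them as well.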
CodePudding user response:
It will be better to read the CSV with pandas, since you have a lot of rows. Set the values you want to filter on as variables, then use them in the condition. If this does not help, I suggest you provide some sample data from your CSV file.
import pandas as pd
df = pd.read_csv('C:\\Users\\Admin\\Downloads\\csv.csv')
length = 10
protein_name = "replace with protein name"
# keep rows longer than `length` whose name is not `protein_name`
df = df[(df["Length"] > length) & (df["Protein Name"] != protein_name)]
print(df)
You can then convert this back to a CSV file if you want:
df.to_csv('C:\\Users\\Admin\\Downloads\\new_csv.csv', index=False)
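If you would rather stay with the csv module from the question, the same kind of filter can be sketched with csv.DictReader. This is only an illustration: the sample rows, the threshold of 10 and the name "hypothetical protein" are assumptions, and io.StringIO stands in for the real input and output files.

```python
import csv
import io

# Made-up sample standing in for the real file (assumption).
sample = io.StringIO(
    "Protein Name,Length\n"
    "hypothetical protein,120\n"
    "DNA polymerase,900\n"
    "short peptide,8\n"
)

min_length = 10                          # assumed threshold
unwanted_name = "hypothetical protein"   # assumed name to drop

reader = csv.DictReader(sample)
# Keep rows longer than the threshold whose name is not the unwanted one.
kept = [
    row for row in reader
    if int(row["Length"]) > min_length and row["Protein Name"] != unwanted_name
]

# Write the surviving rows back out as CSV text.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
writer.writeheader()
writer.writerows(kept)
print(out.getvalue())
```

With a real file you would replace the StringIO objects with open(...) calls, passing newline='' as the csv module documentation recommends.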