I'm working with the CSV file shown below:
strain contig homology
1A42 ctg.s1.000000F Chr
1A42 ctg.s1.000001F pSymA
1A42 ctg.s1.3 pSymB
1A42 ctg.s2.000000F Other
4B41 ctg.s1.000000F Chr
4B41 ctg.s1.3 pSymA
4B41 ctg.s1.1 pSymB
7B22 ctg.s2.12 other
7B22 ctg.s1.000000F Chr
7B22 ctg.s1.3 pSymA
7B22 ctg.s1.1 pSymB
8A52 ctg.s1.0 pSymB
8A52 ctg.s1.4 Chr
8A52 ctg.s1.2 pSymA
In the contig column, some strings are repeated among the different strains of the strain column. For example, ctg.s1.000000F is present in 1A42, 4B41 and 7B22.
I wrote the following lines of code to define a function that, given the strain name and the CSV file as input, prints the corresponding value in the homology column for each contig value:
import pandas as pd

def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
        match = h_df.loc[(h_df == strain).any(1), 'contig']
        for element in match:
            contig = h_df.loc[(h_df == element).any(1), 'homology']
            print(element, contig)

myfunction('1A42', 'mycsv')
It runs, but for each contig it returns the homology values from the entire column, not only the ones related to "1A42".
How can I restrict the output to the given strain? Thank you.
CodePudding user response:
Try this (simpler) approach:
def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
    print(h_df[h_df['strain'] == strain][['homology', 'contig']])
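To try the filter without a file on disk, here is a minimal sketch that reads a few of the question's rows from an in-memory string. It assumes the real file is semicolon-separated, as the sep=";" argument suggests; the name homology_for_strain and the SAMPLE text are illustrative and not part of the original code.

```python
import io
import pandas as pd

# A few rows from the question, assumed to be semicolon-separated.
SAMPLE = """strain;contig;homology
1A42;ctg.s1.000000F;Chr
1A42;ctg.s1.000001F;pSymA
1A42;ctg.s1.3;pSymB
4B41;ctg.s1.000000F;Chr
4B41;ctg.s1.3;pSymA
"""

def homology_for_strain(strain, source):
    """Return the contig/homology rows that belong to one strain."""
    df = pd.read_csv(source, sep=";", dtype=str)
    # Boolean mask on the 'strain' column only, so rows from other
    # strains that share a contig name are excluded.
    return df.loc[df["strain"] == strain, ["contig", "homology"]]

result = homology_for_strain("1A42", io.StringIO(SAMPLE))
print(result.to_string(index=False))
```

Returning the filtered frame instead of printing it inside the function makes the result easier to reuse or test; the caller can still print it.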
CodePudding user response:
If you want to print only the homology values corresponding to the strain you're interested in, you can use the following code:
def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
    match = h_df.loc[(h_df['strain'] == strain), 'contig']
    for element in match:
        contig = h_df.loc[(h_df['contig'] == element) & (h_df['strain'] == strain), 'homology']
        print(element, contig.iloc[0])

myfunction('1A42', 'mycsv')
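The loop in the question over-matched because (h_df == element).any(1) selects every row that contains the contig name, whatever the strain, and names such as ctg.s1.3 occur in several strains. The sketch below shows the difference on a small hand-built frame in the shape of the question's data; the variable names are illustrative only.

```python
import pandas as pd

# Hand-built rows mirroring the question's columns.
df = pd.DataFrame({
    "strain": ["1A42", "1A42", "4B41", "7B22"],
    "contig": ["ctg.s1.000000F", "ctg.s1.3", "ctg.s1.000000F", "ctg.s1.3"],
    "homology": ["Chr", "pSymB", "Chr", "pSymA"],
})

# Matching on the contig name alone picks up rows from every strain.
by_contig_only = df.loc[df["contig"] == "ctg.s1.3", "homology"]

# Adding the strain condition narrows the match to the intended row.
by_both = df.loc[
    (df["contig"] == "ctg.s1.3") & (df["strain"] == "1A42"), "homology"
]

print(list(by_contig_only))
print(list(by_both))
```

Combining both conditions with & is what keeps each lookup tied to the requested strain.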