Pandas: indexing a multidimensional key

Time:01-31

I'm working with the CSV file shown below:

strain  contig          homology
1A42    ctg.s1.000000F  Chr
1A42    ctg.s1.000001F  pSymA
1A42    ctg.s1.3        pSymB
1A42    ctg.s2.000000F  Other
4B41    ctg.s1.000000F  Chr
4B41    ctg.s1.3        pSymA
4B41    ctg.s1.1        pSymB
7B22    ctg.s2.12       other
7B22    ctg.s1.000000F  Chr
7B22    ctg.s1.3        pSymA
7B22    ctg.s1.1        pSymB
8A52    ctg.s1.0        pSymB
8A52    ctg.s1.4        Chr
8A52    ctg.s1.2        pSymA

In the contig column, some strings are repeated across different strains of the strain column. For example, ctg.s1.000000F is present in 1A42, 4B41 and 7B22.

I wrote the following function which, given a strain name and the CSV file as input, should print the corresponding value in the homology column for each contig of that strain:

import pandas as pd

def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
        match = h_df.loc[(h_df == strain).any(axis=1), 'contig']
        for element in match:
            contig = h_df.loc[(h_df == element).any(axis=1), 'homology']
            print(element, contig)


myfunction('1A42', 'mycsv')

It runs, but for each contig it returns the homology values from the entire column, not only the ones related to "1A42".

How can I restrict the output to the given strain? Thank you.

CodePudding user response:

Try this (simpler) approach:

def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
    print(h_df[h_df['strain'] == strain][['homology', 'contig']])
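
For anyone who wants to try the filter without the original file, here is a minimal sketch of the same idea. It assumes the data is semicolon-separated (as the `sep=";"` in the question suggests) and uses a few of the sample rows held in a string; `SAMPLE` and `homology_for_strain` are names made up for this illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: a subset of the question's rows,
# written with ';' as the separator (an assumption based on sep=";").
SAMPLE = """strain;contig;homology
1A42;ctg.s1.000000F;Chr
1A42;ctg.s1.000001F;pSymA
1A42;ctg.s1.3;pSymB
1A42;ctg.s2.000000F;Other
4B41;ctg.s1.000000F;Chr
4B41;ctg.s1.3;pSymA
4B41;ctg.s1.1;pSymB
"""


def homology_for_strain(strain, source):
    """Return the contig/homology rows whose 'strain' column equals `strain`."""
    df = pd.read_csv(source, sep=";", dtype=str)
    # The mask is built from the 'strain' column only, so a contig name that
    # also occurs in other strains cannot pull in their rows.
    mask = df["strain"] == strain
    return df.loc[mask, ["contig", "homology"]]


result = homology_for_strain("1A42", io.StringIO(SAMPLE))
print(result.to_string(index=False))
```

Returning the filtered frame instead of printing inside the function is only a design choice for this sketch; it makes the result easier to reuse or check.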

CodePudding user response:

If you want to print only the homology values corresponding to the strain you're interested in, you can use the following code:

def myfunction(strain, csv):
    with open(csv, 'r') as h:
        h_df = pd.read_csv(h, index_col=False, dtype='unicode', on_bad_lines='skip', sep=";")
        match = h_df.loc[(h_df['strain'] == strain), 'contig']
        for element in match:
            contig = h_df.loc[(h_df['contig'] == element) & (h_df['strain'] == strain), 'homology']
            print(element, contig.iloc[0])

myfunction('1A42', 'mycsv')
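
The original lookup matched each contig name against the whole table, so a contig shared by several strains returned the rows of all of them. Combining both conditions, as above, fixes that. Since the title asks about a multidimensional key, another option is to treat the (strain, contig) pair as a two-level index. The following is only a sketch under the same assumptions as the question (semicolon-separated data); the `SAMPLE` string is a made-up stand-in for the file and holds a subset of the rows.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file (subset of the question's rows).
SAMPLE = """strain;contig;homology
1A42;ctg.s1.000000F;Chr
1A42;ctg.s1.000001F;pSymA
1A42;ctg.s1.3;pSymB
1A42;ctg.s2.000000F;Other
4B41;ctg.s1.000000F;Chr
4B41;ctg.s1.3;pSymA
4B41;ctg.s1.1;pSymB
"""

df = pd.read_csv(io.StringIO(SAMPLE), sep=";", dtype=str)

# Use (strain, contig) as a two-level key and keep 'homology' as the values.
# Sorting is optional for this small sample.
indexed = df.set_index(["strain", "contig"])["homology"].sort_index()

# A full (strain, contig) key selects a single value when the pair is unique.
single = indexed.loc[("4B41", "ctg.s1.3")]
print(single)

# A partial key (strain only) selects every contig of that strain.
per_strain = indexed.loc["1A42"]
for contig, homology in per_strain.items():
    print(contig, homology)
```

This relies on each (strain, contig) pair occurring only once, which holds for the sample shown in the question but should be checked on the real file.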