pandas: merge strings when other columns satisfy a condition

Time:11-30

I have a table:

genome                  start  end  strand  etc
GUT_GENOME270877.fasta  98     396
GUT_GENOME270877.fasta  384    574  -
GUT_GENOME270877.fasta  593    984
GUT_GENOME270877.fasta  991    999  -

I'd like to make a new table with a coordinates column that joins the start and end columns, like this:

genome                  start  end  strand  etc  coordinates
GUT_GENOME270877.fasta  98     396               98..396
GUT_GENOME270877.fasta  384    574  -            complement(384..574)
GUT_GENOME270877.fasta  593    984               593..984
GUT_GENOME270877.fasta  991    999  -            complement(991..999)

That is, if there's a - in the strand column, I don't want just

df['coordinates'] = df['start'].astype(str) + '..' + df['end'].astype(str)

but to add brackets and complement, like this:

df['coordinates'] = 'complement(' + df['start'].astype(str) + '..' + df['end'].astype(str) + ')'

The only thing I'm missing is how to introduce the condition.

CodePudding user response:

You can use numpy.where:

import numpy as np

m = df['strand'].eq('-')  # True where the strand is '-'

df['coordinates'] = (np.where(m, 'complement(', '')
                     + df['start'].astype(str) + '..' + df['end'].astype(str)
                     + np.where(m, ')', '')
                    )

Or boolean indexing:

m = df['strand'].eq('-')

df['coordinates'] = df['start'].astype(str) + '..' + df['end'].astype(str)

df.loc[m, 'coordinates'] = 'complement(' + df.loc[m, 'coordinates'] + ')'

Output:

                   genome  start  end strand           coordinates
0  GUT_GENOME270877.fasta     98  396                      98..396
1  GUT_GENOME270877.fasta    384  574      -  complement(384..574)
2  GUT_GENOME270877.fasta    593  984                     593..984
3  GUT_GENOME270877.fasta    991  999      -  complement(991..999)
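
For anyone who wants to try both approaches end to end, here is a minimal self-contained sketch. The sample frame is rebuilt by hand from the question's table, and it assumes blank strand cells were loaded as empty strings (NaN would behave the same, since `.eq('-')` returns False for it). The names `span`, `prefix`, `suffix` and `via_where` are only illustrative; wrapping the `np.where` results in a Series is a small variation on the answer that keeps index alignment explicit.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the question; "" stands in for a blank strand
# cell (an assumption about how the original file was read).
df = pd.DataFrame({
    "genome": ["GUT_GENOME270877.fasta"] * 4,
    "start": [98, 384, 593, 991],
    "end": [396, 574, 984, 999],
    "strand": ["", "-", "", "-"],
})

m = df["strand"].eq("-")  # True on rows whose strand is '-'
span = df["start"].astype(str) + ".." + df["end"].astype(str)

# Approach 1: numpy.where chooses a per-row prefix and suffix.
prefix = pd.Series(np.where(m, "complement(", ""), index=df.index)
suffix = pd.Series(np.where(m, ")", ""), index=df.index)
via_where = prefix + span + suffix

# Approach 2: build the plain span, then wrap only the masked rows.
df["coordinates"] = span
df.loc[m, "coordinates"] = "complement(" + df.loc[m, "coordinates"] + ")"

print(df)
```

Both routes should produce the same column; boolean indexing touches only the '-' rows, while `numpy.where` builds the whole column in one expression.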