Home > Net >  Add a new column with value from another column matching a pattern
Add a new column with value from another column matching a pattern

Time:05-28

I have a dataframe such as :

COL1 COL2 
A    eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus
B    viruses; Retroviridae
C    viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus
D    Unclassified; RNA virus 

And I would like to parse the COL2 column with element separated by ";" and add a COL3 column with the element containing "viridae" for each row.

I should then get:

COL1 COL2                                                           COL3
A    eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus     carnivoridae
B    viruses; Retroviridae                                          Retroviridae
C    viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus Phenuiviridae
D    Unclassified; RNA virus                                        NA

Does someone have an idea please ?

Here is the dataframe in dict format if it can helps

{'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2 ': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}

CodePudding user response:

You could do something like this:

import pandas as pd
import re

df = {'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}

df = pd.DataFrame(df)

You can then use the following method: First tranform the column into a column of lists

df['COL2_list']= df['COL2'].str.split(';')
df = df.reset_index()

Then parse through every row of the df to find th strings you want (here I choose 'ridae'):

DF = []
for i in range(len(df)):
    a = df[df.index==i]
    b = [string for string in a['COL2_list'][i] if 'ridae' in string] 
    a = np.where(len(b)!=0, b,'NAN')
    DF.append(a)

DF = pd.DataFrame(DF, columns = ['COL3'])
DF

which gives you

 COL3
0    carnivoridae
1    Retroviridae
2   Phenuiviridae
3            None

and then concatenate the results:

Full = pd.concat([df,DF], axis=1)

which gives:

index COL1                                               COL2  \
0      0    A  eucaryotes; mammal; carnivoridae; carnivorinae...   
1      1    B                              viruses; Retroviridae   
2      2    C  viruses; mononegavirales; Phenuiviridae; Ascov...   
3      3    D                           Unclassified; RNA virus    

                                           COL2_list            COL3  
0  [eucaryotes,  mammal,  carnivoridae,  carnivor...    carnivoridae  
1                           [viruses,  Retroviridae]    Retroviridae  
2  [viruses,  mononegavirales,  Phenuiviridae,  A...   Phenuiviridae  
3                        [Unclassified,  RNA virus ]            None  

Slightly different from what you wrote but it is because you didn't spell things right.

CodePudding user response:

To match your example:

df.assign(COL3=df['COL2'].str.extract('(\w v[io]ridae)'))

  COL1                                               COL2           COL3
0    A  eucaryotes; mammal; carnivoridae; carnivorinae...   carnivoridae
1    B                              viruses; Retroviridae   Retroviridae
2    C  viruses; mononegavirales; Phenuiviridae; Ascov...  Phenuiviridae
3    D                            Unclassified; RNA virus            NaN

To match your stated requirement of finding words ending with viridae:

df.assign(COL3=df['COL2'].str.extract('(\w viridae)'))

  COL1                                               COL2           COL3
0    A  eucaryotes; mammal; carnivoridae; carnivorinae...            NaN
1    B                              viruses; Retroviridae   Retroviridae
2    C  viruses; mononegavirales; Phenuiviridae; Ascov...  Phenuiviridae
3    D                            Unclassified; RNA virus            NaN
  • Related