I have a dataframe such as :
COL1 COL2
A eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus
B viruses; Retroviridae
C viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus
D Unclassified; RNA virus
And I would like to parse the COL2 column with element separated by ";"
and add a COL3 column with the element containing "viridae" for each row.
I should then get:
COL1 COL2 COL3
A eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus carnivoridae
B viruses; Retroviridae Retroviridae
C viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus Phenuiviridae
D Unclassified; RNA virus NA
Does someone have an idea please ?
Here is the dataframe in dict format if it can helps
{'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2 ': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}
CodePudding user response:
You could do something like this:
import pandas as pd
import re
df = {'COL1': {0: 'A', 1: 'B', 2: 'C', 3: 'D'}, 'COL2': {0: 'eucaryotes; mammal; carnivoridae; carnivorinae; carnivorus', 1: 'viruses; Retroviridae', 2: 'viruses; mononegavirales; Phenuiviridae; Ascovirinae; Reovirus', 3: 'Unclassified; RNA virus '}}
df = pd.DataFrame(df)
You can then use the following method: First tranform the column into a column of lists
df['COL2_list']= df['COL2'].str.split(';')
df = df.reset_index()
Then parse through every row of the df to find th strings you want (here I choose 'ridae'):
DF = []
for i in range(len(df)):
a = df[df.index==i]
b = [string for string in a['COL2_list'][i] if 'ridae' in string]
a = np.where(len(b)!=0, b,'NAN')
DF.append(a)
DF = pd.DataFrame(DF, columns = ['COL3'])
DF
which gives you
COL3
0 carnivoridae
1 Retroviridae
2 Phenuiviridae
3 None
and then concatenate the results:
Full = pd.concat([df,DF], axis=1)
which gives:
index COL1 COL2 \
0 0 A eucaryotes; mammal; carnivoridae; carnivorinae...
1 1 B viruses; Retroviridae
2 2 C viruses; mononegavirales; Phenuiviridae; Ascov...
3 3 D Unclassified; RNA virus
COL2_list COL3
0 [eucaryotes, mammal, carnivoridae, carnivor... carnivoridae
1 [viruses, Retroviridae] Retroviridae
2 [viruses, mononegavirales, Phenuiviridae, A... Phenuiviridae
3 [Unclassified, RNA virus ] None
Slightly different from what you wrote but it is because you didn't spell things right.
CodePudding user response:
To match your example:
df.assign(COL3=df['COL2'].str.extract('(\w v[io]ridae)'))
COL1 COL2 COL3
0 A eucaryotes; mammal; carnivoridae; carnivorinae... carnivoridae
1 B viruses; Retroviridae Retroviridae
2 C viruses; mononegavirales; Phenuiviridae; Ascov... Phenuiviridae
3 D Unclassified; RNA virus NaN
To match your stated requirement of finding words ending with viridae:
df.assign(COL3=df['COL2'].str.extract('(\w viridae)'))
COL1 COL2 COL3
0 A eucaryotes; mammal; carnivoridae; carnivorinae... NaN
1 B viruses; Retroviridae Retroviridae
2 C viruses; mononegavirales; Phenuiviridae; Ascov... Phenuiviridae
3 D Unclassified; RNA virus NaN