I have a dataframe that with columns of organisms that contain lists of gene names that are associated with a particular orthogroup. I also have a list of candidate genes that I would like to interrogate this dataframe with.
What I am trying to do is iterate through the dataframe and only keep the part of the string that matches any of the candidate genes in the gene list. If there are no matches for any of the strings in a row, I'd like to just drop that row entirely.
What I have tried: I tried using the solution linked here, which works perfectly, but only for one column - I want to do this for all columns.
I have also tried to use some combinations of np.where() without any yield.
Reproducible data:
Dataframe example:
{'Orthogroup': {0: 'OG0000000',
1: 'OG0000001',
2: 'OG0000002',
3: 'OG0000003',
4: 'OG0000004'},
'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1, WP_113526628.1',
1: 'WP_009560941.1, WP_009567366.1, WP_012536141.1, WP_012536296.1, WP_012536581.1, WP_012537464.1, WP_012537498.1, WP_041645259.1, WP_229129755.1',
2: nan,
3: 'WP_009562369.1',
4: 'WP_009564938.1, WP_012535757.1, WP_012535999.1, WP_012536379.1, WP_012536404.1, WP_012536894.1, WP_012537015.1, WP_012537128.1, WP_232027572.1'},
'acidithiobacillus_thiooxidans': {0: 'WP_010639275.1, WP_010639739.1, WP_010640967.1, WP_029316246.1, WP_157999017.1, WP_157999033.1, WP_157999054.1',
1: 'WP_010637203.1, WP_010637257.1, WP_010639175.1, WP_010639301.1, WP_010639864.1, WP_010639966.1, WP_010639971.1, WP_010640895.1, WP_010641185.1, WP_050816103.1, WP_226825859.1, WP_226826127.1',
2: 'WP_235179950.1',
3: 'WP_010639365.1, WP_010642423.1',
4: 'WP_010637024.1, WP_010637323.1, WP_010637554.1, WP_010637935.1, WP_010639011.1, WP_010639453.1, WP_010640567.1, WP_010642114.1, WP_024893638.1'},
'burkholderia_pseudomallei': {0: 'WP_004185903.1, WP_004190283.1, WP_004528082.1, WP_004530921.1, WP_004534726.1, WP_011204972.1',
1: 'WP_004187662.1, WP_004187694.1, WP_004188321.1, WP_004189039.1, WP_004192300.1, WP_004198029.1, WP_004526108.1, WP_004538115.1, WP_004551242.1, WP_004553906.1, WP_004554106.1, WP_009931394.1',
2: 'WP_004187329.1, WP_004191606.1, WP_004195354.1, WP_004524029.1, WP_004524162.1, WP_004524990.1, WP_004525719.1, WP_004526156.1, WP_004529761.1, WP_004534748.1, WP_004552066.1, WP_004552880.1, WP_004553008.1, WP_004553113.1, WP_004554092.1, WP_014696855.1, WP_024431032.1, WP_024431563.1, WP_038742977.1, WP_038794545.1, WP_038803065.1',
3: 'WP_004185837.1, WP_004187646.1, WP_004188318.1, WP_004189230.1, WP_004189405.1, WP_004192049.1, WP_004192063.1, WP_004194581.1, WP_004197652.1, WP_004198340.1, WP_004266649.1, WP_004521296.1, WP_004522591.1, WP_004522925.1, WP_004523005.1, WP_004524197.1, WP_004525408.1, WP_004525453.1, WP_004528137.1, WP_004529948.1, WP_004531197.1, WP_004536377.1, WP_004541514.1, WP_004543954.1, WP_004553290.1, WP_004553756.1, WP_004553993.1, WP_011204859.1',
4: 'WP_004193481.1, WP_004195334.1, WP_004195802.1, WP_004196705.1, WP_004196791.1, WP_004200681.1, WP_004521379.1, WP_004524701.1, WP_004526731.1, WP_004529133.1, WP_004535483.1, WP_004537932.1, WP_004552076.1, WP_038729337.1'},
'e_coli': {0: 'WP_000043761.1, WP_000859945.1, WP_000969032.1, WP_001266293.1, WP_001301264.1, WP_001310896.1, WP_001360132.1',
1: 'WP_000125282.1, WP_000148503.1, WP_000417791.1, WP_000941041.1, WP_001188777.1, WP_001251544.1, WP_001295369.1, WP_001301108.1, WP_001301332.1',
2: 'WP_000483239.1, WP_000919536.1, WP_001098559.1, WP_001297437.1',
3: 'WP_000164036.1, WP_000357790.1, WP_000440317.1, WP_000648572.1, WP_000817708.1, WP_000935206.1, WP_001061575.1, WP_001242684.1',
4: 'WP_000365791.1, WP_000423261.1, WP_000442949.1, WP_000548294.1, WP_000952503.1, WP_001011462.1, WP_001019525.1, WP_001109794.1, WP_001300638.1, WP_001350493.1'}}
Candidate genes example:
["WP_012536716.1","WP_004196791.1","WP_001297437.1"]
Ideal output: Ideally, this should look like:
{'Orthogroup': {0: 'OG0000000',
2: 'OG0000002',
4: 'OG0000004'},
'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1',
2: nan,
4: nan},
'acidithiobacillus_thiooxidans': {0: nan,
2: nan,
4: nan},
'burkholderia_pseudomallei': {0: nan,
2: nan,
4: 'WP_004196791.1'},
'e_coli': {0: nan,
2: 'WP_001297437.1',
4: nan}}
Any help is greatly appreciated!
Thanks.
CodePudding user response:
Use custom lambda function with DataFrame.applymap
for filtering splitted values by candidates
, if no match get empty strings, so last add DataFrame.replace
for replace empty strings to missing values:
candidates = set(["WP_012536716.1","WP_004196791.1","WP_001297437.1"])
f = lambda x: ', '.join(y for y in x.split(', ') if y in candidates)
df = (df.set_index('Orthogroup')
.fillna('')
.applymap(f)
.replace('', np.nan)
.dropna(how='all')
.reset_index())
print (df)
Orthogroup acidithiobacillus_ferrooxidans acidithiobacillus_thiooxidans \
0 OG0000000 WP_012536716.1 NaN
1 OG0000002 NaN NaN
2 OG0000004 NaN NaN
burkholderia_pseudomallei e_coli
0 NaN NaN
1 NaN WP_001297437.1
2 WP_004196791.1 NaN
CodePudding user response:
You have no choice but to loop over all cells (which will be slow if you have many) using applymap
.
For that you can use a custom function:
candidates = ["WP_012536716.1","WP_004196791.1","WP_001297437.1"]
S = set(candidates)
def process(x):
if pd.isna(x):
return x
out = ', '.join(S.intersection(x.split(', ')))
if out:
return out
df2 = df.set_index('Orthogroup').applymap(process).dropna(how='all').reset_index()
output:
Orthogroup acidithiobacillus_ferrooxidans acidithiobacillus_thiooxidans burkholderia_pseudomallei e_coli
0 OG0000000 WP_012536716.1 NaN NaN NaN
1 OG0000002 NaN NaN NaN WP_001297437.1
2 OG0000004 NaN NaN WP_004196791.1 NaN