How to remove parts of a string for multiple columns in a pandas dataframe-CodePudding

I have a dataframe that with columns of organisms that contain lists of gene names that are associated with a particular orthogroup. I also have a list of candidate genes that I would like to interrogate this dataframe with.

What I am trying to do is iterate through the dataframe and only keep the part of the string that matches any of the candidate genes in the gene list. If there are no matches for any of the strings in a row, I'd like to just drop that row entirely.

What I have tried: I tried using the solution linked here, which works perfectly, but only for one column - I want to do this for all columns.

I have also tried to use some combinations of np.where() without any yield.

Reproducible data:

Dataframe example:

{'Orthogroup': {0: 'OG0000000',
  1: 'OG0000001',
  2: 'OG0000002',
  3: 'OG0000003',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1, WP_113526628.1',
  1: 'WP_009560941.1, WP_009567366.1, WP_012536141.1, WP_012536296.1, WP_012536581.1, WP_012537464.1, WP_012537498.1, WP_041645259.1, WP_229129755.1',
  2: nan,
  3: 'WP_009562369.1',
  4: 'WP_009564938.1, WP_012535757.1, WP_012535999.1, WP_012536379.1, WP_012536404.1, WP_012536894.1, WP_012537015.1, WP_012537128.1, WP_232027572.1'},
 'acidithiobacillus_thiooxidans': {0: 'WP_010639275.1, WP_010639739.1, WP_010640967.1, WP_029316246.1, WP_157999017.1, WP_157999033.1, WP_157999054.1',
  1: 'WP_010637203.1, WP_010637257.1, WP_010639175.1, WP_010639301.1, WP_010639864.1, WP_010639966.1, WP_010639971.1, WP_010640895.1, WP_010641185.1, WP_050816103.1, WP_226825859.1, WP_226826127.1',
  2: 'WP_235179950.1',
  3: 'WP_010639365.1, WP_010642423.1',
  4: 'WP_010637024.1, WP_010637323.1, WP_010637554.1, WP_010637935.1, WP_010639011.1, WP_010639453.1, WP_010640567.1, WP_010642114.1, WP_024893638.1'},
 'burkholderia_pseudomallei': {0: 'WP_004185903.1, WP_004190283.1, WP_004528082.1, WP_004530921.1, WP_004534726.1, WP_011204972.1',
  1: 'WP_004187662.1, WP_004187694.1, WP_004188321.1, WP_004189039.1, WP_004192300.1, WP_004198029.1, WP_004526108.1, WP_004538115.1, WP_004551242.1, WP_004553906.1, WP_004554106.1, WP_009931394.1',
  2: 'WP_004187329.1, WP_004191606.1, WP_004195354.1, WP_004524029.1, WP_004524162.1, WP_004524990.1, WP_004525719.1, WP_004526156.1, WP_004529761.1, WP_004534748.1, WP_004552066.1, WP_004552880.1, WP_004553008.1, WP_004553113.1, WP_004554092.1, WP_014696855.1, WP_024431032.1, WP_024431563.1, WP_038742977.1, WP_038794545.1, WP_038803065.1',
  3: 'WP_004185837.1, WP_004187646.1, WP_004188318.1, WP_004189230.1, WP_004189405.1, WP_004192049.1, WP_004192063.1, WP_004194581.1, WP_004197652.1, WP_004198340.1, WP_004266649.1, WP_004521296.1, WP_004522591.1, WP_004522925.1, WP_004523005.1, WP_004524197.1, WP_004525408.1, WP_004525453.1, WP_004528137.1, WP_004529948.1, WP_004531197.1, WP_004536377.1, WP_004541514.1, WP_004543954.1, WP_004553290.1, WP_004553756.1, WP_004553993.1, WP_011204859.1',
  4: 'WP_004193481.1, WP_004195334.1, WP_004195802.1, WP_004196705.1, WP_004196791.1, WP_004200681.1, WP_004521379.1, WP_004524701.1, WP_004526731.1, WP_004529133.1, WP_004535483.1, WP_004537932.1, WP_004552076.1, WP_038729337.1'},
 'e_coli': {0: 'WP_000043761.1, WP_000859945.1, WP_000969032.1, WP_001266293.1, WP_001301264.1, WP_001310896.1, WP_001360132.1',
  1: 'WP_000125282.1, WP_000148503.1, WP_000417791.1, WP_000941041.1, WP_001188777.1, WP_001251544.1, WP_001295369.1, WP_001301108.1, WP_001301332.1',
  2: 'WP_000483239.1, WP_000919536.1, WP_001098559.1, WP_001297437.1',
  3: 'WP_000164036.1, WP_000357790.1, WP_000440317.1, WP_000648572.1, WP_000817708.1, WP_000935206.1, WP_001061575.1, WP_001242684.1',
  4: 'WP_000365791.1, WP_000423261.1, WP_000442949.1, WP_000548294.1, WP_000952503.1, WP_001011462.1, WP_001019525.1, WP_001109794.1, WP_001300638.1, WP_001350493.1'}}

Candidate genes example:

["WP_012536716.1","WP_004196791.1","WP_001297437.1"]

Ideal output: Ideally, this should look like:

{'Orthogroup': {0: 'OG0000000',
  2: 'OG0000002',
  4: 'OG0000004'},
 'acidithiobacillus_ferrooxidans': {0: 'WP_012536716.1',
  2: nan,
  4: nan},
 'acidithiobacillus_thiooxidans': {0: nan,
  2: nan,
  4: nan},
 'burkholderia_pseudomallei': {0: nan,
  2: nan,
  4: 'WP_004196791.1'},
 'e_coli': {0: nan,
  2: 'WP_001297437.1',
  4: nan}}

Any help is greatly appreciated!

Thanks.

CodePudding user response：

Use custom lambda function with DataFrame.applymap for filtering splitted values by candidates, if no match get empty strings, so last add DataFrame.replace for replace empty strings to missing values:

candidates = set(["WP_012536716.1","WP_004196791.1","WP_001297437.1"])


f = lambda x: ', '.join(y for y in x.split(', ') if y in candidates)
df = (df.set_index('Orthogroup')
        .fillna('')
        .applymap(f)
        .replace('', np.nan)
        .dropna(how='all')
        .reset_index())
print (df)
  Orthogroup acidithiobacillus_ferrooxidans  acidithiobacillus_thiooxidans  \
0  OG0000000                 WP_012536716.1                            NaN   
1  OG0000002                            NaN                            NaN   
2  OG0000004                            NaN                            NaN   

  burkholderia_pseudomallei          e_coli  
0                       NaN             NaN  
1                       NaN  WP_001297437.1  
2            WP_004196791.1             NaN

CodePudding user response：

You have no choice but to loop over all cells (which will be slow if you have many) using applymap.

For that you can use a custom function:

candidates = ["WP_012536716.1","WP_004196791.1","WP_001297437.1"]

S = set(candidates)

def process(x):
    if pd.isna(x):
        return x
    out = ', '.join(S.intersection(x.split(', ')))
    if out:
        return out

df2 = df.set_index('Orthogroup').applymap(process).dropna(how='all').reset_index()

output:

  Orthogroup acidithiobacillus_ferrooxidans  acidithiobacillus_thiooxidans burkholderia_pseudomallei          e_coli
0  OG0000000                 WP_012536716.1                            NaN                       NaN             NaN
1  OG0000002                            NaN                            NaN                       NaN  WP_001297437.1
2  OG0000004                            NaN                            NaN            WP_004196791.1             NaN