Home > other >  merging data frames by conditions and saving common values belonging to the same column efficiently
merging data frames by conditions and saving common values belonging to the same column efficiently

Time:03-17

I have two data frames that I want to merge by the following condition

  1. A common value between one column between the 2 data frame .query("CHROM == Chromosome_hg38")
  2. Region of one value between the regions of the other value .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')

These are my data frames:

data = [['chr1', 1,10,'Gene1','ID1'], ['chr2',15,20,'Gene2','ID2'],['chr2',21,40,'Gene3','ID3']]
LOUEF = pd.DataFrame(data, columns = ['Chromosome_hg38', 'Gene_start_hg38', 'Gene_stop_hg38','Gene_name','geneID'])
LOUEF

Chromosome_hg38 Gene_start_hg38 Gene_stop_hg38  Gene_name   geneID
0   chr1        1               10              Gene1        ID1
1   chr2        15              20              Gene2        ID2
2   chr2        21              40              Gene3        ID3

data2 = [['chr1', 3,11,'location1'], ['chr1',11,17,'location2'],['chr2',20,30,'location3'],['chr2',15,17,'location3'],['chr5',1,19,'location4']]
cnvs = pd.DataFrame(data2, columns = ['CHROM', 'START', 'END','locations'])


CHROM   START   END locations
0   chr1    3   11  location1
1   chr1    11  17  location2
2   chr2    20  30  location3
3   chr2    15  17  location3
4   chr5    1   19  location4

The code I am using

cnvs['Gene_name'] = (
    cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]   
       .reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
       .query("CHROM == Chromosome_hg38")
       .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
       .groupby('index')['Gene_name'].apply(list).str.join(';')
)

With this code I got this

CHROM   START   END locations   Gene_name
0   chr1    3   11  location1   Gene1
1   chr1    11  17  location2   NaN
2   chr2    20  30  location3   Gene2;Gene2
3   chr2    15  17  location3   Gene2
4   chr5    1   19  location4   NaN

As I have done with gene_name, how could I do the same with geneID without the need of repeat the same code again?

CHROM   START   END locations   Gene_name    geneID
0   chr1    3   11  location1   Gene1        ID1
1   chr1    11  17  location2   NaN          NaN
2   chr2    20  30  location3   Gene2;Gene2  ID2;ID2
3   chr2    15  17  location3   Gene2        ID2
4   chr5    1   19  location4   NaN          NaN

CodePudding user response:

This way?

for key in ("Gene_name", "geneID"):
    cnvs[key] = (
        cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]   
           .reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
           .query("CHROM == Chromosome_hg38")
           .query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
           .groupby('index')[key].apply(list).str.join(';')
    )
  • Related