I have two data frames that I want to merge by the following condition
- A common value between one column between the 2 data frame
.query("CHROM == Chromosome_hg38")
- Region of one value between the regions of the other value
.query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
These are my data frames:
data = [['chr1', 1,10,'Gene1','ID1'], ['chr2',15,20,'Gene2','ID2'],['chr2',21,40,'Gene3','ID3']]
LOUEF = pd.DataFrame(data, columns = ['Chromosome_hg38', 'Gene_start_hg38', 'Gene_stop_hg38','Gene_name','geneID'])
LOUEF
Chromosome_hg38 Gene_start_hg38 Gene_stop_hg38 Gene_name geneID
0 chr1 1 10 Gene1 ID1
1 chr2 15 20 Gene2 ID2
2 chr2 21 40 Gene3 ID3
data2 = [['chr1', 3,11,'location1'], ['chr1',11,17,'location2'],['chr2',20,30,'location3'],['chr2',15,17,'location3'],['chr5',1,19,'location4']]
cnvs = pd.DataFrame(data2, columns = ['CHROM', 'START', 'END','locations'])
CHROM START END locations
0 chr1 3 11 location1
1 chr1 11 17 location2
2 chr2 20 30 location3
3 chr2 15 17 location3
4 chr5 1 19 location4
The code I am using
cnvs['Gene_name'] = (
cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]
.reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
.query("CHROM == Chromosome_hg38")
.query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
.groupby('index')['Gene_name'].apply(list).str.join(';')
)
With this code I got this
CHROM START END locations Gene_name
0 chr1 3 11 location1 Gene1
1 chr1 11 17 location2 NaN
2 chr2 20 30 location3 Gene2;Gene2
3 chr2 15 17 location3 Gene2
4 chr5 1 19 location4 NaN
As I have done with gene_name, how could I do the same with geneID without the need of repeat the same code again?
CHROM START END locations Gene_name geneID
0 chr1 3 11 location1 Gene1 ID1
1 chr1 11 17 location2 NaN NaN
2 chr2 20 30 location3 Gene2;Gene2 ID2;ID2
3 chr2 15 17 location3 Gene2 ID2
4 chr5 1 19 location4 NaN NaN
CodePudding user response:
This way?
for key in ("Gene_name", "geneID"):
cnvs[key] = (
cnvs.loc[cnvs['CHROM'].isin(LOUEF['Chromosome_hg38'])]
.reset_index().merge(LOUEF, how='cross', suffixes=('', '_'))
.query("CHROM == Chromosome_hg38")
.query('START.between(Gene_start_hg38, Gene_stop_hg38) | END.between(Gene_start_hg38, Gene_stop_hg38)')
.groupby('index')[key].apply(list).str.join(';')
)