I have a dataframe with unique gene IDs and multiple columns like this
Table 1
geneID | column 1 | column 2 | column n... |
---|---|---|---|
gene1 | ... | ... | |
gene2 | ... | ... | |
gene3 | ... | ... | |
gene4 | ... | ... | |
... | ... | ... |
And another dataframe with unique GO IDs with more than one gene IDs
Table 2
GO_ID | Description | geneID |
---|---|---|
GO:58479 | one biological description | gene1/gene2/gene3 |
GO:58014 | another description | gene1 |
GO:37402 | description 3 | gene1/gene4/gene6 |
GO:28390 | one more | gene2 |
... | ... | ... |
I don't know if it's possible but I would like to add a new column to the table 1 with all the GO IDs and descriptions for each gene. I want just a space bar separation between ID and description but another separation like "/" or ";" between different GO IDs like this:
Table 3
geneID | GO | column 1 | column n... |
---|---|---|---|
gene1 | GO:58479 one biological description; GO:58014 another description; GO:37402 description 3;... | ... | |
gene2 | GO:58479 one biological description; GO:28390 one more; ... | ... | |
gene3 | ... | ... | |
gene4 | ... | ... | |
... | ... | ... |
Any idea how to solve it?
CodePudding user response:
Example
we need miniaml and reproducible example code for answer. let's make
data1 = [['id1', 1], ['id2', 2], ['id3', 3]]
data2 = [['go1', 'des1', 'id1/id2/id3'], ['go2', 'des2', 'id1']]
df1 = pd.DataFrame(data1, columns=['id', 'col1'])
df2 = pd.DataFrame(data2, columns=['go_id', 'des', 'id'])
df1
id col1
0 id1 1
1 id2 2
2 id3 3
df2
go_id des id
0 go1 des1 id1/id2/id3
1 go2 des2 id1
Code
1st step. split df2 and explode
df3 = df2.assign(id=df2['id'].str.split('/')).explode('id')
df3
go_id des id
0 go1 des1 id1
0 go1 des1 id2
0 go1 des1 id3
1 go2 des2 id1
2nd step. join and merge
out = (df3['go_id'].str.cat(df3['des'], ' ').groupby(df3['id']).agg(';'.join)
.reset_index(name='go').merge(df1, how='right'))
out
id go col1
0 id1 go1 des1;go2 des2 1
1 id2 go1 des1 2
2 id3 go1 des1 3
CodePudding user response:
Here is a proposition with pandas.DataFrame.merge
and pandas.DataFrame.explode
:
out = (
df1
.merge(df2
.assign(geneID= df2["geneID"]
.replace("gen", "gene ", regex=True) #read NB below
.str.split("/"),
GO= df2[["GO_ID", "Description"]].agg(" ".join, axis=1))
.explode("geneID")
.groupby("geneID", as_index=False, sort=False)["GO"].agg("; ".join),
on="geneID", how="left")
[["geneID", "GO"] df1.columns[1:].tolist()]
)
# Output :
print(out)
geneID GO column 1 column 2 column n...
0 gene 1 GO:58479 one biological description; GO:58014 another description; GO:37402 description 3 ... ... NaN
1 gene 2 GO:58479 one biological description; GO:28390 one more ... ... NaN
2 gene 3 GO:58479 one biological description ... ... NaN
3 gene 4 GO:37402 description 3 ... ... NaN
NB :
In your table2, you write (gen1
) instead of (gene 1
). So I'm not sure if it's a typo or not but if it's the case, remove the string replacement line.