Home > Enterprise >  Agregate two column values from a dataframe together in one column of another dataframe in Python
Agregate two column values from a dataframe together in one column of another dataframe in Python

Time:12-22

I have a dataframe with unique gene IDs and multiple columns like this

Table 1

geneID column 1 column 2 column n...
gene1 ... ...
gene2 ... ...
gene3 ... ...
gene4 ... ...
... ... ...

And another dataframe with unique GO IDs with more than one gene IDs

Table 2

GO_ID Description geneID
GO:58479 one biological description gene1/gene2/gene3
GO:58014 another description gene1
GO:37402 description 3 gene1/gene4/gene6
GO:28390 one more gene2
... ... ...

I don't know if it's possible but I would like to add a new column to the table 1 with all the GO IDs and descriptions for each gene. I want just a space bar separation between ID and description but another separation like "/" or ";" between different GO IDs like this:

Table 3

geneID GO column 1 column n...
gene1 GO:58479 one biological description; GO:58014 another description; GO:37402 description 3;... ...
gene2 GO:58479 one biological description; GO:28390 one more; ... ...
gene3 ... ...
gene4 ... ...
... ... ...

Any idea how to solve it?

CodePudding user response:

Example

we need miniaml and reproducible example code for answer. let's make

data1 = [['id1', 1], ['id2', 2], ['id3', 3]]
data2 = [['go1', 'des1', 'id1/id2/id3'], ['go2', 'des2', 'id1']]
df1 = pd.DataFrame(data1, columns=['id', 'col1'])
df2 = pd.DataFrame(data2, columns=['go_id', 'des', 'id']) 

df1

    id  col1
0   id1 1
1   id2 2
2   id3 3

df2

    go_id   des     id
0   go1     des1    id1/id2/id3
1   go2     des2    id1

Code

1st step. split df2 and explode

df3 = df2.assign(id=df2['id'].str.split('/')).explode('id')

df3

    go_id   des     id
0   go1     des1    id1
0   go1     des1    id2
0   go1     des1    id3
1   go2     des2    id1

2nd step. join and merge
out = (df3['go_id'].str.cat(df3['des'], ' ').groupby(df3['id']).agg(';'.join)
       .reset_index(name='go').merge(df1, how='right'))

out

    id  go                  col1
0   id1 go1 des1;go2 des2   1
1   id2 go1 des1            2
2   id3 go1 des1            3

CodePudding user response:

Here is a proposition with pandas.DataFrame.merge and pandas.DataFrame.explode :

out = (
        df1
          .merge(df2
                  .assign(geneID= df2["geneID"]
                                    .replace("gen", "gene ", regex=True) #read NB below
                                    .str.split("/"),
                          GO= df2[["GO_ID", "Description"]].agg(" ".join, axis=1))
                  .explode("geneID")
                  .groupby("geneID", as_index=False, sort=False)["GO"].agg("; ".join),
                 on="geneID", how="left")
          [["geneID", "GO"]   df1.columns[1:].tolist()]
      )

# Output :

print(out)

   geneID                                                                                         GO column 1 column 2  column n...
0  gene 1  GO:58479 one biological description; GO:58014 another description; GO:37402 description 3      ...      ...          NaN
1  gene 2                                     GO:58479 one biological description; GO:28390 one more      ...      ...          NaN
2  gene 3                                                        GO:58479 one biological description      ...      ...          NaN
3  gene 4                                                                     GO:37402 description 3      ...      ...          NaN

NB : In your table2, you write (gen1) instead of (gene 1). So I'm not sure if it's a typo or not but if it's the case, remove the string replacement line.

  • Related