Is there a way to create a new column based on a substring and text that follow it?-CodePudding

I am working with triqler data outputs. I have been able to convert it into a pandas dataframe that looks like the following:

| A | B |
|:---- |:----
| Jan05 GN=XYZ NA| Q9GLD3;A0A0A0RBT6| |
| Dec25 GN=ZYX Y | A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198|
| Nov12 GN=NHAN Y| P60524;A0A1K0GGH0;A0A8I3P9B9|
|May22 GN=GZV X|P09582;A0A1R3UGQ4;A0A8I3NWV7|
|Jun24 MNIAV X|P09582;A0A1R3UGQ4;A0A8I3NWV7|
|May22 CAUCGZV Y|P09582;A0A1R3UGQ4;A0A8I3NWV7|

I need to create a new column that either uses the string after the "GN=" from the values in column A or the first part before the semicolon in column B if there is no "GN=" in column A.

In other words, I need it to look like this:

| A | B | C|
|:---- |:----|:----|
| Jan05 GN=XYZ NA| Q9GLD3;A0A0A0RBT6| XYZ|
| Dec25 GN=ZYX Y | A0A8I3PIE0;A0A8I3PEP4;A0A0B4J198|ZYX|
| Nov12 GN=NHAN Y| P60524;A0A1K0GGH0;A0A8I3P9B9|NHAN|
|May22 GN=GZV X|P09582;A0A1R3UGQ4;A0A8I3NWV7|GZV|
|Jun24 MNIAV X|P09582;A0A1R3UGQ4;A0A8I3NWV7|P09582|
|May22 CAUCGZV Y|P09582;A0A1R3UGQ4;A0A8I3NWV7|P09582|

I am not sure which packages or commands I should be using to accomplish this goal. I think that a conditional statement would be the way to go.

CodePudding user response：

Adding on the previous answer to eliminate NA

import pandas as pd
#example of data
cols=['A','B']
data=[['Jan05 GN=XYZ NA', 'Q9GLD3;A0A0A0RBT6'],['Jan05 GN=XYZ NA','Q9GLD3;A0A0A0RBT6'],['Jun24 MNIAV X','P09582;A0A1R3UGQ4;A0A8I3NWV7']]
df=pd.DataFrame(data,columns=cols)
df
df['C']=df.apply(lambda x:x["A"].split("GN=")[1].split()[0] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)

CodePudding user response：

try this

data['new_column']=data.apply(lambda x:x["A"].split("GN=")[1] if("GN="in x["A"]) else x["B"].split(";")[0],axis=1)