How to extract elements from a list in pandas through regex?

I'm looking to extract the code that comes after 'accession' in this DataFrame. My DataFrame looks like this:

targets_list = pd.DataFrame(targets_df[['target_components', 'target_chembl_id']])

and each element of the target_components column looks like the following:

[{'accession': 'O43451', 'component_description': 'Maltase-glucoamylase, intestinal', 'component_id': 434, 'component_type': 'PROTEIN', 'relationship': 'SINGLE PROTEIN', 'target_component_synonyms',...}]

I would just like to extract the code after 'accession'. Since I thought it was the first element of the list, I tried tgt = targets_list['target_components'][0][0], but this returns the whole first element of the list in the first row (a dict), not the accession code.

I can see that each row holds a list, but what I'm missing is how to parse that list, get the accession code, and add it as a new column. Maybe it is possible with regex? I'm not sure how regex works at all, though.

CodePudding user response：

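You can use the .str.findall() or .str.extract() methods to get the id. Note that these work on strings, so a column that holds lists has to be converted to strings first.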

Refer to: Use regular expression to extract elements from a pandas data frame
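As a rough sketch of the regex route, assuming each cell really is a list of dicts as in the question: convert every cell to its string form, then capture whatever sits between the quotes after 'accession'. The sample rows, the second accession, and the target_chembl_id values here are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the column in the question.
targets_list = pd.DataFrame({
    "target_components": [
        [{"accession": "O43451", "component_id": 434}],
        [{"accession": "P12345", "component_id": 7}],
    ],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B"],  # placeholder ids
})

# Turn each list into its string form so the .str accessor can be used.
as_text = targets_list["target_components"].map(str)

# Capture the text between the quotes that follow 'accession':
pattern = r"'accession':\s*'([^']+)'"
targets_list["accession"] = as_text.str.extract(pattern, expand=False)

print(targets_list[["target_chembl_id", "accession"]])
```

This only picks up the first 'accession' in each cell, and it depends on how Python prints the dict, so indexing the list directly is usually the more robust choice when the cells are real lists.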

CodePudding user response：

You can try this:

targets_list['target_components'].map(lambda x: x[0]["accession"])
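If some rows might hold an empty list or a dict without an 'accession' key (an assumption, since the question only shows one row), the lambda above would raise an error. A minimal sketch of a more defensive version follows; first_accession is a hypothetical helper name and the sample data is made up.

```python
import pandas as pd

def first_accession(components):
    """Return the accession of the first component, or None if unavailable."""
    if isinstance(components, list) and components:
        # .get() returns None instead of raising when the key is absent.
        return components[0].get("accession")
    return None

# Hypothetical sample data: the second row has no components at all.
targets_list = pd.DataFrame({
    "target_components": [
        [{"accession": "O43451", "component_id": 434}],
        [],
    ],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B"],  # placeholder ids
})

targets_list["accession"] = targets_list["target_components"].map(first_accession)
print(targets_list[["target_chembl_id", "accession"]])
```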

CodePudding user response：

First, there is no need to call pd.DataFrame again to create a DataFrame from existing columns:

targets_list = targets_df[['target_components', 'target_chembl_id']]

Then you can use apply to pull the accession out of the first element of each row's list:

tgt = targets_list['target_components'].apply(lambda x: x[0]['accession'])
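To add the result as a column, as the question asks, something like the following sketch should work. The .copy() is my addition: it makes the column subset an independent DataFrame, so that assigning a new column does not trigger pandas' SettingWithCopyWarning. The contents of targets_df here are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the targets_df from the question.
targets_df = pd.DataFrame({
    "target_components": [
        [{"accession": "O43451", "component_type": "PROTEIN"}],
        [{"accession": "P12345", "component_type": "PROTEIN"}],
    ],
    "target_chembl_id": ["CHEMBL_A", "CHEMBL_B"],  # placeholder ids
    "organism": ["Homo sapiens", "Homo sapiens"],
})

# Select the two columns and copy them so the subset can be modified safely.
targets_list = targets_df[["target_components", "target_chembl_id"]].copy()

# Store the accession of the first component of each row in a new column.
targets_list["accession"] = targets_list["target_components"].apply(
    lambda x: x[0]["accession"]
)
print(targets_list[["target_chembl_id", "accession"]])
```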