Unable to get values from Pandas Column-CodePudding

Here is my dataset:

site_key                         site_det                             site_xls
169       [{'id': 'XYTTR_23412', 'det': '49', 'person': 'Aron'}]      AMB_129
124       [{'id': 'XYTTR_23699', 'det': '42', 'person': 'Mike'}]      AMB_239
149       [{'id': 'XYTTR_26455', 'det': '47', 'person': 'Ross'}]      AMB_126

I want only these values in my final dataset:

site_key        site_det     site_xls
169             23412        129
124             23699        239
149             26455        126

I tried with regex but it didn't worked... How can I get this values in my dataset

CodePudding user response：

One solution could be:

import pandas as pd

data = {'site_key': {0: 169, 1: 124, 2: 149}, 
        'site_det': {0: [{'id': 'XYTTR_23412', 'det': '49', 'person': 'Aron'}], 
                     1: [{'id': 'XYTTR_23699', 'det': '42', 'person': 'Mike'}], 
                     2: [{'id': 'XYTTR_26455', 'det': '47', 'person': 'Ross'}]}, 
        'site_xls': {0: 'AMB_129', 1: 'AMB_239', 2: 'AMB_126'}}

df = pd.DataFrame(data)

df['site_det'] = df.site_det.explode().apply(pd.Series)['id']\
    .str.extract(r'(\d $)').astype(int)
df['site_xls'] = df.site_xls.str.extract(r'(\d $)').astype(int)

print(df)

   site_key  site_det  site_xls
0       169     23412       129
1       124     23699       239
2       149     26455       126

Explanation:

df.explode will "[t]ransform each element of a list-like to a row, replicating index values". I.e. it will get rid of the list surrounding the dicts in this case.
.apply(pd.Series) will next turn the dicts into a df with the keys as columns. From this df we only need col id.
Next, we can use Series.str.extract to extract only the last part of the string. r'(\d $)' meaning: capture all (one or more) digits (\d ) at the end ($) of a string.
For your other col (site_xls) we need only the last step.

For an alternative to the last step, see the answer by @dnyll. However, if your strings can contain '_' multiple times, you'd want to use .str.rsplit('_', n=1, expand=True).iloc[:,-1].

Finally, if you run into the error: ValueError: cannot convert float NaN to integer, this will mean that one of your strings does not in fact end with one or more digits. The error in this case would result from trying to apply astype(int). To be on the safe side, you could change it to astype(float).

CodePudding user response：

Without regex:

df['site_det'] = df.site_det.explode().apply(pd.Series)['id'].str.split('_', expand=True)[1]
df['site_xls'] = df['site_xls'].str.split('_', expand=True)[1]

Output:

   site_key site_det site_xls
0       169    23412      129
1       124    23699      239
2       149    26455      126