Here is my dataset:
site_key site_det site_xls
169 [{'id': 'XYTTR_23412', 'det': '49', 'person': 'Aron'}] AMB_129
124 [{'id': 'XYTTR_23699', 'det': '42', 'person': 'Mike'}] AMB_239
149 [{'id': 'XYTTR_26455', 'det': '47', 'person': 'Ross'}] AMB_126
I want only these values in my final dataset:
site_key site_det site_xls
169 23412 129
124 23699 239
149 26455 126
I tried with regex but it didn't worked... How can I get this values in my dataset
CodePudding user response:
One solution could be:
import pandas as pd
data = {'site_key': {0: 169, 1: 124, 2: 149},
'site_det': {0: [{'id': 'XYTTR_23412', 'det': '49', 'person': 'Aron'}],
1: [{'id': 'XYTTR_23699', 'det': '42', 'person': 'Mike'}],
2: [{'id': 'XYTTR_26455', 'det': '47', 'person': 'Ross'}]},
'site_xls': {0: 'AMB_129', 1: 'AMB_239', 2: 'AMB_126'}}
df = pd.DataFrame(data)
df['site_det'] = df.site_det.explode().apply(pd.Series)['id']\
.str.extract(r'(\d $)').astype(int)
df['site_xls'] = df.site_xls.str.extract(r'(\d $)').astype(int)
print(df)
site_key site_det site_xls
0 169 23412 129
1 124 23699 239
2 149 26455 126
Explanation:
df.explode
will "[t]ransform each element of a list-like to a row, replicating index values". I.e. it will get rid of the list surrounding thedicts
in this case..apply(pd.Series)
will next turn the dicts into adf
with the keys as columns. From thisdf
we only need colid
.- Next, we can use
Series.str.extract
to extract only the last part of the string.r'(\d $)'
meaning: capture all (one or more) digits (\d
) at the end ($
) of a string. - For your other col (
site_xls
) we need only the last step.
For an alternative to the last step, see the answer
by @dnyll. However, if your strings can contain '_'
multiple times, you'd want to use .str.rsplit('_', n=1, expand=True).iloc[:,-1]
.
Finally, if you run into the error: ValueError: cannot convert float NaN to integer
, this will mean that one of your strings does not in fact end with one or more digits. The error in this case would result from trying to apply astype(int)
. To be on the safe side, you could change it to astype(float)
.
CodePudding user response:
Without regex:
df['site_det'] = df.site_det.explode().apply(pd.Series)['id'].str.split('_', expand=True)[1]
df['site_xls'] = df['site_xls'].str.split('_', expand=True)[1]
Output:
site_key site_det site_xls
0 169 23412 129
1 124 23699 239
2 149 26455 126