parsing strings for specific elements - python

Time:02-16

I have a pandas DataFrame with a column of sentences that follow this pattern:

row 1 of column: "ID is 123 or ID is 234 or ID is 345"
row 2 of column: "ID is 123 or ID is 567 or ID is 876"
row 3 of column: "ID is 567 or ID is 567 or ID is 298"

My aim is to extract the numbers in each row and save them in a list or NumPy array. Since there is a pattern (the number always comes after "ID is"), I thought regex might be the best way to go, but I am not sure how to use regex for multiple extractions from one string.

Any advice?

CodePudding user response:

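With the standard re module you can match each run of digits with '\d+':

re.findall(r'\d+', "ID is 123 or ID is 234 or ID is 345")

This returns the list ['123', '234', '345']. Note that the matches are strings, not integers.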

To make sure you only pick up numbers that follow "ID is", you can also use a capture group, 'ID is (\d+)':

re.findall(r'ID is (\d+)', "ID is 123 or ID is 234 or ID is 345")
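The difference between the two patterns only shows up when the string contains other digits. Here is a minimal sketch, assuming a hypothetical input with an extra number ("ref 999") that is not part of the original data. It also shows one way to convert the matched strings to integers.

```python
import re

# Hypothetical input: the trailing "(ref 999)" is added for illustration only
text = "ID is 123 or ID is 234 or ID is 345 (ref 999)"

# Every run of digits, regardless of what precedes it
all_numbers = re.findall(r'\d+', text)

# Only the digits that directly follow "ID is "
id_numbers = re.findall(r'ID is (\d+)', text)

# findall returns strings; convert them when integers are needed
id_ints = [int(n) for n in id_numbers]

print(all_numbers)
print(id_numbers)
print(id_ints)
```

When the pattern contains exactly one capture group, re.findall returns only the captured part, which is why the "ID is " prefix does not appear in the results.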

In a DataFrame you can use .str.findall() to apply the same pattern to every row of a column.

import pandas as pd


df = pd.DataFrame({
  'ID': [
    "ID is 123 or ID is 234 or ID is 345",
    "ID is 123 or ID is 567 or ID is 876",
    "ID is 567 or ID is 567 or ID is 298",
  ]
})

print('\n--- before ---\n')
print(df)
 
# Raw string so that \d reaches the regex engine unchanged; + means one or more digits
df['result'] = df['ID'].str.findall(r'ID is (\d+)')

print('\n--- after ---\n')
print(df)

Result:

--- before ---

                                    ID
0  ID is 123 or ID is 234 or ID is 345
1  ID is 123 or ID is 567 or ID is 876
2  ID is 567 or ID is 567 or ID is 298

--- after ---

                                    ID           result
0  ID is 123 or ID is 234 or ID is 345  [123, 234, 345]
1  ID is 123 or ID is 567 or ID is 876  [123, 567, 876]
2  ID is 567 or ID is 567 or ID is 298  [567, 567, 298]

If you only need the result column as a NumPy array, use df['result'].values. This gives an object array whose elements are lists of strings.

And if you need it as a nested list: df['result'].values.tolist().
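Because the matches are strings, a numeric array needs one more conversion step. The following is a sketch rather than the only way to do it. The column name result_int is made up for this example, and building a 2-D array this way assumes every row yields the same number of matches, as in the sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [
        "ID is 123 or ID is 234 or ID is 345",
        "ID is 123 or ID is 567 or ID is 876",
        "ID is 567 or ID is 567 or ID is 298",
    ]
})

df['result'] = df['ID'].str.findall(r'ID is (\d+)')

# Convert each list of digit strings into a list of ints
# ('result_int' is a hypothetical column name)
df['result_int'] = df['result'].apply(lambda nums: [int(n) for n in nums])

# Plain nested Python list
nested = df['result_int'].tolist()

# 2-D integer array; only rectangular because each row has three matches
arr = np.array(nested)

print(nested)
print(arr.shape)
```

If rows can contain different numbers of IDs, keeping the nested list (or the object column) is the safer choice, since a ragged list does not map cleanly onto a rectangular array.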
