Hi, I have a single large string. I need to search it for a set of strings, find the rows that contain them, and create a DataFrame from those rows.
Large string:
This is democracy’s day.
A day of history and hope.
Of renewal and resolve.
Through a crucible for the ages America has been tested anew and America has risen to the challenge.
Today, we celebrate the triumph not of a candidate, but of a cause, the cause of democracy.
The will of the people has been heard and the will of the people has been heeded.
We have learned again that democracy is precious.
Now I want to search for a few strings in the text above, and my final output DataFrame should look like the one below.
Searching string
democracy’s day
America has been tested
celebrate the triumph
democracy is precious
Thanks in advance
CodePudding user response:
You can create a regex out of your search strings and compare them for a match against the Large String column using str.extract. Where there's a match, the matched string will be the value in the Searching String column; otherwise it will be null. The dataframe can then be filtered on the Searching String value being not null:
import re
import pandas as pd

df = pd.DataFrame({ 'Large String': ["This is democracy's day.", "A day of history and hope.","Of renewal and resolve.","Through a crucible for the ages America has been tested anew and America has risen to the challenge.","Today, we celebrate the triumph not of a candidate, but of a cause, the cause of democracy.","The will of the people has been heard and the will of the people has been heeded.","We have learned again that democracy is precious."] })
search_strings = ["democracy's day", "America has been tested", "celebrate the triumph", "democracy is precious"]
# Escape each search string and join them into a single alternation pattern
regex = '|'.join(map(re.escape, search_strings))
# str.extract returns the first match in each row, or NaN where nothing matches
df['Searching String'] = df['Large String'].str.extract(f'({regex})')
# Keep only the rows where one of the search strings was found
df = df[df['Searching String'].notna()]
print(df)
Output:
                                        Large String         Searching String
0                           This is democracy's day.          democracy's day
3  Through a crucible for the ages America has be...  America has been tested
4  Today, we celebrate the triumph not of a candi...    celebrate the triumph
6  We have learned again that democracy is precious.    democracy is precious
Note:
- we use re.escape on the search strings in case they contain characters that are special in a regex, e.g. . or ( etc.
- if one of the search strings is a substring of another, the list should be sorted in order of decreasing length to ensure the longer matches are captured
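The second note can be illustrated with a minimal sketch. It uses the standard re module directly on a single sentence rather than on the DataFrame, and the helper name build_pattern and the overlapping search strings are made up for this illustration; they are not part of the original answer. The assumption is that the resulting pattern would then be passed to str.extract in the same way as above.

```python
import re

def build_pattern(search_strings):
    # Hypothetical helper: sort longest first so that, when two alternatives
    # can match at the same position, the longer phrase is tried first.
    ordered = sorted(search_strings, key=len, reverse=True)
    return "(" + "|".join(map(re.escape, ordered)) + ")"

sentence = "We have learned again that democracy is precious."
# Illustrative search strings where one is a prefix of the other
search_strings = ["democracy", "democracy is precious"]

# Pattern built in the given order: the shorter alternative comes first
naive_pattern = "(" + "|".join(map(re.escape, search_strings)) + ")"
# Pattern built after sorting by decreasing length
sorted_pattern = build_pattern(search_strings)

naive_match = re.search(naive_pattern, sentence).group(1)
sorted_match = re.search(sorted_pattern, sentence).group(1)

print(naive_match)   # democracy
print(sorted_match)  # democracy is precious
```

Regex alternation takes the first alternative that matches at a given position, which is why the unsorted pattern stops at the shorter phrase here.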