I need to locate the first location where the word 'then' appears on Words table. I'm trying to get a code to consolidate all strings on 'text' column from this location till the first text with a substring '666' or '999' in it (in this case a combination of their, stoma22, fe156, sligh334, pain666 (the desired subtrings_output = 'theirfe156sligh334pain666'). I've tried:
their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]
as you can see I'm not sure how to extend the conditioning of 666_999_loc to include substring 666 or 999, also slicing the indexing between two variables renders an error. Many thanks
Words table:
page no | text | font |
---|---|---|
1 | they | 0 |
1 | ate | 0 |
1 | apples | 0 |
2 | and | 0 |
2 | then | 1 |
2 | their | 0 |
2 | stoma22 | 0 |
2 | fe156 | 1 |
2 | sligh334 | 0 |
2 | pain666 | 1 |
2 | given | 0 |
2 | the | 1 |
3 | fruit | 0 |
CodePudding user response:
You just need to add one for the end of the slice, and add an or
condition to the np.where
of the 666_or_999_loc
using the |
operator.
text_col = words['text']
their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]
contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
text_col.str.contains('999', na=True))[0][0]
subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc 1]])
print(subtrings_output)
Output:
theirstoma22fe156sligh334pain666
CodePudding user response:
IIUC, use pandas.Series.idxmax
with "".join()
.
Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value. If multiple values equal the maximum, the first row label with that value is returned.
So, assuming (Words
) is your dataframe, try this :
their_loc = Words["text"].str.contains("their").idxmax()
_666_999_loc = Words["text"].str.contains("666").idxmax()
subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc 1]])
Output :
print(subtrings_output)
#theirstoma22fe156sligh334pain666
#their stoma22 fe156 sligh334 pain666 # <- with " ".join()