retrieve cell string values in a column between two unknown indexes based on substrings location-CodePudding

I need to locate the first location where the word 'then' appears on Words table. I'm trying to get a code to consolidate all strings on 'text' column from this location till the first text with a substring '666' or '999' in it (in this case a combination of their, stoma22, fe156, sligh334, pain666 (the desired subtrings_output = 'theirfe156sligh334pain666'). I've tried:

their_loc = np.where(words['text'].str.contains(r'their', na =True))[0][0]
666_999_loc = np.where(words['text'].str.contains(r'666', na =True))[0][0]
subtrings_output = Words['text'].loc[Words.index[their_loc:666_999_loc]]

as you can see I'm not sure how to extend the conditioning of 666_999_loc to include substring 666 or 999, also slicing the indexing between two variables renders an error. Many thanks

Words table:

page no	text	font
1	they	0
1	ate	0
1	apples	0
2	and	0
2	then	1
2	their	0
2	stoma22	0
2	fe156	1
2	sligh334	0
2	pain666	1
2	given	0
2	the	1
3	fruit	0

CodePudding user response：

You just need to add one for the end of the slice, and add an or condition to the np.where of the 666_or_999_loc using the | operator.

text_col = words['text']

their_loc = np.where(text_col.str.contains(r'their', na=True))[0][0]

contains_666_or_999_loc = np.where(text_col.str.contains('666', na=True) |
                                   text_col.str.contains('999', na=True))[0][0]

subtrings_output = ''.join(text_col.loc[words.index[their_loc:contains_666_or_999_loc   1]])

print(subtrings_output)

Output:

theirstoma22fe156sligh334pain666

CodePudding user response：

IIUC, use pandas.Series.idxmax with "".join().

Series.idxmax(axis=0, skipna=True, *args, **kwargs)
Return the row label of the maximum value. If multiple values equal the maximum, the first row label with that value is returned.

So, assuming (Words) is your dataframe, try this :

their_loc = Words["text"].str.contains("their").idxmax()

_666_999_loc = Words["text"].str.contains("666").idxmax()

subtrings_output = "".join(Words["text"].loc[Words.index[their_loc:_666_999_loc 1]])

Output :

print(subtrings_output)
#theirstoma22fe156sligh334pain666

#their stoma22 fe156 sligh334 pain666 # <- with " ".join()