Complex partial string matching in pandas

Given a dataframe with the following structure and json_path values:

json_path	Reporting Group	Entity/Grouping
data.attributes.total.children.[0]	Christian Family	Abraham Family
data.attributes.total.children.[0].children.[0]	Christian Family	In Estate
data.attributes.total.children.[0].children.[0].children.[0].children.[0]	Christian Family	Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]	Christian Family	Investment Grade Fixed Income

How would I filter for the rows whose json_path contains children four times? That is, I want to keep the rows at index positions 2 and 3:

json_path	Reporting Group	Entity/Grouping
data.attributes.total.children.[0].children.[0].children.[0].children.[0]	Christian Family	Cash
data.attributes.total.children.[0].children.[0].children.[1].children.[0]	Christian Family	Investment Grade Fixed Income

I know how to obtain a partial match. However, the integers in the square brackets will be inconsistent, so my instinct is to count the occurrences of children (i.e., children appearing four times) and use that count as the basis for the filter.

Any suggestions or resources on how I can achieve this?

CodePudding user response:

As you said, a naive approach is to count the occurrences of .children and compare the count with 4 to create a boolean mask, which can then be used to filter the rows:

df[df['json_path'].str.count(r'\.children').eq(4)]
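To make the counting approach easy to try, here is a minimal, self-contained sketch. The dataframe is rebuilt by hand from the sample rows in the question; the variable names `counts`, `mask`, and `filtered` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 4,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
    ],
})

# Count how many times ".children" occurs in each path.
counts = df["json_path"].str.count(r"\.children")

# Boolean mask: True where the path contains exactly four occurrences.
mask = counts.eq(4)
filtered = df[mask]

print(counts.tolist())          # [1, 2, 4, 4]
print(filtered.index.tolist())  # [2, 3]
```

Keeping the counts in a separate variable makes it simple to change the condition later, for example to `counts.ge(4)` if deeper paths should also be kept.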

A more robust approach is to check for four consecutive occurrences of .children.[n]:

df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]

                                                                   json_path   Reporting Group                Entity/Grouping
2  data.attributes.total.children.[0].children.[0].children.[0].children.[0]  Christian Family                           Cash
3  data.attributes.total.children.[0].children.[0].children.[1].children.[0]  Christian Family  Investment Grade Fixed Income
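The two approaches can disagree, which a small sketch may help illustrate. The paths below are hypothetical (only the first comes from the question), chosen so that one has four non-consecutive occurrences and another has five consecutive ones. The pattern uses a non-capturing group `(?:...)`, which keeps `str.contains` from warning about match groups.

```python
import pandas as pd

# Hypothetical paths for illustration; only the first comes from the question.
paths = pd.Series([
    # four consecutive ".children.[n]" segments
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
    # four occurrences in total, but the run is interrupted by ".total"
    "data.attributes.children.[0].total.children.[1].children.[2].children.[3]",
    # five consecutive segments
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0].children.[0]",
])

# Approach 1: exactly four occurrences anywhere in the path.
by_count = paths.str.count(r"\.children").eq(4)

# Approach 2: a run of four consecutive ".children.[n]" segments.
by_run = paths.str.contains(r"(?:\.children\.\[\d+\]){4}")

print(by_count.tolist())  # [True, True, False]
print(by_run.tolist())    # [True, False, True]
```

Note that `str.contains` matches a run of at least four segments, so deeper paths also pass. If exactly four levels are required, combining both masks (`by_count & by_run`) is one option.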