Given a dataframe with the following structure and values:
json_path | Reporting Group | Entity/Grouping |
---|---|---|
data.attributes.total.children.[0] | Christian Family | Abraham Family |
data.attributes.total.children.[0].children.[0] | Christian Family | In Estate |
data.attributes.total.children.[0].children.[0].children.[0].children.[0] | Christian Family | Cash |
data.attributes.total.children.[0].children.[0].children.[1].children.[0] | Christian Family | Investment Grade Fixed Income |
How would I filter for the json_path rows which contain children four times? I.e., I want to keep the rows at index positions 2-3:
json_path | Reporting Group | Entity/Grouping |
---|---|---|
data.attributes.total.children.[0].children.[0].children.[0].children.[0] | Christian Family | Cash |
data.attributes.total.children.[0].children.[0].children.[1].children.[0] | Christian Family | Investment Grade Fixed Income |
I know how to obtain a partial match, but the integers in the square brackets will be inconsistent, so my instinct is to count the occurrences of children (i.e., children appearing 4x) and use that count as the basis for the filter.
Any suggestions or resources on how I can achieve this?
CodePudding user response:
As you said, a naive approach would be to count the occurrences of .children in each path and compare the count with 4, which creates a boolean mask that can be used to filter the rows:
df[df['json_path'].str.count(r'\.children').eq(4)]
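To make the intermediate steps visible, here is a minimal sketch of the count-based mask. The Series named paths is a hypothetical stand-in for df['json_path'], rebuilt from the four rows in the question.

```python
import pandas as pd

# Hypothetical stand-in for df['json_path'], copied from the question's table.
paths = pd.Series([
    "data.attributes.total.children.[0]",
    "data.attributes.total.children.[0].children.[0]",
    "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
    "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
])

# Count the non-overlapping regex matches of ".children" in each path.
counts = paths.str.count(r"\.children")

# Boolean mask: True only where ".children" occurs exactly four times.
mask = counts.eq(4)

print(counts.tolist())
print(mask.tolist())
```

Note that this counts .children anywhere in the string, regardless of what follows it, which is why it is only a rough check.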
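A more robust approach would be to check for four consecutive occurrences of .children.[N], where N is any integer. The group is non-capturing, (?:...), so that pandas does not warn about match groups in str.contains:
df[df['json_path'].str.contains(r'(?:\.children\.\[\d+\]){4}')]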
json_path Reporting Group Entity/Grouping
2 data.attributes.total.children.[0].children.[0].children.[0].children.[0] Christian Family Cash
3 data.attributes.total.children.[0].children.[0].children.[1].children.[0] Christian Family Investment Grade Fixed Income
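For anyone who wants to reproduce the result, here is a self-contained sketch. The DataFrame is rebuilt by hand from the table in the question, and the variable names pattern and filtered are only illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question's table.
df = pd.DataFrame({
    "json_path": [
        "data.attributes.total.children.[0]",
        "data.attributes.total.children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[0].children.[0]",
        "data.attributes.total.children.[0].children.[0].children.[1].children.[0]",
    ],
    "Reporting Group": ["Christian Family"] * 4,
    "Entity/Grouping": [
        "Abraham Family",
        "In Estate",
        "Cash",
        "Investment Grade Fixed Income",
    ],
})

# Four consecutive ".children.[N]" segments, N being one or more digits.
# The group is non-capturing so str.contains does not warn about match groups.
pattern = r"(?:\.children\.\[\d+\])" + "{4}"

# Keep only the rows whose json_path contains the pattern.
filtered = df[df["json_path"].str.contains(pattern)]
print(filtered)
```

Since str.contains searches anywhere in the string, a path nested five or more levels deep would also match. If such paths can occur and should be excluded, the pattern would need to be anchored or combined with the count check.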