How can lists be distinguished depending on the types of their items?

I have converted some XML files with xmltodict to native Python types (so it "feel[s] like [I am] working with JSON"). The converted objects have a lot of "P" keys with values that might be one of:

a list of strings
a list of None and a string
a list of dicts
a list of lists of dicts

If the list contains only strings, or only strings and None, then it should be converted to a string using join. If the list contains dicts or lists, then it should be skipped without processing.

How can the code tell these cases apart, so as to determine which action should be performed?

Example data for the first two cases, which should be joined:

["Bla Bla"]
[null,"Bla bla"]

Example data for the last two cases, which should be skipped:

[{"CPV_CODE":{"CODE":79540000}}]
[
  [{"CPV_CODE":{"CODE":79530000}}, {"CPV_CODE":{"CODE":79540000}}],
  [{"CPV_CODE":{"CODE":79550000}}]
]

This is done in a function that processes the data:

def recursive_iter(obj):
    if isinstance(obj, dict):
        for item in obj.values():
            if "P" in obj and isinstance(obj["P"], list) and not isinstance(obj["P"], dict):
                #need to add a check for not dict and list in list
                obj["P"] = " ".join([str(e) for e in obj["P"]])
            else:
                yield from recursive_iter(item)
    elif any(isinstance(obj, t) for t in (list, tuple)):
        for item in obj:
            yield from recursive_iter(item)
    else:
        yield obj

CodePudding user response：

Since you want to find the lists that contain only strings, you can check every item with isinstance:

if ("P" in obj) and isinstance(obj["P"], list):
    if all([isinstance(z, str) for z in obj["P"]]):
        ...  # keep list with strings

Is this what you want?
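
As a quick sanity check, here is a sketch that runs this test against the sample lists from the question. The helper name only_strings is made up for illustration. Note that a plain str test rejects the [None, "Bla bla"] case, so None would have to be allowed explicitly if such lists should also be joined.

```python
def only_strings(items):
    # True when items is a list whose elements are all str.
    return isinstance(items, list) and all(isinstance(z, str) for z in items)

samples = [
    ["Bla Bla"],
    [None, "Bla bla"],
    [{"CPV_CODE": {"CODE": 79540000}}],
]
results = [only_strings(s) for s in samples]
print(results)  # [True, False, False]
```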

CodePudding user response：

Let's start with the two conditions:

1. the list contains only strings or the list contains only strings and null / None
2. the list contains dict or list(s)

The first subcondition of the first condition is covered by the second subcondition, so #1 can be simplified to:

1. the list contains only strings or None

Now let's rephrase them in something resembling a first order logic:

1. All the list items are None or strings.
2. Some list item is a dict or a list.

The way that condition #2 is written, it could use an "All" quantifier, but in the context of the operation (whether or not to join the list items), a "some" is appropriate, and more closely aligns with the negation of condition 1 ("Some list item is not None or a string"). Also, it allows for an illustration of another implementation (shown below).

These two conditions are mutually exclusive, though not necessarily exhaustive. To simplify matters, let's assume that, in practice, these are the only two possibilities. Leaving aside the quantifiers ("All", "Some"), these are easily translatable into generator expressions:

(item is None or isinstance(item, str) for item in items)
(isinstance(item, (dict, list)) for item in items)

Note that isinstance accepts a tuple of types (which basically functions as a union type) as its second argument, allowing multiple types to be checked in one call. You could make use of this to combine the two tests into one by using NoneType, either isinstance(item, (str, types.NoneType)) (Python 3.10 or later) or isinstance(item, (str, type(None))), but this doesn't gain you much of anything.
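
For illustration, a small sketch of the tuple form. The sample values are arbitrary, and type(None) is used so the snippet does not depend on types.NoneType being available.

```python
# type(None) works on every Python 3 version; types.NoneType needs 3.10+.
NoneType = type(None)

# One isinstance call covers both "is a str" and "is None".
checks = [isinstance(item, (str, NoneType)) for item in ["Bla bla", None, 42, {}]]
print(checks)  # [True, True, False, False]
```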

The "All" and "Some" quantifiers are expressed as the all and any functions, which take iterables (such as what is produced by generator expressions):

all(item is None or isinstance(item, str) for item in items)
any(isinstance(item, (dict, list)) for item in items)
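
To see the two quantified conditions in action, here is a minimal sketch using sample data from the question. The function names all_str_or_none and some_dict_or_list are hypothetical wrappers added only so the expressions can be called.

```python
def all_str_or_none(items):
    # Condition 1: every item is None or a string.
    return all(item is None or isinstance(item, str) for item in items)

def some_dict_or_list(items):
    # Condition 2: at least one item is a dict or a list.
    return any(isinstance(item, (dict, list)) for item in items)

joinable = [None, "Bla bla"]
nested = [{"CPV_CODE": {"CODE": 79540000}}]

print(all_str_or_none(joinable), some_dict_or_list(joinable))  # True False
print(all_str_or_none(nested), some_dict_or_list(nested))      # False True

# An empty list satisfies condition 1 vacuously and never satisfies condition 2.
print(all_str_or_none([]), some_dict_or_list([]))              # True False
```

The empty-list case is worth keeping in mind: with either condition an empty "P" would be treated as joinable, and joining it produces an empty string.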

Abstracting these expressions into functions gives two options for the implementation. From recursive_iter, it looks like the value for a "P" might not always be a list. To guard against this, an isinstance(items, list) condition is included:

# 1
def shouldJoin(items):
    return isinstance(items, list) and all(item is None or isinstance(item, str) for item in items)

# 2
def shouldJoin(items):
    return isinstance(items, list) and not any(isinstance(item, (dict, list)) for item in items)
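
The two options agree on the four cases from the question but not on everything, since the conditions are not exhaustive. The sketch below gives them distinct, made-up names (should_join_all and should_join_not_any) so both can be run side by side; the [1, 2] case is an invented example of a list that is neither "strings/None" nor "dicts/lists".

```python
def should_join_all(items):
    # Option 1: join only if every item is None or a string.
    return isinstance(items, list) and all(
        item is None or isinstance(item, str) for item in items
    )

def should_join_not_any(items):
    # Option 2: join unless some item is a dict or a list.
    return isinstance(items, list) and not any(
        isinstance(item, (dict, list)) for item in items
    )

cases = [
    ["Bla Bla"],                              # joined by both
    [None, "Bla bla"],                        # joined by both
    [{"CPV_CODE": {"CODE": 79540000}}],       # skipped by both
    [[{"CPV_CODE": {"CODE": 79550000}}]],     # skipped by both
    [1, 2],                                   # the options disagree here
    "Bla Bla",                                # not a list: skipped by both
]
comparison = [(should_join_all(c), should_join_not_any(c)) for c in cases]
for case, result in zip(cases, comparison):
    print(repr(case), result)
```

If items of other types (numbers, for instance) can appear, option 1 is the more conservative choice, because it only joins what it positively recognizes.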

If you want a more general version of condition #2, you can use container abstract base classes:

import collections.abc as abc

def shouldJoin(items):
    return isinstance(items, list) and not any(isinstance(item, (abc.Mapping, abc.MutableSequence)) for item in items)
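
As a rough illustration of what the ABC-based version buys you, the sketch below feeds it containers that are not dict or list instances. MappingProxyType and deque are standard-library types chosen here only as examples; they do not come from the question's data.

```python
import collections.abc as abc
from collections import deque
from types import MappingProxyType

def shouldJoin(items):
    return isinstance(items, list) and not any(
        isinstance(item, (abc.Mapping, abc.MutableSequence)) for item in items
    )

read_only = MappingProxyType({"CODE": 79540000})  # a Mapping that is not a dict
queue = deque(["a", "b"])                         # a MutableSequence that is not a list

print(shouldJoin(["Bla Bla", None]))  # True
print(shouldJoin([read_only]))        # False
print(shouldJoin([queue]))            # False
print(shouldJoin([("a", "b")]))       # True: tuples are not MutableSequences
```

The last line shows the limitation of this approach: a list of tuples is still reported as joinable, so tuples would need an explicit check if they can occur.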

Both str and list share many abstract base classes; MutableSequence is the one that is unique to list, so that is what's used in the sample. To see exactly which ABCs each concrete type descends from, you can play around with the following:

import collections.abc as abc
ABCMeta = type(abc.Sequence)
abcs = {name: val for (name, val) in abc.__dict__.items() if isinstance(val, ABCMeta)}

def abcsOf(t):
    return {name for (name, kls) in abcs.items() if issubclass(t, kls)}

# examine ABCs
abcsOf(str)
abcsOf(list)
# which ABCs does list descend from, that str doesn't?
abcsOf(list) - abcsOf(str)
# result: {'MutableSequence'}
abcsOf(tuple) - abcsOf(str)
# result: set() (the empty set)

Note that it's not possible to distinguish strs from tuples using just ABCs.

Other Notes

The expression any(isinstance(obj, t) for t in (list, tuple)) can be simplified to isinstance(obj, (list, tuple)).

Dict Loop Bug

In the first for loop of recursive_iter, the expressions "P" in obj and obj["P"] are loop-invariant: they don't depend on the current item. In general, and at the very least, this means there's an opportunity for loop optimization. In this case, however, the branch tests an item other than the current one, which indicates a bug: the join happens on the first iteration no matter which value is being visited, so that first value is never recursed into. How it should be fixed depends on whether or not the joined string should be yielded. If so, the test & join can be moved outside the loop, and the loop will then yield the modified value of "P":

    # ...
        if "P" in obj and shouldJoin(obj["P"]):
            obj["P"] = " ".join([str(item) for item in obj["P"]])
        for value in obj.values():
            yield from recursive_iter(value)
    # ...
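
Put together, a complete version of this variant might look like the sketch below. It assumes option 1 for shouldJoin and also applies the isinstance(obj, (list, tuple)) simplification; the doc dictionary is invented sample data modelled on the question.

```python
def shouldJoin(items):
    return isinstance(items, list) and all(
        item is None or isinstance(item, str) for item in items
    )

def recursive_iter(obj):
    if isinstance(obj, dict):
        # Test and join once per dict, before visiting its values.
        if "P" in obj and shouldJoin(obj["P"]):
            obj["P"] = " ".join(str(item) for item in obj["P"])
        for value in obj.values():
            yield from recursive_iter(value)
    elif isinstance(obj, (list, tuple)):
        for item in obj:
            yield from recursive_iter(item)
    else:
        yield obj

doc = {
    "TITLE": {"P": ["Bla", "Bla"]},
    "CODES": {"P": [{"CPV_CODE": {"CODE": 79540000}}]},
}
leaves = list(recursive_iter(doc))
print(leaves)             # ['Bla Bla', 79540000]
print(doc["TITLE"]["P"])  # Bla Bla
```

One thing to watch: str(None) is the string "None", so a list such as [None, "Bla bla"] is joined as "None Bla bla". If None entries should be dropped instead, filter them out before joining.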

If not, there are a few options (note you can use dict.items() to get the keys & their values simultaneously):

Move the test & join outside the loop (as in the case where the joined "P" is yielded), but skip the modified value for "P" within the loop:

# ...
if "P" in obj and shouldJoin(obj["P"]):
    obj["P"] = " ".join([str(item) for item in obj["P"]])
for (key, value) in obj.items():
    if not ("P" == key and isinstance(value, str)):
        yield from recursive_iter(value)

Move the test & join outside the loop (as in the case where the joined "P" is yielded), but exclude "P" from the loop:

# ...
values = obj.values()
if "P" in obj and shouldJoin(obj["P"]):
    obj["P"] = " ".join([str(item) for item in obj["P"]])
    values = (value for (key, value) in obj.items() if "P" != key)
for value in values:
    yield from recursive_iter(value)

Keep the test & join in the loop, but test the current key. In general, you need to be careful about modifying objects while looping over them, as that may invalidate or interfere with iterators. In this particular case, dict.items returns a view object, so assigning a new value to an existing key shouldn't cause problems (though adding or removing keys during iteration will cause a runtime error).
```
# ...
for (key, value) in obj.items():
    if "P" == key and shouldJoin(value):
        obj["P"] = " ".join([str(item) for item in value])
    else:
        yield from recursive_iter(value)
```
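
The difference between reassigning a value and changing the set of keys can be checked with a short experiment. This is a sketch of CPython's behavior with a made-up dictionary d; it is not taken from the question's data.

```python
d = {"P": ["a", "b"], "Q": "x"}

# Reassigning the value of an existing key while iterating is fine.
for key, value in d.items():
    if key == "P":
        d[key] = " ".join(value)
after_update = dict(d)
print(after_update)  # {'P': 'a b', 'Q': 'x'}

# Adding a key changes the dict's size, which the iterator detects.
error = None
try:
    for key in d:
        d[key + "_copy"] = d[key]
except RuntimeError as exc:
    error = exc
print(type(error).__name__)  # RuntimeError
```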