Extract value from anomalous dictionary

I have a set of strings formatted as BBB below, and I need to extract the value corresponding to the text key (in the example below it's "My life is amazing").

BBB = str({"id": "18976", "episode_done": False, "text": "My life is amazing", 
    "text_candidates": ["My life is amazing", "I am worried about global warming"], 
    "metrics": {"clen": AverageMetric(12), "ctrunc": AverageMetric(0), 
    "ctrunclen": AverageMetric(0)}})

I tried converting BBB back into a dictionary using json.loads and ast.literal_eval, but I get error messages in both cases. I suspect this is because the metrics key has a dictionary as its value.

How would you suggest solving this? Thanks.

CodePudding user response：

Having a dictionary as a value is not the problem; that is just a nested dictionary, which is perfectly fine.

I'm not sure what your initial data (and its type) is, but here's a demo using your dictionary. Suppose you have a dictionary:
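To see that nesting is not the culprit, here is a small sketch with two made-up input strings: one with a plain nested dictionary, and one where a value is a call such as AverageMetric(12). Only the second fails, because a function call is not a Python literal.

```python
import ast

# A nested dictionary made only of literals parses without trouble.
nested = '{"text": "My life is amazing", "metrics": {"clen": 12}}'
parsed = ast.literal_eval(nested)
print(parsed["text"])

# A call expression is not a literal, so literal_eval rejects it.
with_call = '{"text": "My life is amazing", "metrics": {"clen": AverageMetric(12)}}'
try:
    ast.literal_eval(with_call)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)
```

json.loads rejects the same string for a similar reason: AverageMetric(12) is not valid JSON, and neither is False.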

BBB_dict = {
    "id": "18976",
    "episode_done": False,
    "text": "My life is amazing", 
    "text_candidates": ["My life is amazing", "I am worried about global warming"], 
    "metrics": {
        "clen": AverageMetric(12),
        "ctrunc": AverageMetric(0), 
        "ctrunclen": AverageMetric(0)
    }
}

Note that calling str(BBB_dict) does not create a JSON string; it produces the Python repr of the dictionary (False instead of false, for example). To convert such a dictionary to a JSON string, you could do something like:

BBB = json.dumps(BBB_dict)

But this would probably raise the following exception:

TypeError: Object of type AverageMetric is not JSON serializable

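That's because the json module does not know which attributes of your AverageMetric class to use when creating JSON from it. So you have to tell it, with a function like the one below (x, y and z stand in for whatever attributes your class actually has):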

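def serialize(obj):
    # Called by json.dumps for every object it cannot serialize itself.
    if isinstance(obj, AverageMetric):
        return {
            'x': obj.x,
            'y': obj.y,
            'z': obj.z
        }

    # Fail loudly for unknown types instead of silently emitting {}.
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')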

This function specifies which fields to use when creating the JSON (i.e. when serializing an AverageMetric object), so you can create your JSON string as follows:

BBB = json.dumps(BBB_dict, default=serialize)

Which would result in the following:

'{"id": "18976", "episode_done": false, "text": "My life is amazing", "text_candidates": ["My life is amazing", "I am worried about global warming"], "metrics": {"clen": {"x": 12, "y": 1, "z": "z"}, "ctrunc": {"x": 0, "y": 1, "z": "z"}, "ctrunclen": {"x": 0, "y": 1, "z": "z"}}}'
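Once the string is valid JSON, getting the text value back is a plain json.loads call. The sketch below runs end to end, but note that the AverageMetric class in it is a hypothetical stand-in with a single value attribute, since the real class is not shown in the question.

```python
import json


class AverageMetric:
    # Hypothetical stand-in: the real class and its attributes are not shown.
    def __init__(self, value):
        self.value = value


def serialize(obj):
    # Tell json.dumps how to turn an AverageMetric into plain JSON data.
    if isinstance(obj, AverageMetric):
        return {"value": obj.value}
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


BBB_dict = {
    "id": "18976",
    "episode_done": False,
    "text": "My life is amazing",
    "text_candidates": ["My life is amazing", "I am worried about global warming"],
    "metrics": {
        "clen": AverageMetric(12),
        "ctrunc": AverageMetric(0),
        "ctrunclen": AverageMetric(0),
    },
}

BBB = json.dumps(BBB_dict, default=serialize)

# The JSON string round-trips, so the value can be read by key.
text = json.loads(BBB)["text"]
print(text)
```

This only helps if you control how the string is produced. If you already have the original dictionary in hand, BBB_dict["text"] is of course enough.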

CodePudding user response：

You could adapt the source of ast.literal_eval() to something that parses function calls (and other non-literals), but into strings:

import ast

BBB = """
{"id": "18976", "episode_done": False, "text": "My life is amazing", 
    "text_candidates": ["My life is amazing", "I am worried about global warming"], 
    "metrics": {"clen": AverageMetric(12), "ctrunc": AverageMetric(0), 
    "ctrunclen": AverageMetric(0)}}
""".strip()


def literal_eval_with_function_calls(source):
    # Adapted from `ast.literal_eval`
    def _convert(node):
        if isinstance(node, list):
            return [_convert(arg) for arg in node]
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.Tuple):
            return tuple(map(_convert, node.elts))
        if isinstance(node, ast.List):
            return list(map(_convert, node.elts))
        if isinstance(node, ast.Set):
            return set(map(_convert, node.elts))
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and node.func.id == 'set' and node.args == node.keywords == []:
            return set()
        if isinstance(node, ast.Dict):
            return dict(zip(map(_convert, node.keys), map(_convert, node.values)))
        if isinstance(node, ast.Expression):
            return _convert(node.body)
        return {
            f'${node.__class__.__name__}': ast.get_source_segment(source, node),
        }

    return _convert(ast.parse(source, mode='eval'))


print(literal_eval_with_function_calls(BBB))

This outputs

{'episode_done': False,
 'id': '18976',
 'metrics': {'clen': {'$Call': 'AverageMetric(12)'},
             'ctrunc': {'$Call': 'AverageMetric(0)'},
             'ctrunclen': {'$Call': 'AverageMetric(0)'}},
 'text': 'My life is amazing',
 'text_candidates': ['My life is amazing', 'I am worried about global warming']}

However, it would be better to have the data in a parseable format to begin with...

CodePudding user response：

You can use a regex:

>>> import re
>>> re.findall(r'(?<="text": )"(.*?)"', BBB)[0]
'My life is amazing'
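One caveat with the regex approach: the output of str() on a dictionary normally uses single quotes, so a pattern that expects double quotes may not match. A minimal sketch that accepts either quote style follows; extract_text is a hypothetical helper name, and it assumes the text value contains no escaped quotes.

```python
import re


def extract_text(source):
    # Match "text": "..." or 'text': '...'. The non-greedy group stops at
    # the first closing quote that matches the value's opening quote.
    match = re.search(r"""(['"])text\1:\s*(['"])(.*?)\2""", source)
    return match.group(3) if match else None


double_quoted = '{"id": "18976", "text": "My life is amazing", "text_candidates": ["My life is amazing"]}'
single_quoted = "{'id': '18976', 'text': 'My life is amazing', 'text_candidates': ['My life is amazing']}"

print(extract_text(double_quoted))
print(extract_text(single_quoted))
```

Because the key's closing quote must follow text immediately, the pattern does not match the text_candidates key.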