I'm trying to convert a Python dictionary of the following form:
{
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
}
to a YAML file of the following form (note the pipes preceding the value associated with each examples key):
version: "3.1"
nlu:
- intent: greet
  examples: |
    - hi
    - hello
    - howdy
- intent: goodbye
  examples: |
    - goodbye
    - bye
    - see you later
Except for needing the pipes (because of Rasa's training data format specs), I'm familiar with how to accomplish this task using yaml.dump().
What's the most straightforward way to obtain the format I'm after?
EDIT: Converting the value of each examples key to a string first yields a YAML file which is not at all reader-friendly, especially given that I have many intents comprising many hundreds of total example utterances.
version: '3.1'
nlu:
- intent: greet
  examples: " - hi\n - hello\n - howdy\n"
- intent: goodbye
  examples: " - goodbye\n - bye\n - see you later\n"
I understand that this multi-line format is what the pipe symbol accomplishes, but I'd like to convert it to something more palatable. Is that possible?
CodePudding user response:
You are asking for the examples value to be represented in your YAML output as a multiline string using the literal block style indicator (|).
In your Python data, examples is a list of strings, not a multiline string:
{
    "intent": "greet",
    "examples": ["hi", "hello", "howdy"]
},
Of course a Python list will be represented as a YAML list.
If you want it rendered as a block literal value, you need to transform the Python value into a multi-line string ("examples": "- hi\n- hello\n- howdy"), and then you need to configure the yaml module to output multi-line strings using the literal block style (|).
Something like this:
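import yaml

data = {
    "version": "3.1",
    "nlu": [
        {
            "intent": "greet",
            "examples": ["hi", "hello", "howdy"]
        },
        {
            "intent": "goodbye",
            "examples": ["goodbye", "bye", "see you later"]
        }
    ]
}


def quoted_presenter(dumper, data):
    # Multi-line strings use the literal block style; all other strings are double-quoted.
    if '\n' in data:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='|')
    else:
        return dumper.represent_scalar('tag:yaml.org,2002:str', data, style='"')


# Registers the representer on yaml.Dumper, which yaml.dump() uses by default.
yaml.add_representer(str, quoted_presenter)

for item in data['nlu']:
    # safe_dump() turns the list into "- hi\n- hello\n- howdy\n"; stripping the
    # final newline gives the "|-" header (keep the newline to get a plain "|").
    item['examples'] = yaml.safe_dump(item['examples']).rstrip('\n')

print(yaml.dump(data))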
This will output:
"nlu":
- "examples": |-
    - hi
    - hello
    - howdy
  "intent": "greet"
- "examples": |-
    - goodbye
    - bye
    - see you later
  "intent": "goodbye"
"version": "3.1"
Yes, this quotes everything (keys as well as values), because the representer is registered for every str and its else branch forces double quotes. Without the custom representer, we would get instead:
nlu:
- examples: '- hi

    - hello

    - howdy'
  intent: greet
- examples: '- goodbye

    - bye

    - see you later'
  intent: goodbye
version: '3.1'
That's semantically identical: both documents load to the same data, just with different formatting.
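If the double quotes are unwanted, one possible refinement is to register the representer for a dedicated str subclass, so that only the strings you mark are written in literal style and everything else keeps PyYAML's default formatting. This is a minimal sketch rather than a definitive solution: LiteralStr and represent_literal_str are hypothetical names introduced here, and sort_keys=False assumes PyYAML 5.1 or later.

```python
import yaml


class LiteralStr(str):
    """Hypothetical marker type: strings to be dumped in literal (|) style."""


def represent_literal_str(dumper, value):
    # Only LiteralStr values reach this function; plain str keeps PyYAML's defaults.
    return dumper.represent_scalar("tag:yaml.org,2002:str", value, style="|")


# Registers on yaml.Dumper, the dumper yaml.dump() uses by default.
yaml.add_representer(LiteralStr, represent_literal_str)

data = {
    "version": "3.1",
    "nlu": [
        {"intent": "greet", "examples": ["hi", "hello", "howdy"]},
        {"intent": "goodbye", "examples": ["goodbye", "bye", "see you later"]},
    ],
}

for item in data["nlu"]:
    # Build "- hi\n- hello\n- howdy\n"; the trailing newline yields a plain "|" header.
    block = "".join(f"- {example}\n" for example in item["examples"])
    item["examples"] = LiteralStr(block)

# sort_keys=False (assumed PyYAML 5.1+) keeps the dict's insertion order.
output = yaml.dump(data, sort_keys=False)
print(output)
```

With this approach the keys come out unquoted and in their original order, and each examples value is written as an "examples: |" block. The version value is still single-quoted, because an unquoted 3.1 would be read back as a float.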
It's possible that ruamel.yaml provides more control over the output format.