How to split on many different delimiters when assigning to dictionary

To practice and become more fluent with dictionaries, I am trying to write a program that reads the chemical composition of the lunar atmosphere and assigns each element and its estimated composition as a key-value pair, like this: "NEON 20":40000

The data file looks like this

Estimated Composition (night, particles per cubic cm):
Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000
Argon 40 - 30,000 ; Neon 22 - 5,000 ; Argon 36 - 2,000
Methane - 1000 ; Ammonia - 1000 ; Carbon Dioxide - 1000

And my code so far looks like this:

def read_data(filename):
    dicti = {}

    with open(filename,"r") as infile:
        infile.readline()

        for line in infile:
            words = line.split(";")
            dicti[words[0]] = f"{words[1]}"

    for key in dicti:
        print(key, dicti[key])

read_data("atm_moon.txt")

My questions are:

How do I split on both "-" and ";"?
How do I assign the elements and their estimated atmospheric composition as a key-value pair in a simple and elegant way from this data file?
How do I make the element names all upper case?

Is there anyone who is kind enough to help a rookie out? All help is welcomed.

CodePudding user response：

What you have here is a list of lines. Each line can contain multiple items, separated by semicolons. Each item (or record) consists of an element name, a hyphen, and the particle count.

You don't need to split on different delimiters at the same time here; instead, you can split out the individual items using the semicolons, and then split each item into the key/value pair you need for your dictionary based on the hyphen.

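for line in infile:
    # strip() removes the trailing newline, which would otherwise
    # end up in the last value of every line
    for item in line.strip().split(" ; "):
        key, value = item.split(" - ", 1)
        dicti[key.upper()] = value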

Note that I'm including the spaces around your delimiters, so they are removed when you split. Otherwise those will end up in your dictionary. An alternative would be to use strip(); that way it works properly even if there are more (or no) spaces there.

for line in infile:
    for item in line.split(";"):
        key, value = item.split("-", 1)
        dicti[key.strip().upper()] = value.strip()

However, if there's any chance that one of your records might have a semicolon or a hyphen in it that's not meant to be a separator, I'd leave the spaces in the .split() call.
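To see why the spaces matter, here is a small sketch. The record "Helium-4 - 40,000" is a made-up variant (the real file writes "Helium 4"), used only to show the difference, and the variable names are illustrative.

```python
# Hypothetical record: the hyphen inside the name is NOT meant as a separator.
item = "Helium-4 - 40,000"

# Splitting on the bare hyphen cuts the element name in half.
bare_key, bare_value = item.split("-", 1)
print(repr(bare_key), repr(bare_value))      # 'Helium' '4 - 40,000'

# Splitting on " - " (with spaces) only matches the real separator.
spaced_key, spaced_value = item.split(" - ", 1)
print(repr(spaced_key), repr(spaced_value))  # 'Helium-4' '40,000'
```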

Now I'm going to go a step further and assume that you will want those values as actual numbers, not just strings. To do this we'll remove the commas and convert them to integers.

for line in infile:
    for item in line.split(";"):
        key, value = item.split("-", 1)
        dicti[key.strip().upper()] = int(value.strip().replace(",", ""))

If there were any values with fractional parts (decimal points), you could use float() in place of int() to convert those to floating-point numbers.
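Putting the pieces together, a complete version of read_data might look roughly like the sketch below. It keeps the header-skipping readline() from the question. Returning the dictionary instead of printing it, the name composition, the blank-line check, and the temporary file used for the demo are my own assumptions, not part of the original code.

```python
import os
import tempfile

# Sample data copied from the question.
SAMPLE = """Estimated Composition (night, particles per cubic cm):
Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000
Argon 40 - 30,000 ; Neon 22 - 5,000 ; Argon 36 - 2,000
Methane - 1000 ; Ammonia - 1000 ; Carbon Dioxide - 1000
"""


def read_data(filename):
    """Return a dict mapping upper-cased element names to integer counts."""
    composition = {}
    with open(filename, "r") as infile:
        infile.readline()  # skip the header line
        for line in infile:
            line = line.strip()
            if not line:
                continue  # assumption: tolerate blank lines in the file
            for item in line.split(";"):
                key, value = item.split("-", 1)
                composition[key.strip().upper()] = int(value.strip().replace(",", ""))
    return composition


# Demo: write the sample to a temporary file, parse it, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(SAMPLE)
    path = tmp.name
try:
    result = read_data(path)
finally:
    os.remove(path)

for name, count in result.items():
    print(name, count)
```

With the real file, the call would simply be read_data("atm_moon.txt").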

CodePudding user response：

I feel like it's easier to use the Python REPL to test this out.

$ python

>>> string = """\
Estimated Composition (night, particles per cubic cm):
Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000
Argon 40 - 30,000 ; Neon 22 - 5,000 ; Argon 36 - 2,000
Methane - 1000 ; Ammonia - 1000 ; Carbon Dioxide - 1000\
"""
>>> lines = string.split('\n')
>>> lines
['Estimated Composition (night, particles per cubic cm):', 'Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000', 'Argon 40 - 30,000 ; Neon 22 - 5,000 ; Argon 36 - 2,000', 'Methane - 1000 ; Ammonia - 1000 ; Carbon Dioxide - 1000']
>>> lines[1:]
['Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000', 'Argon 40 - 30,000 ; Neon 22 - 5,000 ; Argon 36 - 2,000', 'Methane - 1000 ; Ammonia - 1000 ; Carbon Dioxide - 1000']
>>> [line.split(' ; ') for line in lines[1:]]
[['Helium 4 - 40,000', 'Neon 20 - 40,000', 'Hydrogen - 35,000'], ['Argon 40 - 30,000', 'Neon 22 - 5,000', 'Argon 36 - 2,000'], ['Methane - 1000', 'Ammonia - 1000', 'Carbon Dioxide - 1000']]
>>> [item.split(' - ') for line in lines[1:] for item in line.split(' ; ')]
[['Helium 4', '40,000'], ['Neon 20', '40,000'], ['Hydrogen', '35,000'], ['Argon 40', '30,000'], ['Neon 22', '5,000'], ['Argon 36', '2,000'], ['Methane', '1000'], ['Ammonia', '1000'], ['Carbon Dioxide', '1000']]

Finally, creating a dictionary object with the desired mapping:

>>> dict([item.split(' - ') for line in lines[1:] for item in line.split(' ; ')])
{'Helium 4': '40,000', 'Neon 20': '40,000', 'Hydrogen': '35,000', 'Argon 40': '30,000', 'Neon 22': '5,000', 'Argon 36': '2,000', 'Methane': '1000', 'Ammonia': '1000', 'Carbon Dioxide': '1000'}

To uppercase all the keys and convert all the values to int, you can pass each key-value pair through a small helper function:

>>> transform = lambda x, y: (x.upper(), int(y.replace(',', '')))
>>> dict([transform(*name_line.split(' - ')) for line in lines[1:] for name_line in line.split(' ; ')])
{'HELIUM 4': 40000, 'NEON 20': 40000, 'HYDROGEN': 35000, 'ARGON 40': 30000, 'NEON 22': 5000, 'ARGON 36': 2000, 'METHANE': 1000, 'AMMONIA': 1000, 'CARBON DIOXIDE': 1000}

CodePudding user response：

To split on multiple delimiters, you can use a regular expression with re.split(); see "Split Strings into words with multiple word boundary delimiters".
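As a rough sketch of the regex approach, assuming the delimiters are a hyphen or semicolon with optional whitespace around them (the pattern is my own guess at what fits this data, not something from the linked question):

```python
import re

line = "Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000"

# Split wherever a "-" or ";" appears, swallowing any surrounding whitespace.
# The comma is not in the character class, so "40,000" stays in one piece.
parts = re.split(r"\s*[-;]\s*", line.strip())
print(parts)
# ['Helium 4', '40,000', 'Neon 20', '40,000', 'Hydrogen', '35,000']
```

The result is a flat list that alternates between element names and counts.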

Or you can first replace the delimiters so that only one type remains, and then use .split()
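A minimal sketch of the replace-then-split idea, assuming every delimiter in the data has a single space on each side as in the sample file:

```python
line = "Helium 4 - 40,000 ; Neon 20 - 40,000 ; Hydrogen - 35,000"

# Turn every " - " into " ; " so only one kind of delimiter is left,
# then split once on that delimiter.
parts = line.replace(" - ", " ; ").split(" ; ")
print(parts)
# ['Helium 4', '40,000', 'Neon 20', '40,000', 'Hydrogen', '35,000']
```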

I'm not sure how you want to make a dictionary out of that, but you can always just loop through the newly created list and assign the items one by one (or use a generator). If you include what the dictionary should look like, I can provide an example.
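Going by the "NEON 20":40000 format mentioned in the question, one way such a loop might look is sketched here. It assumes the flat list alternates name, count, name, count; the variable names are illustrative.

```python
# Flat list, as you would get from splitting one line on both delimiters.
parts = ["Helium 4", "40,000", "Neon 20", "40,000", "Hydrogen", "35,000"]

dicti = {}
# Walk the list two items at a time: name at index i, count at index i + 1.
for i in range(0, len(parts) - 1, 2):
    name = parts[i].upper()
    count = int(parts[i + 1].replace(",", ""))  # drop thousands separators
    dicti[name] = count

print(dicti)
# {'HELIUM 4': 40000, 'NEON 20': 40000, 'HYDROGEN': 35000}
```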

To convert a string to uppercase, use:

"abc".upper()
# ABC