Regex ta captures decimal number with comma-CodePudding

I have the a part of a list as following:

energy = [... , 'Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewables 9,65 Solid Waste 2,80',
 'Fossil Cogeneration 2,17 ',
 'Natural Gas 56,35 Coal 9,73 Nuclear 0,64 ', ...]

where the rest of the list has nothing to do with this part I need o turn it into a dataframe as such:

                        key   value
0                      Wind    1.00          
1                     Water    7.36         
2    Renewable Cogeneration    9.71          
3          Other Renewables    9.65      
4               Solid Waste    2.80      
5       Fossil Cogeneration    2.17
6               Natural Gas   56.35
7                      Coal    9.73
8                   Nuclear    0.64

I cannot directly concatenate the strings because this is a programme that can generates these strings in many formats. Can be 1 to 4 strings in the list and there is no defined number of keys in one string, they can be kinda randomly assigned to one string.

I have the following piece of code to generate another string that will later be used to create the dataframe:

import re

table_energy = []

to_find = ['Wind', 'Water', 'Renewable Cogeneration', 'Other Renewable', 'Solid Waste', 'Fossil Cogeneration', 'Natural Gas', 'Coal', 'Nuclear']
chars_to_find = ['Wind', 'Water', 'Renewable Cogeneration', 'Renewable', 'Waste', 'Fossil Cogeneration', 'Gas', 'Coal', 'Nuclear']
key = ['Wind', 'Water', 'Ren_Cog', 'Other_Ren', 'Sol_Waste', 'Fossil_Cog', 'Nat_Gas', 'Coal', 'Nuclear']

for i in range(len(energy))

    for j in range(len(to_find)):

        if to_find[j] in energy[i]:

           aux_row= energy[i] ' '
           expression = chars_to_find[j] ' (.*) '
           result = re.search(expression, aux_row)
           print(result.group(1))
           table_energy.append('key_' key[j] ' ' result.group(1))

This is the output:

['key_Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewable 9,65 Solid Waste 2,80',  
'key_Water 7,36 Renewable Cogeneration 9,71 Other Renewable 9,65 Solid Waste 2,80',  
'key_Ren_Cog 9,71 Other Renewable 9,65 Solid Waste 2,80',  
'key_Other_Ren 9,65 Solid Waste 2,80',  
'key_Sol_Waste 2,80',  
'key_Fossil_Cog 2,17',  
'key_Nat_Gas 56,35 Coal 9,73 Nuclear 0,64',  
'key_Coal 9,73 Nuclear 0,64',  
'key_Nuclear 0,64']

I needed that each element would consist only of key and value. I understand that this is a problem with the regex expression but I tried many ways and it does not work.

Any help is appreciated, thanks :)

CodePudding user response：

import re
import pandas as pd

data = []
for i in energy:
    # Splits data based on the space after the value.
    key_value_pair = re.split(r'(?<=\d)\s', i)

    for key_value in key_value_pair:
        # Splits key and value based on the space before the value.
        # Ads list of key and value to the result list.
        data.append(re.split(r'(?:\s)(?=\d)', key_value))

# Drops None values caused by the space in the end of the examples 
# such as 'Fossil Cogeneration 2,17 '. 
pd.DataFrame(data, columns=['key', 'value']).dropna()

	key	value
0	Wind	1,00
1	Water	7,36
2	Renewable Cogeneration	9,71
3	Other Renewables	9,65
4	Solid Waste	2,80
5	Fossil Cogeneration	2,17
7	Natural Gas	56,35
8	Coal	9,73
9	Nuclear	0,64

CodePudding user response：

You can use a pattern with 2 capture group and re.findall that will return a list of tuples with 2 values (the 2 capture group values)

Then you can append the tuples to table_energy which you can then give to the Dataframe.

(\S.*?)\s (\d ,\d )

The pattern matches:

(\S.*?) Capture group 1, match a single non whitespace chars followed by any character, as few as possible
\s Match 1 whitspace cars
(\d ,\d ) Capture group 2, match 1 digits , and 1 digits

Regex demo | Python demo

Example

import re
import pandas as pd

energy = ['Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewables 9,65 Solid Waste 2,80',
          'Fossil Cogeneration 2,17 ',
          'Natural Gas 56,35 Coal 9,73 Nuclear 0,64']
table_energy = []
pattern = r"(\S.*?)\s (\d ,\d )"
for i in energy:
    for tup in re.findall(pattern, i):
        table_energy.append(tup)

res = pd.DataFrame(table_energy, columns=['key', 'value'])

print(res)

Output

                      key  value
0                    Wind   1,00
1                   Water   7,36
2  Renewable Cogeneration   9,71
3        Other Renewables   9,65
4             Solid Waste   2,80
5     Fossil Cogeneration   2,17
6             Natural Gas  56,35
7                    Coal   9,73
8                 Nuclear   0,64