I have the a part of a list as following:
energy = [... , 'Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewables 9,65 Solid Waste 2,80',
'Fossil Cogeneration 2,17 ',
'Natural Gas 56,35 Coal 9,73 Nuclear 0,64 ', ...]
where the rest of the list has nothing to do with this part I need o turn it into a dataframe as such:
key value
0 Wind 1.00
1 Water 7.36
2 Renewable Cogeneration 9.71
3 Other Renewables 9.65
4 Solid Waste 2.80
5 Fossil Cogeneration 2.17
6 Natural Gas 56.35
7 Coal 9.73
8 Nuclear 0.64
I cannot directly concatenate the strings because this is a programme that can generates these strings in many formats. Can be 1 to 4 strings in the list and there is no defined number of keys in one string, they can be kinda randomly assigned to one string.
I have the following piece of code to generate another string that will later be used to create the dataframe:
import re
table_energy = []
to_find = ['Wind', 'Water', 'Renewable Cogeneration', 'Other Renewable', 'Solid Waste', 'Fossil Cogeneration', 'Natural Gas', 'Coal', 'Nuclear']
chars_to_find = ['Wind', 'Water', 'Renewable Cogeneration', 'Renewable', 'Waste', 'Fossil Cogeneration', 'Gas', 'Coal', 'Nuclear']
key = ['Wind', 'Water', 'Ren_Cog', 'Other_Ren', 'Sol_Waste', 'Fossil_Cog', 'Nat_Gas', 'Coal', 'Nuclear']
for i in range(len(energy))
for j in range(len(to_find)):
if to_find[j] in energy[i]:
aux_row= energy[i] ' '
expression = chars_to_find[j] ' (.*) '
result = re.search(expression, aux_row)
print(result.group(1))
table_energy.append('key_' key[j] ' ' result.group(1))
This is the output:
['key_Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewable 9,65 Solid Waste 2,80',
'key_Water 7,36 Renewable Cogeneration 9,71 Other Renewable 9,65 Solid Waste 2,80',
'key_Ren_Cog 9,71 Other Renewable 9,65 Solid Waste 2,80',
'key_Other_Ren 9,65 Solid Waste 2,80',
'key_Sol_Waste 2,80',
'key_Fossil_Cog 2,17',
'key_Nat_Gas 56,35 Coal 9,73 Nuclear 0,64',
'key_Coal 9,73 Nuclear 0,64',
'key_Nuclear 0,64']
I needed that each element would consist only of key and value. I understand that this is a problem with the regex expression but I tried many ways and it does not work.
Any help is appreciated, thanks :)
CodePudding user response:
import re
import pandas as pd
data = []
for i in energy:
# Splits data based on the space after the value.
key_value_pair = re.split(r'(?<=\d)\s', i)
for key_value in key_value_pair:
# Splits key and value based on the space before the value.
# Ads list of key and value to the result list.
data.append(re.split(r'(?:\s)(?=\d)', key_value))
# Drops None values caused by the space in the end of the examples
# such as 'Fossil Cogeneration 2,17 '.
pd.DataFrame(data, columns=['key', 'value']).dropna()
key | value | |
---|---|---|
0 | Wind | 1,00 |
1 | Water | 7,36 |
2 | Renewable Cogeneration | 9,71 |
3 | Other Renewables | 9,65 |
4 | Solid Waste | 2,80 |
5 | Fossil Cogeneration | 2,17 |
7 | Natural Gas | 56,35 |
8 | Coal | 9,73 |
9 | Nuclear | 0,64 |
CodePudding user response:
You can use a pattern with 2 capture group and re.findall that will return a list of tuples with 2 values (the 2 capture group values)
Then you can append the tuples to table_energy
which you can then give to the Dataframe.
(\S.*?)\s (\d ,\d )
The pattern matches:
(\S.*?)
Capture group 1, match a single non whitespace chars followed by any character, as few as possible\s
Match 1 whitspace cars(\d ,\d )
Capture group 2, match 1 digits,
and 1 digits
Example
import re
import pandas as pd
energy = ['Wind 1,00 Water 7,36 Renewable Cogeneration 9,71 Other Renewables 9,65 Solid Waste 2,80',
'Fossil Cogeneration 2,17 ',
'Natural Gas 56,35 Coal 9,73 Nuclear 0,64']
table_energy = []
pattern = r"(\S.*?)\s (\d ,\d )"
for i in energy:
for tup in re.findall(pattern, i):
table_energy.append(tup)
res = pd.DataFrame(table_energy, columns=['key', 'value'])
print(res)
Output
key value
0 Wind 1,00
1 Water 7,36
2 Renewable Cogeneration 9,71
3 Other Renewables 9,65
4 Solid Waste 2,80
5 Fossil Cogeneration 2,17
6 Natural Gas 56,35
7 Coal 9,73
8 Nuclear 0,64