I have a list like this:
list1=['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']
What I want is a dataframe like this:
column1  column2   column3
weather  Monday    severe
weather  Friday
xxx      Sunday    calm
xxx      Sunday    high severe
yyy      Friday    rainy
yyy      Saturday
First, I tried the following on the list:
newlist2 = [word for line in list1 for word in line.split(':')]
newlist2
['weather',
'monday="severe" weather',
'friday, xxx',
'sunday="calm" xxx',
'sunday="high severe", yyy',
'friday="rainy" yyy',
'saturday=']
and
newlist3 = [word for line in newlist2 for word in line.split('=')]
newlist3
['weather',
'monday',
'"severe" weather',
'friday, xxx',
'sunday',
'"calm" xxx',
'sunday',
'"high severe", yyy',
'friday',
'"rainy" yyy',
'saturday',
'']
After that, I convert the list into a dataframe:
df = pd.DataFrame(newlist3)
However, the outcome is not the desired one.
Any ideas on how to reach my desired outcome?
CodePudding user response:
First, I would clean the data, something like:
cleaned = [x.replace('"', '').replace('high severe', 'high_severe') for x in list1]
Then repeat the steps you already did:
newlist2 = [word for line in cleaned for word in line.split(':')]
newlist3 = [word for line in newlist2 for word in line.split('=')]
And add a fourth step:
newlist4 = [word for line in newlist3 for word in line.split(' ')]
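This yields a flat list of tokens. To make these steps more concise, you could look into re.split.
You want this list split into chunks of three, one chunk per row. Note that this assumes every entry has all three parts (name, day, value); an entry with no value shifts every row after it. You could use a function like this: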
def divide_chunks(l, n):
    # yield successive chunks of length n from l
    for i in range(0, len(l), n):
        yield l[i:i + n]
pd.DataFrame(list(divide_chunks(newlist4,3)))
>>>
0 1 2
0 weather monday severe
1 weather friday harmful,
2 weather1 sunday calm
3 weather1 sunday high_severe,
4 weather2 friday rainy
5 weather2 saturday cloudy
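The re.split suggestion above could look something like the following. It is only a minimal sketch that reuses the cleaning step from this answer; the character class [:= ] and the name tokens are my own choices for illustration, not something from the question.

```python
import re

list1 = ['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

# Same cleaning step as above: drop the quotes and protect the space in "high severe".
cleaned = [x.replace('"', '').replace('high severe', 'high_severe') for x in list1]

# One pass instead of three: split on ':', '=' or a single space.
tokens = [word for line in cleaned for word in re.split(r'[:= ]', line)]
print(tokens)
```

This should give the same flat list as the three separate split steps, so the caveat about entries without a value still applies before chunking.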
CodePudding user response:
Here is a way you can do it:
I still don't know whether you only have these few possible inputs for weather or whether it can be anything. If it is only these few, here is an updated version. You need to change xxx and yyy to the actual names in your data.
import re
import pandas as pd

# split the string before every occurrence of 'weather', 'xxx' or 'yyy';
# strip spaces, commas and quotes, and use filter(None, ...) to drop empty elements
lst = list(filter(None, [word.strip(' ,').replace('"','') for line in list1 for word in re.split(r"(?=weather|xxx|yyy)", line)]))

# build 'data' as a list of lists, where each inner list represents a row
data = []
for elem in lst:
    weather, other = elem.split(':')
    if '=' in other:
        day, forecast = other.split('=')
    else:
        day = other
        forecast = ''
    data.append([weather, day, forecast])

df = pd.DataFrame(data, columns=['weather', 'day', 'forecast'])
print(df)
Output:
   weather       day     forecast
0  weather    monday       severe
1  weather    friday
2      xxx    sunday         calm
3      xxx    sunday  high severe
4      yyy    friday        rainy
5      yyy  saturday
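If the names before the colon can be anything, one possible variation is to match each entry directly instead of listing the names in the split pattern. This is only a sketch: the pattern below is my assumption about the format (a word, a colon, a word, then an optional = followed by an optional double-quoted value), and the names ENTRY and rows are made up for illustration.

```python
import re

list1 = ['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

# Hypothetical pattern (an assumption about the data format):
#   group 1: name before the colon
#   group 2: day after the colon
#   group 3: optional double-quoted value after '=' (may be missing)
ENTRY = re.compile(r'(\w+):(\w+)(?:=(?:"([^"]*)")?)?')

# findall returns one (name, day, value) tuple per entry;
# a group that did not take part in the match comes back as ''.
rows = [list(match) for line in list1 for match in ENTRY.findall(line)]

for row in rows:
    print(row)
```

Passing rows to pd.DataFrame(rows, columns=['weather', 'day', 'forecast']) should then give the same frame as above. The trade-off is that names or days containing characters outside \w (for example a hyphen) would need a wider pattern.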