Split list content and transform into dataframe

I have a list like this:

list1=['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

What I want is a dataframe like this:

column1   column2    column3
weather   Monday     severe
weather   Friday     
xxx       Sunday     calm
xxx       Sunday     high severe
yyy       Friday     rainy
yyy       Saturday

First, I tried splitting each string in the list on ':':

newlist2 = [word for line in list1 for word in line.split(':')]
newlist2

['weather',
 'monday="severe" weather',
 'friday, xxx',
 'sunday="calm" xxx',
 'sunday="high severe", yyy',
 'friday="rainy" yyy',
 'saturday=']

and then splitting the result on '=':

newlist3 = [word for line in newlist2 for word in line.split('=')]
newlist3

['weather',
 'monday',
 '"severe" weather',
 'friday, xxx',
 'sunday',
 '"calm" xxx',
 'sunday',
 '"high severe", yyy',
 'friday',
 '"rainy" yyy',
 'saturday',
 '']

After that, I converted the list into a dataframe:

df = pd.DataFrame(newlist3)

However, the outcome is not the desired one.

Any ideas on how to reach my desired outcome?

CodePudding user response：

First, I would clean the data, something like:

cleaned = [x.replace('"', '').replace('high severe', 'high_severe') for x in list1]

Then do those steps you already did:

newlist2 = [word for line in cleaned for word in line.split(':')]
newlist3 = [word for line in newlist2 for word in line.split('=')]

And add a fourth step:

newlist4 = [word for line in newlist3 for word in line.split(' ')]

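This yields one flat list of tokens. To make these steps more concise, you could look into re.split, which can split on several delimiters at once.

You want this list in chunks of three, one value for each column. Note that chunking by three only lines up when every entry has all three fields; an entry such as weather:friday, which has no =value part, contributes only two tokens and shifts every later row. You could use a function like this: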

def divide_chunks(l, n):
    # yield successive chunks of length n from the list l
    for i in range(0, len(l), n):
        yield l[i:i + n]



pd.DataFrame(list(divide_chunks(newlist4,3)))
Output:

          0         1             2
0   weather    monday        severe
1   weather    friday      harmful,
2  weather1    sunday          calm
3  weather1    sunday  high_severe,
4  weather2    friday         rainy
5  weather2  saturday        cloudy
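
As a rough sketch of the re.split idea mentioned above: the three separate splits can probably be collapsed into a single call by using a character class that matches any one of the delimiters. The names three_step and one_step are only illustrative, and the input is the list from the question.

```python
import re

# Input copied from the question
list1 = ['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

# Same cleaning step as in the answer
cleaned = [x.replace('"', '').replace('high severe', 'high_severe') for x in list1]

# Three-step version from the answer: split on ':', then '=', then ' '
step1 = [w for line in cleaned for w in line.split(':')]
step2 = [w for line in step1 for w in line.split('=')]
three_step = [w for line in step2 for w in line.split(' ')]

# One-step version: the character class [:= ] matches any single delimiter
one_step = [w for line in cleaned for w in re.split(r'[:= ]', line)]

# Both approaches produce the same flat list of tokens
print(one_step == three_step)
```

This only shortens the splitting; the flat list still has to be grouped into rows afterwards.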

CodePudding user response：

Here is one way to do it. I don't know whether weather, xxx and yyy are the only possible names in your data or whether they can be anything. If they are the only ones, the version below works. You need to change xxx and yyy to the actual names in your data.

import re
import pandas as pd

# split the string right before every occurrence of 'weather', 'xxx' or 'yyy'
# (zero-width lookahead, needs Python 3.7+); filter(None, ...) drops empty elements
lst = list(filter(None, [word.strip(' ,').replace('"','') for line in list1 for word in re.split(r"(?=weather|xxx|yyy)", line)]))

# build 'data' as a list of lists, where each inner list represents one row
data = []
for elem in lst:
    weather, other = elem.split(':')
    if '=' in other:
        day, forecast = other.split('=')
    else:
        day = other
        forecast = ''
    data.append([weather, day, forecast])

df = pd.DataFrame(data, columns= ['weather', 'day', 'forecast'])
print(df)

Output:

   weather       day     forecast
0  weather    monday       severe
1  weather    friday             
2      xxx    sunday         calm
3      xxx    sunday  high severe
4      yyy    friday        rainy
5      yyy  saturday
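
If the names can be anything, a different technique might be easier than listing them in the lookahead: matching each entry directly with re.findall. The following is only a sketch and rests on assumptions that are not confirmed by the question, namely that names and days consist of word characters and that values, when present, are always wrapped in double quotes. The variable names pattern and rows are illustrative.

```python
import re
import pandas as pd

# Input copied from the question
list1 = ['weather:monday="severe" weather:friday, xxx:sunday="calm" xxx:sunday="high severe", yyy:friday="rainy" yyy:saturday=']

# name:day, optionally followed by ="value".
# Assumption: names and days are word characters, values are double-quoted.
pattern = re.compile(r'(\w+):(\w+)(?:="([^"]*)")?')

# findall returns one (name, day, value) tuple per entry;
# the value is an empty string when the optional part is missing
rows = [match for line in list1 for match in pattern.findall(line)]

df = pd.DataFrame(rows, columns=['weather', 'day', 'forecast'])
print(df)
```

With this approach there is no need to replace 'high severe' beforehand, since the quotes delimit the value.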