I have managed to pull a list from a data source. The list elements are formatted like this (note the first number is NOT the index):
0 cheese 100
1 cheddar cheese 1100
2 gorgonzola 1300
3 smoked cheese 200
etc.
This means when printed, one line contains "0 cheese 100
", with all the spaces.
What I would like to do is parse each entry to divide it into two lists. I don't need the first number. Instead, I want the cheese type and the number after.
For instance:
cheese
cheddar cheese
gorgonzola
smoked cheese
and:
100
1100
1300
200
The ultimate goal is to be able to attribute the two lists to columns in a pd.DataFrame so they can be processed in their own individual way.
Any help is much appreciated.
CodePudding user response:
If the goal is a dataframe, why not just make that rather than the two lists. If you turn your string into a Series, you can us pandas.Series.str.extract()
to split it into the columns you want:
import re
import pandas as pd
s = '''0 cheese 100
1 cheddar cheese 1100
2 gorgonzola 1300
3 smoked cheese 200'''
pd.Series(s.split('\n')).str.extract(r'.*?\s (?P<type>.*?)\s (?P<value>\d )')
This gives a Dataframe:
type value
0 cheese 100
1 cheddar cheese 1100
2 gorgonzola 1300
3 smoked cheese 200
CodePudding user response:
IIUC your strings are elements of a list. You can use re.split
to split where two or more spaces are found:
import re
import pandas as pd
your_list = [
"0 cheese 100",
"1 cheddar cheese 1100",
"2 gorgonzola 1300",
"3 smoked cheese 200",
]
df = pd.DataFrame([re.split(r'\s{2,}', s)[1:] for s in your_list], columns=["type", "value"])
Output:
type value
0 cheese 100
1 cheddar cheese 1100
2 gorgonzola 1300
3 smoked cheese 200
CodePudding user response:
I think something on these lines might work:
import pandas as pd
import re
mylist=['0 cheese 100','1 cheddar cheese 200']
numbers = '[0-9]'
list1=[i.split()[-1] for i in mylist]
list2=[re.sub(numbers, '', i).strip() for i in mylist]
your_df=pd.DataFrame({'name1':list1,'name2':list2})
your_df
CodePudding user response:
May I suggest this simple solution:
lines = [
"1 cheddar cheese 1100 ",
"2 gorgonzola 1300 ",
"3 smoked cheese 200",
]
for line in lines:
words = line.strip().split()
print( ' '.join( words[1:-1]), words[-1])
Result:
cheddar cheese 1100
gorgonzola 1300
smoked cheese 200
CodePudding user response:
You could achieve this by using slicing:
from curses.ascii import isdigit
inList = ['0 cheese 100', '1 cheddar cheese 1100', '2 gorgonzola 1300', '3 smoked cheese 200']
cheese = []
prices = []
for i in inList:
temp = i[:19:-1] #Cuts out first number and all empty spaces until first character and reverses the string
counter = 0
counter2 = 0
for char in temp: #Temp is reversed, meaning the number e.g. '100' for 'cheese' is in front but reversed
if char.isdigit():
counter = 1
else: #If the character is an empty space, we know the number is over
prices.append((temp[:counter])[::-1]) #We know where the number begins (at position 0) and ends (at position counter), we flip it and store it in prices
cheeseWithSpace = (temp[counter:]) #Since we cut out the number, the rest has to be the cheese name with some more spaces in front
for char in cheeseWithSpace:
if char == ' ': #We count how many spaces are in front
counter2 = 1
else: #If we reach something other than an empty space, we know the cheese name begins.
cheese.append(cheeseWithSpace[counter2:][::-1]) #We know where the cheese name begins (at position counter2) cut everything else out, flip it and store it
break
break
print(prices)
print(cheese)
View in-code comments to understand the approach. Basically you flip your strings around using [::-1] to make them easier to process. Then you remove every part one by one.