How to split the text and number here-CodePudding

I have a the below column which is in python dataframe

But, the expected output should be like

Any help in Python please?

CodePudding user response：

You can first extract value from the list with pandas.apply and then use Series.str.extract.

df['words'] = df['words'].apply(lambda x: x[0])
df = df.join(df['words'].str.extract(r'(.*\d) (.*)')
            ).rename(columns = {1:'text', 0:'numbers'}
                    ).drop(columns='words')
print(df)

     numbers             text
0       4,25           I Dose
1        850          ml Dose
2  1: 10 800          ml Dose
3    1 x 390             Dose
4    5 x 200  Schrumpfpachung

CodePudding user response：

It's not clear from your question whether words contains strings with brackets in them or lists of one string each, so here are solutions for both cases.

DataFrame containing strings including square brackets:

pat='\[(.*\d) (.*)\]'
#pat='\[((?: *[^ ] ) ?)  ((?:[a-zA-Z]  *) )\]'
df = df.words.str.extract(pat).rename(columns={0:'numbers',1:'text'})

Input:

df = pd.DataFrame({
'words':['[4,25 l Dose]','[850 ml Dose]','[1 : 10 800 ml Dose]','[5 x 200 Schrumpfpackung]']
})

Output:

      numbers             text
0        4,25           l Dose
1         850          ml Dose
2  1 : 10 800          ml Dose
3     5 x 200  Schrumpfpackung

Explanation:

using the Series.str() accessor and extract(), extract the capture groups from the string in each item as two columns corresponding to numbers and text columns
rename the auto-generated columns labeled 0 and 1 to have the desired labels (numbers and text).

DataFrame containing lists with one string each:

pat='(.*\d) (.*)'
#pat='((?: *[^ ] ) ?)  ((?:[a-zA-Z]  *) )'
df = df.words.str.get(0).str.extract(pat).rename(columns={0:'numbers',1:'text'})

Input:

df = pd.DataFrame({
'words':[['4,25 l Dose'],['850 ml Dose'],['1 : 10 800 ml Dose'],['5 x 200 Schrumpfpackung']]
})

Output

same as above ...

Explanation:

using the Series.str() accessor to manipulate each element of type list (slightly confusing as the accessor is named str() but is more generally applicable to pandas Series whose values are python sequences, including both str and list) and get(), access the first (and only) string in each list in the columns
use extract() to get the capture groups from the string in each item as two columns corresponding to numbers and text columns
rename the auto-generated columns labeled 0 and 1 to have the desired labels (numbers and text).

Note that either the pat used above or the one that is commented out will work for each respective case (string or list). I started with the commented-out pattern, but the pattern in the answer by @Pieter Geelen seems to work as well and is simpler, so I have included it (and a variant of it) above.

CodePudding user response：

you might want to try out Regex With the pattern: (.*\d) (.*) you can already outline that you should have something that ends with a number, that is trailed with some describtion in words.

import re
some_strings = """4,25 l Dose
850 ml Dose
1: 10 800 ml Dose
1 x 390 Dose 
5 x 200 Schrumpfverpackung """.split('\n')

for item in some_strings:
    m = re.search("(.*\d) (.*)", item)
    print(m.group(0))
    print(m.group(1))
    print(m.group(2))

That prints out

4,25 l Dose
4,25
l Dose
850 ml Dose
850
ml Dose
1: 10 800 ml Dose
1: 10 800
ml Dose
1 x 390 Dose 
1 x 390
Dose 
5 x 200 Schrumpfverpackung 
5 x 200
Schrumpfverpackung

EDIT

For a pandas series you need to note that you need the iteritems method and use the second item in the tuple, as the first is the index itself.

import re
some_strings = pd.Series("""4,25 l Dose
850 ml Dose
1: 10 800 ml Dose
1 x 390 Dose 
5 x 200 Schrumpfverpackung """.split('\n'))

for item in some_strings.iteritems():
    m = re.search("(.*\d) (.*)", item[1])
    print("first group:", m.group(0))
    print("second group:", m.group(1))
    print("third group:", m.group(2))