Trying to extract a string value between dots from a string column

Time:12-29

I'm trying to extract a string value from a string column, but without success.

This is the column with some string values:

      Unnamed: 0
57    PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L
58    PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l
59    PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L
60    PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L
61    PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l

What I'm trying to get:

      Unnamed: 0
57    SQNTURBO.Bb20
58    SQNTURBO.Fr 1l
59    SQNZ10.Bb 20
60    sqnbor.Bb 20
61    sqnbor.Bb 5

This is my unsuccessful code:


all_months["Unnamed: 0"] = all_months["Unnamed: 0"].str.extract(r"/.(.*)./", expand=False)
all_months

and the result...

      Unnamed: 0
57    NaN
58    NaN
59    NaN
60    NaN
61    NaN

Could you help me? I have some difficulty with regex, and it blows my mind whenever I have to deal with it.

CodePudding user response:

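You get no matches because there are no slashes in your strings. Python regex patterns are not written between / delimiters, so the / characters in your pattern are matched literally.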
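A minimal sketch of that point, using Python's standard re module on one of the sample values (the variable names are only illustrative). It also shows that dropping the slashes alone is not enough, because an unescaped . matches any character and .* is greedy:

```python
import re

s = "PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L"

# The slashes are literal characters here, and the string contains none.
with_slashes = re.search(r"/.(.*)./", s)
print(with_slashes)  # None

# Without the slashes the pattern matches, but "." means "any character",
# so the group greedily captures everything except the first and last char.
without_slashes = re.search(r".(.*).", s)
print(repr(without_slashes.group(1)))
```

To match a literal dot, it has to be escaped as \. or placed in a character class.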

You can use

all_months["Unnamed: 0"].str.extract(r"\.([^.]*\.[^.]*)", expand=False)

Series.str.extract returns the first match of the regex in each string. Details:

  • \. - a . char
  • ([^.]*\.[^.]*) - Group 1 (the value returned by Series.str.extract): zero or more chars other than ., a . char, and then zero or more chars other than .
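The two pieces described above can be checked on a single value with the standard re module. This is only a sketch: the "PA.only" string is a made-up input, added to show what happens when there is no second dot.

```python
import re

pattern = re.compile(r"\.([^.]*\.[^.]*)")
s = "PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L"

m = pattern.search(s)
full_match = m.group(0)  # includes the leading dot consumed by \.
captured = m.group(1)    # the part Series.str.extract returns for this value

print(repr(full_match))  # '.SQNZ10.Bb 20'
print(repr(captured))    # 'SQNZ10.Bb 20'

# Hypothetical value with a single dot: Group 1 needs a second dot, so there is no match.
no_second_dot = pattern.search("PA.only")
print(no_second_dot)     # None
```

For values that do not match, Series.str.extract yields NaN instead of raising an error.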

Pandas test:

import pandas as pd
all_months = pd.DataFrame({'Unnamed: 0':['PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L',
    'PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l',
    'PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L',
    'PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L',
    'PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l']})
# >>> all_months["Unnamed: 0"].str.extract(r"\.([^.]*\.[^.]*)", expand=False)
# 0     SQNTURBO.Bb20
# 1    SQNTURBO.Fr 1l
# 2      SQNZ10.Bb 20
# 3      sqnbor.Bb 20
# 4       sqnbor.Bb 5

CodePudding user response:

With your shown samples, please try the following Pandas code, which uses the .str.extract function.

all_months["Unnamed: 0"].str.extract(r"^(?:[^.]*\.)([^.]*\.[^.]*)", expand=False)


Explanation of code: The .str.extract function of Pandas is applied to the Unnamed: 0 column of the all_months DataFrame. The regex contains a single capturing group, so only the required part of each value is returned, as per the shown samples.

Explanation of regex:

^(?:[^.]*\.)   ##From the start of the value, a non-capturing group that matches everything
               ##up to and including the 1st occurrence of a dot.
([^.]*\.[^.]*) ##The 1st and only capturing group of this solution: matches everything up to
               ##the next dot, the dot itself, and then everything before the following dot.
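As a quick sanity check, the anchored pattern can be compared with the unanchored one from the other answer on the sample values from the question. This sketch uses the standard re module rather than Pandas, and the variable names are illustrative:

```python
import re

samples = [
    "PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L",
    "PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l",
    "PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L",
    "PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L",
    "PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l",
]

unanchored = re.compile(r"\.([^.]*\.[^.]*)")
anchored = re.compile(r"^(?:[^.]*\.)([^.]*\.[^.]*)")

# Group 1 of the first match, which is what .str.extract would return per row.
results_unanchored = [unanchored.search(s).group(1) for s in samples]
results_anchored = [anchored.search(s).group(1) for s in samples]

print(results_anchored)
print(results_anchored == results_unanchored)  # True
```

On these samples both patterns extract the same values. The anchored version just makes it explicit that matching starts at the beginning of the value.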