I'm trying to extract a specific substring from a string column, but without success.
This is the column with some string values:
Unnamed: 0
57 PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L
58 PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l
59 PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L
60 PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L
61 PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l
What I'm trying to get:
Unnamed: 0
57 SQNTURBO.Bb20
58 SQNTURBO.Fr 1l
59 SQNZ10.Bb 20
60 sqnbor.Bb 20
61 sqnbor.Bb 5
This is my unsuccessful code:
all_months["Unnamed: 0"] = all_months["Unnamed: 0"].str.extract(r"/.(.*)./", expand=False)
all_months
And the result:
Unnamed: 0
57 NaN
58 NaN
59 NaN
60 NaN
61 NaN
Could you help me? I have some difficulty with regex, and it blows my mind whenever I have to deal with it.
CodePudding user response:
You get no matches because there are no slashes in your strings: in a Python regex, / is a literal character, not a pattern delimiter.
You can use
all_months["Unnamed: 0"].str.extract(r"\.([^.]*\.[^.]*)", expand=False)
See the regex demo. Series.str.extract will extract the first occurrence of the regex match. Details:
\. - a . char
([^.]*\.[^.]*) - Group 1 (the value returned by Series.str.extract): zero or more non-. chars, a . char, and then zero or more chars other than .
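To make the breakdown above concrete, here is a minimal sketch that spells out the same pattern with re.VERBOSE and runs it on one of the sample strings. It uses Python's re module directly rather than Pandas, so the variable names (pattern, text, match) are only illustrative.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
pattern = re.compile(
    r"""
    \.          # a literal dot (the one right after PA)
    (           # start of group 1, the part that gets returned
      [^.]*     # zero or more characters that are not a dot
      \.        # a literal dot
      [^.]*     # zero or more characters that are not a dot
    )           # end of group 1
    """,
    re.VERBOSE,
)

text = "PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L"
match = pattern.search(text)

print(match.group(0))  # .SQNZ10.Bb 20  (whole match, including the leading dot)
print(match.group(1))  # SQNZ10.Bb 20   (group 1 only)
```

The leading dot is consumed by the match but sits outside the parentheses, which is why it does not appear in the extracted value.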
Pandas test:
import pandas as pd
all_months = pd.DataFrame({'Unnamed: 0':['PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L',
'PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l',
'PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L',
'PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L',
'PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l']})
# >>> all_months["Unnamed: 0"].str.extract(r"\.([^.]*\.[^.]*)", expand=False)
# 0 SQNTURBO.Bb20
# 1 SQNTURBO.Fr 1l
# 2 SQNZ10.Bb 20
# 3 sqnbor.Bb 20
# 4 sqnbor.Bb 5
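As a quick check of the point about slashes, the sketch below compares the pattern from the question with the one suggested here. It uses re.search from the standard library to stand in for what happens on each row; the helper first_group is a hypothetical name introduced only for this illustration.

```python
import re

samples = [
    "PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L",
    "PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l",
]

# Pattern from the question: in Python the slashes are literal characters.
original = re.compile(r"/.(.*)./")
# Pattern suggested in this answer.
fixed = re.compile(r"\.([^.]*\.[^.]*)")


def first_group(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


original_results = [first_group(original, s) for s in samples]
fixed_results = [first_group(fixed, s) for s in samples]

print(original_results)  # [None, None]
print(fixed_results)     # ['SQNTURBO.Bb20', 'sqnbor.Bb 5']

# The question's pattern only matches text that really contains slashes.
with_slashes = first_group(original, "/xSQNTURBOx/")
print(with_slashes)      # SQNTURBO
```

A row with no match comes back as None here, which corresponds to the NaN values seen in the question.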
CodePudding user response:
With your shown samples, please try the following Pandas code using the .str.extract function.
all_months["Unnamed: 0"].str.extract(r"^(?:[^.]*\.)([^.]*\.[^.]*)", expand=False)
Explanation of the code: this uses the .str.extract function of Pandas on the Unnamed: 0 column of the all_months DataFrame. The regex contains a single capturing group, so only the required part is returned, as per the shown samples.
Explanation of regex:
^(?:[^.]*\.)    ##Non-capturing group anchored at the start of the string: matches everything
                ##from the start up to and including the 1st occurrence of a dot.
([^.]*\.[^.]*)  ##1st and only capturing group: matches everything up to and including the next dot,
                ##followed by everything before the following occurrence of a dot.
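For a quick sanity check, the sketch below runs the anchored pattern over the five sample strings and compares it with the unanchored pattern \.([^.]*\.[^.]*) on the same strings. It uses the standard re module instead of Pandas, and the helper name extract is an assumption made for this example.

```python
import re

samples = [
    "PA.SQNTURBO.Bb20.0-Serquinutri Turbo - Bb20 L",
    "PA.SQNTURBO.Fr 1l.0-Serquinutri Turbo - Frasco 1l",
    "PA.SQNZ10.Bb 20.0-Serquinutri Zinco 10 - Bb 20 L",
    "PA.sqnbor.Bb 20.0-Serquinutri Serquibor - Bb 20 L",
    "PA.sqnbor.Bb 5.0-Serquinutri Serquibor - Bb 5l",
]

anchored = re.compile(r"^(?:[^.]*\.)([^.]*\.[^.]*)")
unanchored = re.compile(r"\.([^.]*\.[^.]*)")


def extract(pattern, text):
    """Return group 1 of the first match, or None when nothing matches."""
    match = pattern.search(text)
    return match.group(1) if match else None


anchored_results = [extract(anchored, s) for s in samples]
unanchored_results = [extract(unanchored, s) for s in samples]

print(anchored_results)
# ['SQNTURBO.Bb20', 'SQNTURBO.Fr 1l', 'SQNZ10.Bb 20', 'sqnbor.Bb 20', 'sqnbor.Bb 5']
print(anchored_results == unanchored_results)  # True

# A value with fewer than two dots yields no match at all.
too_few_dots = extract(anchored, "PA.only")
print(too_few_dots)  # None
```

On these samples both patterns return the same values; the anchored form simply makes explicit that the skipped prefix runs from the start of the string to the first dot.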