panda extract() operator return NaN-CodePudding

I need to extract specific values from data in a column on panda data frame and group another columns base on extracted values. the pattern that I need is U1 ,U2 ,U3 ,... U9.

first of all I found the values on data with match:

filtered = df[df['coloum1'].str.match(r'(U\s*\d)') == True]
print(filtered['coloum1'])

The out put is like this:

9370       U 1 / U 2; Gleisdreieck, barrierefreier Ausbau
9371                           U 1 / U 3; Tunnelsanierung
9372     U 1 / U 6; Hallesches Tor; barrierefreier Ausbau
9373     U 1 / U 8; Kottbusser Tor, barrierefreier Ausbau
9374     U 1 / U 9; Kurfürstendamm, barrierefreier Ausbau
                               ...                       
34032               U9, Hansaplatz: barrierefreier Ausbau
34033            U9, Nauener Platz: barrierefreier Ausbau
34034             U9, Schloßstraße: barrierefreier Ausbau
34035               U9, Turmstraße: barrierefreier Ausbau
34250                                                 U25

Now, I need extract those U1 ,... U9 . I change my code like this:

extracted = df[df['coloum1'].str.extract(r'(U\s*\d)') == True]

but I got only NaN as value

0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ... 
40815    NaN
40816    NaN
40817    NaN
40818    NaN
40819    NaN

CodePudding user response：

By doing df['coloum1'].str.extract(r'(U\s*\d)') == True, you're implicitly calling pandas.DataFrame.eq.

Get Equal to of dataframe and other, element-wise (binary operator eq)
Returns : DataFrame of bool

You can simply use this to get the the first match :

extracted = df['coloum1'].str.extract(r'(U\s*\d)')

Or pandas.Series.str.findall to get a list of all the matches :

extracted = df['coloum1'].str.findall(r'(U\s*\d)')

# Output :

      extract     findall
9370      U 1  [U 1, U 2]
9371      U 1  [U 1, U 3]
9372      U 1  [U 1, U 6]
9373      U 1  [U 1, U 8]
9374      U 1  [U 1, U 9]
...       ...         ...
34032      U9        [U9]
34033      U9        [U9]
34034      U9        [U9]
34035      U9        [U9]
34250      U2        [U2]

[11 rows x 2 columns]

# Edit:

Based on the comments, you can one of these solutions depending on your expectations :

donatation = pd.read_csv(os.path.join("zuwendungen-berlin.csv"))

# --- To create a Series with the extracted value (#Return: Series)
extracted = donatation['Zweck'].str.extract(r'(U\s*\d)', expand=False)
print(extracted)

# --- To create a new DataFrame with a single column containing (if possible) the extracted value (#Return: DataFrame)
extracted = donatation['Zweck'].str.extract(r'(U\s*\d)')
print(extracted)

# --- To create a new Column with the extracted value in the original dataframe (#Return: DataFrame)
extracted = donatation.assingn(New_Zweck= donatation['Zweck'].str.extract(r'(U\s*\d)', expand=False))
print(extracted['New_Zweck'])

# --- To filter the original dataframe without creating a column (#Return: DataFrame)
extracted = donatation.loc[donatation['Zweck'].str.contains(r'(U\s*\d)', na=False)]
print(extracted['Zweck'])

# --- To create a column with the extracted value and remove NaN values/rows (#Return: DataFrame)
donatation['New_Zweck'] = donatation['Zweck'].str.extract(r'(U\s*\d)')
extracted = donatation.loc[donatation['New_Zweck'].notna()]
print(extracted['New_Zweck'])