I need to extract specific values from data in a column on panda data frame and group another columns base on extracted values. the pattern that I need is U1 ,U2 ,U3 ,... U9.
first of all I found the values on data with match
:
filtered = df[df['coloum1'].str.match(r'(U\s*\d)') == True]
print(filtered['coloum1'])
The out put is like this:
9370 U 1 / U 2; Gleisdreieck, barrierefreier Ausbau
9371 U 1 / U 3; Tunnelsanierung
9372 U 1 / U 6; Hallesches Tor; barrierefreier Ausbau
9373 U 1 / U 8; Kottbusser Tor, barrierefreier Ausbau
9374 U 1 / U 9; Kurfürstendamm, barrierefreier Ausbau
...
34032 U9, Hansaplatz: barrierefreier Ausbau
34033 U9, Nauener Platz: barrierefreier Ausbau
34034 U9, Schloßstraße: barrierefreier Ausbau
34035 U9, Turmstraße: barrierefreier Ausbau
34250 U25
Now, I need extract those U1 ,... U9 . I change my code like this:
extracted = df[df['coloum1'].str.extract(r'(U\s*\d)') == True]
but I got only NaN as value
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
40815 NaN
40816 NaN
40817 NaN
40818 NaN
40819 NaN
CodePudding user response:
By doing df['coloum1'].str.extract(r'(U\s*\d)') == True
, you're implicitly calling pandas.DataFrame.eq
.
Get Equal to of dataframe and other, element-wise (binary operator eq)
Returns : DataFrame of bool
You can simply use this to get the the first match :
extracted = df['coloum1'].str.extract(r'(U\s*\d)')
Or pandas.Series.str.findall
to get a list of all the matches :
extracted = df['coloum1'].str.findall(r'(U\s*\d)')
# Output :
extract findall
9370 U 1 [U 1, U 2]
9371 U 1 [U 1, U 3]
9372 U 1 [U 1, U 6]
9373 U 1 [U 1, U 8]
9374 U 1 [U 1, U 9]
... ... ...
34032 U9 [U9]
34033 U9 [U9]
34034 U9 [U9]
34035 U9 [U9]
34250 U2 [U2]
[11 rows x 2 columns]
# Edit:
Based on the comments, you can one of these solutions depending on your expectations :
donatation = pd.read_csv(os.path.join("zuwendungen-berlin.csv"))
# --- To create a Series with the extracted value (#Return: Series)
extracted = donatation['Zweck'].str.extract(r'(U\s*\d)', expand=False)
print(extracted)
# --- To create a new DataFrame with a single column containing (if possible) the extracted value (#Return: DataFrame)
extracted = donatation['Zweck'].str.extract(r'(U\s*\d)')
print(extracted)
# --- To create a new Column with the extracted value in the original dataframe (#Return: DataFrame)
extracted = donatation.assingn(New_Zweck= donatation['Zweck'].str.extract(r'(U\s*\d)', expand=False))
print(extracted['New_Zweck'])
# --- To filter the original dataframe without creating a column (#Return: DataFrame)
extracted = donatation.loc[donatation['Zweck'].str.contains(r'(U\s*\d)', na=False)]
print(extracted['Zweck'])
# --- To create a column with the extracted value and remove NaN values/rows (#Return: DataFrame)
donatation['New_Zweck'] = donatation['Zweck'].str.extract(r'(U\s*\d)')
extracted = donatation.loc[donatation['New_Zweck'].notna()]
print(extracted['New_Zweck'])