I want to print the unique values found in this column, without the numeric parts. I only want the part of each string that comes before the special character (when there is one), not the second part. For example, for the row "lala :59 lzenvke" I want to keep only "lala" and ignore "lzenvke".
import pandas as pd
data1 = {
'column_with_names': ['lala :56 javcejhv', 'lala56 : javcejhv', 'li :lo 7TUF', 'lo','lala :59 lzenvke','la','lala','lalalo'],
}
df1 = pd.DataFrame(data1)
print(df1)
the expected output would be:
CodePudding user response:
Here is one way to do it.
Assumption: rows that don't contain a colon are also included in the result set.
import numpy as np
# split each value at the first whitespace or colon (n=1 limits it to one split);
# expand=True returns the pieces as DataFrame columns
# take the first column, i.e. the text before the separator
# find the unique values using np.unique (which also sorts them)
# finally create a DF
pd.DataFrame(np.unique(df1['column_with_names'].str.split(r'[\s:]', n=1, expand=True)[0]))
0
0 la
1 lala
2 lala56
3 lalalo
4 li
5 lo
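The same idea can be sketched with pandas alone, in case you'd rather not pull in NumPy. This is only a minimal sketch: `unique_prefixes` is a hypothetical helper name (not part of pandas), and the sample column is rebuilt from the question's rows.

```python
import pandas as pd

def unique_prefixes(series: pd.Series) -> list:
    """Return the sorted unique text found before the first whitespace or colon."""
    # n=1 limits the split to the first separator; .str[0] keeps the left-hand part.
    # Values without any separator are returned unchanged.
    prefixes = series.str.split(r'[\s:]', n=1).str[0]
    return sorted(prefixes.unique().tolist())

df1 = pd.DataFrame({
    'column_with_names': ['lala :56 javcejhv', 'lala56 : javcejhv', 'li :lo 7TUF',
                          'lo', 'lala :59 lzenvke', 'la', 'lala', 'lalalo'],
})

print(unique_prefixes(df1['column_with_names']))
# ['la', 'lala', 'lala56', 'lalalo', 'li', 'lo']
```

Returning a plain list instead of a DataFrame is just a choice for easier printing; wrap it in `pd.DataFrame(...)` if you need a frame.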
If you only need to consider the rows that contain a colon:
# same as above, except keep only the rows that contain a colon beforehand
(pd.DataFrame(
np.unique(df1.loc[df1['column_with_names'].str.contains(':'), 'column_with_names']
.str.split(r'[\s:]', n=1, expand=True)[0])))
0
0 lala
1 lala56
2 li
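If "not the numerical ones" in the question means the digits should be dropped from the prefixes as well, the colon-only variant could be extended roughly like this. It is a sketch under that assumption: `colon_row_prefixes` and its `drop_digits` flag are hypothetical names introduced here for illustration.

```python
import pandas as pd

def colon_row_prefixes(series: pd.Series, drop_digits: bool = False) -> list:
    """Sorted unique prefixes, taken only from the rows that contain a colon."""
    # keep only the rows that contain a literal colon
    with_colon = series[series.str.contains(':', regex=False)]
    # text before the first whitespace or colon
    prefixes = with_colon.str.split(r'[\s:]', n=1).str[0]
    if drop_digits:
        # assumption: "not the numerical ones" means stripping digits from the prefix
        prefixes = prefixes.str.replace(r'\d+', '', regex=True)
    return sorted(prefixes.unique().tolist())

names = pd.Series(['lala :56 javcejhv', 'lala56 : javcejhv', 'li :lo 7TUF',
                   'lo', 'lala :59 lzenvke', 'la', 'lala', 'lalalo'])

print(colon_row_prefixes(names))                    # ['lala', 'lala56', 'li']
print(colon_row_prefixes(names, drop_digits=True))  # ['lala', 'li']
```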