I have a column in a pandas dataframe that includes integers, lists of integers, strings, and combinations therein. For example:
df = pd.DataFrame({'A': [0, 1, '1, 2, z', 0, 2, '1, z', 'z', 0, 1]})
I want to be able to firstly identify the unique elements among these entries, and then to be able to sort by them. The first problem is recognizing the lists as collections of individual elements:
print(set(df['A'].values))
{0, 1, 2, '1, 2, z', 'z', '1, z'}
where I would ideally like to retrieve here is: {0, 1, 2, z}. I tried splitting the lists along the commas, but this ignores all the integers:
df_B = df['A'].str.split(',', expand=True)
print (df_B)
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 1 2 z
3 NaN NaN NaN
4 NaN NaN NaN
5 1 z None
6 z None None
7 NaN NaN NaN
8 NaN NaN NaN
I can get around that by first converting everything to strings:
df_tmp = df['A'].astype(str)
df_C = df_tmp.str.strip().str.split(',', expand=True)
print (df_C)
0 1 2
0 0 None None
1 1 None None
2 1 2 z
3 0 None None
4 2 None None
5 1 z None
6 z None None
7 0 None None
8 1 None None
print(set(np.concatenate(df_C.values)))
{'1', ' 2', 'z', None, '2', '0', ' z'}
But I still have a problem with white-space despite using str.strip(), so if I now look for occurrences of a particular element like '2', I don't find those with whitespaces:
df_keep = df_C.loc[df_C[0] == '2']
print (df_keep)
0 1 2
4 2 None None
In any case, I wonder if there isn't a better/more efficient way to achieve what I am trying to do than to convert everything to strings and to generate another dataframe with new columns for each element, since I then have to copy the indices into another list to retrieve the information from the corresponding rows of the original dataframe.
Is there a way to search through a mixture of individual elements and lists in the original dataframe?
CodePudding user response:
IIUC, convert all to string, split, explode and get the set:
s = df['A'].astype(str).str.split(',\s*').explode()
out = set(s)
output: {'0', '1', '2', 'z'}
To keep integers, you could do:
s = df['A'].astype(str).str.split(',\s*').explode(ignore_index=True)
s2 = pd.to_numeric(s, errors='coerce').dropna().astype(int)
out = set(s.loc[s.index.difference(s2.index)]) | set(s2)
output: {0, 1, 2, 'z'}