I have a dataframe with columns with tags assigned to the text. I want to create a tags column, which would contain a list of all possible tags without NaN.
I can remove NaN from a single list, but unsure what is the most efficient way to remove them for all lists in the tags column. My dataframe contains 30,000 rows.
Any help would be greatly appreciated!
import pandas as pd
df = pd.DataFrame(data = {'text': ['Quinbrook acquires planned 350 MW project', 'Australian rooftop solar to shine bright', 'The US installed 5.7 GW of solar in Q2'],
'acquisition': ['acquisition', np.nan, np.nan], 'tender': [np.nan, np.nan, np.nan], 'opinion': [np.nan, 'opinion', np.nan]})
# get names of the tags
tags = list(df.columns)
tags.remove('text')
# Create tags column
df['tags'] = df[tags].values.tolist()
# Remove NaN values from a single list
[x for x in df['tags'][0] if str(x) != 'nan']
# ['acquisition']
CodePudding user response:
If use pandas solution with reshape by DataFrame.stack
and aggregate list
is possible, but slow:
df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
print (df)
text acquisition tender opinion \
0 Quinbrook acquires planned 350 MW project acquisition NaN NaN
1 Australian rooftop solar to shine bright NaN NaN opinion
2 The US installed 5.7 GW of solar in Q2 NaN NaN NaN
tags
0 [acquisition]
1 [opinion]
2 []
Your solution is faster if use nested list comprehension:
df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]
Or:
df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]
Performance in sample data for 30k rows:
df = pd.concat([df] * 10000, ignore_index=True)
tags = list(df.columns)
tags.remove('text')
In [129]: %timeit df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
1.21 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [130]: %timeit df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]
76.2 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [131]: %timeit df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]
110 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)