Pandas create list from column values based on condition-CodePudding

I have a dataframe with columns with tags assigned to the text. I want to create a tags column, which would contain a list of all possible tags without NaN.

I can remove NaN from a single list, but unsure what is the most efficient way to remove them for all lists in the tags column. My dataframe contains 30,000 rows.

Any help would be greatly appreciated!

import pandas as pd

df = pd.DataFrame(data = {'text': ['Quinbrook acquires planned 350 MW project', 'Australian rooftop solar to shine bright', 'The US installed 5.7 GW of solar in Q2'],
                          'acquisition': ['acquisition', np.nan, np.nan], 'tender': [np.nan, np.nan, np.nan], 'opinion': [np.nan, 'opinion', np.nan]})

# get names of the tags 
tags = list(df.columns)
tags.remove('text')

# Create tags column
df['tags'] = df[tags].values.tolist()


# Remove NaN values from a single list

[x for x in df['tags'][0] if str(x) != 'nan']

# ['acquisition']

CodePudding user response：

If use pandas solution with reshape by DataFrame.stack and aggregate list is possible, but slow:

df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
print (df)
                                        text  acquisition  tender  opinion  \
0  Quinbrook acquires planned 350 MW project  acquisition     NaN      NaN   
1   Australian rooftop solar to shine bright          NaN     NaN  opinion   
2     The US installed 5.7 GW of solar in Q2          NaN     NaN      NaN   

            tags  
0  [acquisition]  
1      [opinion]  
2             []

Your solution is faster if use nested list comprehension:

df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]

Or:

df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]

Performance in sample data for 30k rows:

df = pd.concat([df] * 10000, ignore_index=True)

tags = list(df.columns)
tags.remove('text')


In [129]: %timeit df['tags'] = df[tags].stack().groupby(level=0).agg(list).reindex(df.index, fill_value=[])
1.21 s ± 135 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [130]: %timeit df['tags'] = [[y for y in x if str(y) != 'nan'] for x in df[tags].to_numpy()]
76.2 ms ± 487 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [131]: %timeit df['tags'] = [[y for y in x if pd.notna(y)] for x in df[tags].to_numpy()]
110 ms ± 4.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)