I'm working with a dataset available here: https://www.kaggle.com/datasets/lehaknarnauli/spotify-datasets?select=artists.csv. I want to extract the first element of each array in the genres column. For example, given ['pop', 'rock'] I'd like to extract 'pop'. I tried several approaches, but none of them work and I don't know why.
Here is my code:
import pandas as pd
df = pd.read_csv('artists.csv')
# approach 1
df['top_genre'] = df['genres'].str[0]
# Error: 'str' object has no attribute 'str'
# approach 2
df = df.assign(top_genre = lambda x: df['genres'].str[0])
# The result is single bracket '[' in each row. Seems like index=0 refers to first character of a string, not first array element.
# approach 3
df['top_genre'] = df['genres'].apply(lambda x: '[]' if not x else x[0])
# The result is single bracket '[' in each row. Seems like index=0 refers to first character of a string, not first array element.
Why don't these approaches work, and how can I make this work?
CodePudding user response:
In your data, the genres column holds a string representation of a list of genres: each value is a string such as "['pop', 'rock']", not an actual list. That is why indexing with [0] returns the first character, '['.
To extract the first element, first convert the string to a list using the ast.literal_eval function from the ast module. This function safely evaluates a string containing a Python literal and returns the corresponding object. However, because some rows contain an empty list ('[]'), which has no first element, you should check for that case before indexing:
import ast
def get_top_genre(x):
    if x == '[]':
        return None
    return ast.literal_eval(x)[0]
df['top_genre'] = df['genres'].apply(get_top_genre)
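To see why the attempts in the question returned '[', here is a minimal sketch on hand-made values rather than the real CSV. The helper name first_genre and the sample strings are illustrative assumptions, not part of the dataset or of pandas.

```python
import ast

raw = "['pop', 'rock']"   # what read_csv stores by default: a plain string

# Indexing the string returns its first character, not the first genre.
print(raw[0])             # [

parsed = ast.literal_eval(raw)   # now a real Python list
print(parsed[0])          # pop


def first_genre(value):
    """Return the first genre from a stringified list, or None if there is none."""
    if not isinstance(value, str):   # e.g. NaN from a missing cell
        return None
    items = ast.literal_eval(value)
    return items[0] if items else None


print(first_genre("[]"))                 # None
print(first_genre("['pop', 'rock']"))    # pop
```

A helper like this could be passed to df['genres'].apply(...) in place of get_top_genre if the column may also contain missing values.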
CodePudding user response:
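Another way to do it, using json. Note that json.loads only accepts double-quoted strings, so the single quotes have to be replaced first. This breaks on genre names that contain an apostrophe; in that case ast.literal_eval is the safer choice:
import json
df["top_genre"] = df["genres"].apply(lambda x: None if x == '[]' else json.loads(x.replace("'", '"'))[0])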
CodePudding user response:
Your genres column does not actually contain lists but strings that look like lists, such as "['a', 'b']". You could use eval on each string to convert it back into a list object, but for safety reasons it's better to use ast.literal_eval.
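The safety difference can be sketched as follows. The two input strings are made up for illustration; the point is only that literal_eval accepts Python literals and refuses anything that would need to be executed.

```python
import ast

safe_input = "['a', 'b']"
print(ast.literal_eval(safe_input))   # ['a', 'b']

# A string that is code, not a literal. eval() would execute it;
# literal_eval() refuses and raises ValueError instead.
unsafe_input = "__import__('os').getcwd()"
try:
    ast.literal_eval(unsafe_input)
    rejected = False
except ValueError:
    rejected = True
print(rejected)   # True
```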
Using a converter while reading the dataset
One way is to apply a converter while loading the dataset itself, using the converters parameter of read_csv. The advantage of this method is that a single dictionary can carry several transformations and type casts, and the same dictionary can be reused across many similar files if needed.
import pandas as pd
from ast import literal_eval
df = pd.read_csv('/path_do_data/artists.csv',
                 converters={'genres': literal_eval})
df['genres'].str[0]
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1104344 NaN
1104345 deep acoustic pop
1104346 NaN
1104347 NaN
1104348 NaN
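For trying the converter approach without downloading the file, here is a small sketch on an in-memory CSV. The two rows, the name column, and the extra str.strip converter are assumptions made up for illustration; only the genres converter mirrors the code above.

```python
import io
from ast import literal_eval

import pandas as pd

# Tiny in-memory stand-in for artists.csv (made-up rows).
csv_text = '''name,genres
Artist A,"['pop', 'rock']"
Artist B,[]
'''

# One dictionary can hold a converter per column.
converters = {
    'genres': literal_eval,   # stringified list -> real list
    'name': str.strip,        # illustrative second converter
}

df = pd.read_csv(io.StringIO(csv_text), converters=converters)

# With real lists in the column, .str[0] picks the first element
# and yields NaN for empty lists.
df['top_genre'] = df['genres'].str[0]
print(df)
```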
Using the apply method on a column
Another way is to convert the strings with literal_eval after loading. This takes an extra line of code to overwrite the existing column; it works just as well, though it's a bit redundant in my opinion.
from ast import literal_eval
df = pd.read_csv('/path_do_data/artists.csv')
df['genres'] = df['genres'].apply(literal_eval)
df['genres'].str[0]
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
...
1104344 NaN
1104345 deep acoustic pop
1104346 NaN
1104347 NaN