How to extract array element from array column?

I'm working with a dataset available here: https://www.kaggle.com/datasets/lehaknarnauli/spotify-datasets?select=artists.csv. What I want to do is extract the first element of each array in the genres column. For example, given ['pop', 'rock'], I'd like to extract 'pop'. I tried several approaches, but none of them works and I don't know why.

Here is my code:

import pandas as pd

df = pd.read_csv('artists.csv')

# approach 1
df['top_genre'] = df['genres'].str[0]
# Error: 'str' object has no attribute 'str'

# approach 2
df = df.assign(top_genre = lambda x: df['genres'].str[0])
# The result is a single bracket '[' in each row. It seems index 0 refers to the first character of a string, not the first array element.

# approach 3
df['top_genre'] = df['genres'].apply(lambda x: '[]' if not x else x[0])
# The result is a single bracket '[' in each row. It seems index 0 refers to the first character of a string, not the first array element.

Why don't these approaches work, and how can I make this work?

CodePudding user response：

In your code, the genres column is a string representation of a list of genres, which means that each value in the column is a string enclosed in square brackets ([]).
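
A minimal sketch of why index 0 returns a bracket. The value of `raw` is a hypothetical cell, assumed to look like the example in the question rather than taken from the actual file:

```python
import ast

raw = "['pop', 'rock']"  # hypothetical cell value, as read from the CSV

# Indexing the string returns its first character, the opening bracket.
first_char = raw[0]

# Parsing the string first yields a real list, so [0] is the first genre.
parsed = ast.literal_eval(raw)
first_genre = parsed[0]

print(repr(first_char), repr(first_genre))
```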

To extract the first element of the list, you first need to convert the string to a list using the ast.literal_eval function from the ast module. This function safely evaluates a string containing a Python literal and returns the corresponding object. However, because some rows contain an empty list ('[]'), indexing the result with [0] would raise an IndexError, so check for that case before calling ast.literal_eval:

import ast

def get_top_genre(x):
    if x == '[]':
        return None
    return ast.literal_eval(x)[0]

df['top_genre'] = df['genres'].apply(get_top_genre)
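
If the column might also contain missing cells or malformed strings, a more defensive variant could look like the sketch below. The name `first_genre` and the choice to return None for anything unparseable are assumptions for illustration, not part of the answer above:

```python
import ast

def first_genre(value):
    """Return the first element of a stringified list, or None."""
    if not isinstance(value, str):
        return None  # e.g. NaN from an empty CSV cell
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None  # the string is not a valid Python literal
    if isinstance(parsed, (list, tuple)) and parsed:
        return parsed[0]
    return None  # empty list, or a literal that is not a sequence

print(first_genre("['pop', 'rock']"), first_genre('[]'))
```

It can be passed to apply in the same way, for example df['genres'].apply(first_genre).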

CodePudding user response：

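Another way to do it, provided the strings are valid JSON (json.loads requires double-quoted elements and raises a JSONDecodeError on single-quoted ones such as "['pop', 'rock']"):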

import json
df["top_genre"]=df["genres"].apply(lambda x: None if x == '[]' else json.loads(x)[0])
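
Whether json.loads can parse the column depends on how the strings are quoted. The short check below uses two hypothetical sample values (not read from the dataset) to show the difference, with ast.literal_eval as a fallback that accepts both forms:

```python
import ast
import json

single_quoted = "['pop', 'rock']"  # Python-style repr, not valid JSON
double_quoted = '["pop", "rock"]'  # valid JSON

try:
    json.loads(single_quoted)
    json_accepts_single_quotes = True
except json.JSONDecodeError:
    json_accepts_single_quotes = False

# JSON parsing works once the elements are double-quoted.
from_json = json.loads(double_quoted)[0]

# ast.literal_eval handles either quoting style.
from_literal = ast.literal_eval(single_quoted)[0]

print(json_accepts_single_quotes, from_json, from_literal)
```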

CodePudding user response：

Your genres column does not actually seem to hold lists but strings that contain a list, such as "['a', 'b']". You will have to evaluate each string to convert it back into a list object. You could use eval for this, but for safety reasons it's better to use ast.literal_eval.
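
To illustrate the safety point, here is a small sketch. The `unsafe_input` string is a made-up example of something eval would execute, while ast.literal_eval refuses it because it is not a plain literal:

```python
import ast

safe_input = "['a', 'b']"
unsafe_input = "__import__('os').getcwd()"  # a function call, not a literal

# A plain list literal is parsed into a list object.
parsed = ast.literal_eval(safe_input)

# Anything beyond literals is rejected instead of being executed.
try:
    ast.literal_eval(unsafe_input)
    rejected = False
except ValueError:
    rejected = True

print(parsed, rejected)
```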

Using a converter while reading the dataset

One way is to apply a converter while loading the dataset, using the converters parameter of read_csv. The advantage of this method is that you can define multiple transformations and type casts in a single dictionary, which can then be applied to a large number of similar files at once, if needed.

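import pandas as pd
from ast import literal_eval

# Parse each genres cell into a real list while the file is being read
df = pd.read_csv('/path_do_data/artists.csv',
                 converters={'genres': literal_eval})
# .str[0] takes the first element of each list (NaN for empty lists)
df['genres'].str[0]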

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                 ...        
1104344                  NaN
1104345    deep acoustic pop
1104346                  NaN
1104347                  NaN
1104348                  NaN
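
For a quick check without downloading the file, the same idea can be tried on an in-memory CSV. The three rows below are hypothetical sample data, and `parse_genres` is an illustrative wrapper (not from the answer) that also tolerates blank cells, which literal_eval alone would reject:

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical three-row sample standing in for artists.csv
csv_text = '''name,genres
artist_a,"['pop', 'rock']"
artist_b,[]
artist_c,"['deep acoustic pop']"
'''

def parse_genres(cell):
    # Converters receive the raw cell text; treat a blank cell as an empty list
    return literal_eval(cell) if cell.strip() else []

df = pd.read_csv(io.StringIO(csv_text), converters={'genres': parse_genres})

# First element of each list; rows with an empty list come back as missing
top_genre = df['genres'].str[0]
print(top_genre.tolist())
```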

Using apply method on a column

Another way to solve this is to load the data first and then convert the strings with literal_eval using apply. This takes an extra line of code to overwrite the existing column. It works just as well, though it is a bit redundant in my opinion.

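import pandas as pd
from ast import literal_eval

df = pd.read_csv('/path_do_data/artists.csv')
# Overwrite the string column with parsed list objects
df['genres'] = df['genres'].apply(literal_eval)
# .str[0] takes the first element of each list (NaN for empty lists)
df['genres'].str[0]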

0                        NaN
1                        NaN
2                        NaN
3                        NaN
4                        NaN
                 ...        
1104344                  NaN
1104345    deep acoustic pop
1104346                  NaN
1104347                  NaN