Retrieving a value from a single-value list within a dataframe that also contains empty lists

Sort of a wordy title, but I have a large dataframe that has lists with either single values or no values.

import pandas as pd

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
            }

df = pd.DataFrame(data)

result =

    Name    SysNum
0   BN #1   ['1']
1   HHC     []
2   A comp  []
3   B comp  []
4   BN #2   ['2']
5   HHC     []
6   A comp  []
7   B comp  []

This is a small subset of the larger dataframe; the rest repeats the same four names, with the BN number increasing by 1 each time. Is there a way to create a new column that pulls the "1" out of the list and turns it into an integer? I have tried a couple of different ways but have had no luck. The final goal is to have a column of just integers so I can search for the max.

CodePudding user response：

Because some of your data is an actual list ([]) and some of it is the string representation of a list ('["1"]'), you will need to convert everything to a string, use literal_eval to turn each string back into a list, and then explode. This assumes there is at most one value in each list.

import pandas as pd
from ast import literal_eval

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
            }

df = pd.DataFrame(data)
df['SysNum'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)

     Name SysNum
0   BN #1    1.0
1     HHC    NaN
2  A comp    NaN
3  B comp    NaN
4   BN #2    2.0
5     HHC    NaN
6  A comp    NaN
7  B comp    NaN
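
Since the stated goal is an integer column to take the max of, one possible follow-up is to cast the exploded values to pandas' nullable Int64 dtype rather than float, so the empty rows stay missing instead of forcing floats. This is only a minimal sketch, assuming each list holds at most one element; the column name SysNumInt and the variable names are illustrative, not from the original post.

```python
import pandas as pd
from ast import literal_eval

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]}
df = pd.DataFrame(data)

# Normalise every cell to a real list, then flatten (empty lists become NaN)
exploded = df['SysNum'].astype(str).apply(literal_eval).explode()

# to_numeric parses '1' as a number and keeps NaN; Int64 is the nullable integer dtype
df['SysNumInt'] = pd.to_numeric(exploded).astype('Int64')

# Missing values are skipped by default when taking the max
max_sysnum = df['SysNumInt'].max()
print(max_sysnum)
```

With the sample data the max comes from the BN #2 row, and the six rows with empty lists remain missing.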

CodePudding user response：

Since the final goal is to extract the number from each string list and end up with a column of just integers, you can do it as follows:

df['SysNum'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')

Here, we change the column to string type, then remove the characters ", ', [ and ] (and any backslash) from both ends with .str.strip(). Next, pd.to_numeric converts the remaining single number from a string to a numeric type; empty strings become NaN because of errors='coerce'. Finally, .ffill() forward fills the converted numbers into the rows that originally held empty lists, and downcast='infer' turns the result back into integers.

Result:

print(df)

     Name  SysNum
0   BN #1       1
1     HHC       1
2  A comp       1
3  B comp       1
4   BN #2       2
5     HHC       2
6  A comp       2
7  B comp       2
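
To make the chain easier to follow, here is a step-by-step sketch of the same idea. It assumes, as the result above does, that each empty row should inherit the number of the preceding BN row. It also swaps ffill(downcast='infer') for a plain ffill() followed by astype(int), because newer pandas versions warn that the downcast argument is deprecated. The intermediate variable names are illustrative.

```python
import pandas as pd

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]}
df = pd.DataFrame(data)

# 1. Stringify every cell: '["1"]' stays as it is, [] becomes '[]'
as_text = df['SysNum'].astype(str)

# 2. Strip quotes and brackets from both ends: '["1"]' -> '1', '[]' -> ''
stripped = as_text.str.strip('\'"[]')

# 3. Convert to numbers; empty strings cannot be parsed and become NaN
numbers = pd.to_numeric(stripped, errors='coerce')

# 4. Forward fill the gaps, then cast to int (safe here because row 0 has a value)
df['SysNum'] = numbers.ffill().astype(int)
print(df)
```

If the first row could be empty, the forward fill would leave a NaN at the top and the final astype(int) would fail, so that case would need its own handling.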

CodePudding user response：

The previous two answers describe good approaches to this problem.

My solution just adds some performance, if needed.

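import numpy as np

# len(df.SysNum) is the length of the whole Series (always > 0 here),
# so use .str.len() to test each row's string or list individually
df['result'] = np.where(
    df.SysNum.str.len() > 0,
    df.SysNum.str.strip('"[]'),
    0
)
# np.where yields a mix of '1'-style strings and 0, so convert to integers
df['result'] = pd.to_numeric(df['result'])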

Test

import time
from ast import literal_eval

import numpy as np
import pandas as pd

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp']*100000,
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]*100000
        }

df = pd.DataFrame(data)

# np.where, timed as originally written: len(df.SysNum) > 0 is a single
# boolean for the whole Series, not a per-row check
start = time.time()
df['result'] = 0
df.result = np.where(
    len(df.SysNum)>0,
    df.SysNum.str.strip('"[]'),
    0
)
print(time.time() - start)

# SeaBean
start = time.time()
df['res'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)
print(time.time() - start)

# It_is_Chris
start = time.time()
df['res1'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')
print(time.time() - start)

On my MacBook Air M1 this gives the following timings in seconds (I hope these results are not related to ARM, since np.where is normally the fastest solution):

0.3059520721435547
2.994331121444702
4.511949062347412
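
One thing worth checking before reading too much into these numbers: len(df.SysNum) is the length of the whole Series, so len(df.SysNum) > 0 is a single True rather than a row-by-row test, and .str.strip returns NaN for cells that are actual lists. A hedged sketch of a row-wise variant follows, assuming 0 is an acceptable placeholder for empty lists; has_value and picked are illustrative names. Timing this version would give a more like-for-like comparison with the other two answers.

```python
import numpy as np
import pandas as pd

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]}
df = pd.DataFrame(data)

# .str.len() works element-wise on both strings and lists, so this is a per-row mask
has_value = df['SysNum'].str.len() > 0

# Keep the stripped digit where a value exists, otherwise fall back to 0
picked = np.where(has_value, df['SysNum'].str.strip('"[]'), 0)

# np.where returns an object array here (strings mixed with 0), so parse it to integers
df['result'] = pd.to_numeric(pd.Series(picked, index=df.index)).astype(int)
print(df['result'].max())
```

Note that this fills empty rows with 0 rather than NaN or a forward-filled value, so the three approaches are not interchangeable even when their timings are comparable.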