Retrieving a value from a single value list within a dataframe with empty lists also within the data

Time:10-22

Sort of a wordy title, but I have a large dataframe with a column whose cells are lists holding either a single value or no value at all.

import pandas as pd

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
        }

df = pd.DataFrame(data)

result =

    Name    SysNum
0   BN #1   ['1']
1   HHC     []
2   A comp  []
3   B comp  []
4   BN #2   ['2']
5   HHC     []
6   A comp  []
7   B comp  []

This is a small subset of the larger dataframe; the rest repeats the same four names, with the BN number increasing by 1 each time. Is there a way to create a new column that pulls the "1" out of the list and turns it into an integer? I have tried a couple of different ways without any luck. The final goal is a column of just integers so I can search for the max.

CodePudding user response:

Because some of your data is a list ([]) and some of it is the string form of a list ('["1"]'), you will need to convert everything to a string, use literal_eval to turn every string into a real list, and then explode. This assumes there is at most one value in each list. The result is float rather than int because the empty lists become NaN.

import pandas as pd
from ast import literal_eval

data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
            }

df = pd.DataFrame(data)
df['SysNum'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)

     Name SysNum
0   BN #1    1.0
1     HHC    NaN
2  A comp    NaN
3  B comp    NaN
4   BN #2    2.0
5     HHC    NaN
6  A comp    NaN
7  B comp    NaN
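
If it helps to see the parsing step spelled out, here is a minimal plain-Python sketch of what happens to each cell. The helper name sysnum_to_int is made up for illustration, and it assumes every cell is either a list or the string form of a list with at most one value.

```python
from ast import literal_eval


def sysnum_to_int(cell):
    """Return the single value in a cell as an int, or None if the list is empty.

    Hypothetical helper: assumes `cell` is a list or the string form of a list.
    """
    # Cells such as '["1"]' are strings, so parse them into real lists first.
    values = literal_eval(cell) if isinstance(cell, str) else cell
    if not values:
        return None  # empty list: nothing to extract
    return int(values[0])


cells = ['["1"]', [], [], [], '["2"]', [], [], []]
parsed = [sysnum_to_int(c) for c in cells]
# Ignore the empty cells when looking for the largest system number.
largest = max(v for v in parsed if v is not None)
print(parsed)
print(largest)
```

Something like df['SysNum'].map(sysnum_to_int) should apply the same logic to the column, though it will probably be slower than the vectorised one-liner on a large dataframe.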

CodePudding user response:

As your final goal is a column of just integers, you can extract the number from each string list and convert it as follows:

df['SysNum'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')

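Here, we convert the column to string type, then remove the characters ", ', [ and ] from both ends of each value with .str.strip(). Next, pd.to_numeric converts the remaining number from string to numeric type, and errors='coerce' turns the empty strings left behind by the empty lists into NaN. Finally, .ffill() forward fills each converted number into the rows below it that originally held an empty list, and downcast='infer' brings the column back to integer type (recent pandas versions warn that this argument is deprecated).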

Result:

print(df)

     Name  SysNum
0   BN #1       1
1     HHC       1
2  A comp       1
3  B comp       1
4   BN #2       2
5     HHC       2
6  A comp       2
7  B comp       2
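
To make the forward-fill step concrete, here is a rough plain-Python sketch of the same logic outside pandas. The function name forward_fill_sysnums is hypothetical, and it assumes the first cell always holds a number, as in the sample data.

```python
def forward_fill_sysnums(cells):
    """Strip list punctuation from each cell and carry the last number forward.

    Hypothetical helper mirroring astype(str) -> strip -> to_numeric -> ffill.
    """
    filled = []
    last_seen = None  # stays None until the first non-empty cell appears
    for cell in cells:
        # str([]) is '[]' and str('["1"]') is '["1"]'; stripping leaves '' or '1'.
        text = str(cell).strip('\'"[]')
        if text:
            last_seen = int(text)
        filled.append(last_seen)
    return filled


cells = ['["1"]', [], [], [], '["2"]', [], [], []]
print(forward_fill_sysnums(cells))  # [1, 1, 1, 1, 2, 2, 2, 2]
```

Every HHC, A comp and B comp row inherits the number of the BN row above it. That is handy if the rows are later grouped by system number; if you only need the max, the fill step is optional.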

CodePudding user response:

The previous two answers describe good approaches to this problem.

My solution just adds some performance, if needed.

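import numpy as np

# len(df.SysNum) is the length of the whole column (one number), so it cannot
# serve as a per-row condition; .str.len() gives the length of each cell instead.
df['result'] = np.where(
    df.SysNum.str.len() > 0,        # True for '["1"]', False for []
    df.SysNum.str.strip('"[]'),     # '["1"]' -> '1'
    0                               # empty lists become 0
).astype(int)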

Test

import pandas as pd
import time
data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp']*100000,
        'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]*100000
        }

df = pd.DataFrame(data)
start = time.time()
import numpy as np
# Per-row condition: .str.len() is the length of each cell, 0 for an empty list.
df['result'] = np.where(
    df.SysNum.str.len() > 0,
    df.SysNum.str.strip('"[]'),
    0
).astype(int)
print(time.time() - start)
# SeaBean
from ast import literal_eval
start = time.time()
df['res'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)
print(time.time() - start)
# It_is_Chris
start = time.time()
df['res1'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')
print(time.time() - start)

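On my MacBook Air M1 this gave the following times in seconds, printed in the order np.where, literal_eval, strip with ffill (I hope these results are not specific to ARM, since np.where is normally the fastest solution). The three snippets do not produce identical output, so treat this as a rough comparison: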

0.3059520721435547
2.994331121444702
4.511949062347412
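
A single time.time() difference can be noisy. As a sketch of a steadier measurement, the standard-library timeit module can repeat a call several times and report the best run. The function strip_cells below is a made-up stand-in workload, not one of the answers above; swap in whichever snippet you want to time.

```python
import timeit


def strip_cells(cells):
    """Stand-in workload (hypothetical): strip list punctuation from every cell."""
    return [str(cell).strip('\'"[]') for cell in cells]


cells = ['["1"]', [], [], [], '["2"]', [], [], []] * 1000

# Run the call 10 times per measurement, take 5 measurements, keep the fastest.
timings = timeit.repeat(lambda: strip_cells(cells), number=10, repeat=5)
best = min(timings) / 10  # seconds per call in the best measurement
print(f"best of 5: {best:.6f} s per call")
```

Taking the minimum rather than the mean is a common choice here, since slower runs usually reflect other activity on the machine rather than the code being timed.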