Sort of a wordy title, but I have a large dataframe that has lists with either single values or no values.
import pandas as pd
data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
}
df = pd.DataFrame(data)
result =
Name SysNum
0 BN #1 ['1']
1 HHC []
2 A comp []
3 B comp []
4 BN #2 ['2']
5 HHC []
6 A comp []
7 B comp []
This is a small subset of the larger dataframe; the rest repeats the same four names, with the BN number increasing by 1 each time. Is there a way to create a new column that pulls the "1" out of the list and turns it into an integer? I have tried a couple of different ways but have had no luck. The final goal is a column of just integers so I can search for the max.
CodePudding user response:
Because some of your data is a list ([]) and some of it is a string representation of a list ('[]'), you will need to convert everything to a string, use literal_eval to turn every value back into a list, and then explode, assuming there is at most one value in each list.
import pandas as pd
from ast import literal_eval
data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]
}
df = pd.DataFrame(data)
df['SysNum'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)
Name SysNum
0 BN #1 1.0
1 HHC NaN
2 A comp NaN
3 B comp NaN
4 BN #2 2.0
5 HHC NaN
6 A comp NaN
7 B comp NaN
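Since the stated goal is an integer column and its maximum, here is a minimal sketch that builds on the same literal_eval and explode steps. The column name SysNumInt and the variable max_sysnum are made up for illustration, and the nullable Int64 dtype is my own choice for keeping integers next to missing values; it is not part of the answer above.

```python
from ast import literal_eval

import pandas as pd

data = {
    'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
    'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []],
}
df = pd.DataFrame(data)

# Turn every cell into a real list, then flatten the one-element lists to scalars.
# Empty lists become NaN after explode().
parsed = df['SysNum'].astype(str).apply(literal_eval).explode()

# pd.to_numeric converts '1' to 1.0 and keeps NaN; the nullable "Int64" dtype
# (an assumption of this sketch) then stores integers alongside <NA>.
df['SysNumInt'] = pd.to_numeric(parsed).astype('Int64')

# max() skips missing values by default.
max_sysnum = int(df['SysNumInt'].max())
print(max_sysnum)
```

This only works cleanly when every list holds at most one element; with longer lists explode() would add rows and the assignment back to the dataframe would no longer line up one-to-one.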
CodePudding user response:
As your final goal is to extract each number from the string lists and end up with a column of just integers, you can do it as follows:
df['SysNum'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')
Here, we change the column to string type, then remove the characters ", ', [ and ] by using .str.strip(). Then we use pd.to_numeric to convert each single number from string to numeric type. Finally, we forward fill the converted numbers into the rows that originally held an empty list with .ffill().
Result:
print(df)
Name SysNum
0 BN #1 1
1 HHC 1
2 A comp 1
3 B comp 1
4 BN #2 2
5 HHC 2
6 A comp 2
7 B comp 2
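As far as I know, recent pandas 2.x releases deprecate the downcast argument of .ffill(), so the one-liner above may emit a warning or fail on newer versions. Below is a sketch of the same strip, convert and forward-fill idea with an explicit cast instead. The intermediate names cleaned and numbers are only for illustration, and the final astype(int) assumes the first row always holds a number.

```python
import pandas as pd

data = {
    'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
    'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []],
}
df = pd.DataFrame(data)

# '["1"]' -> '1' and '[]' -> '' once quotes and brackets are stripped.
cleaned = df['SysNum'].astype(str).str.strip('\'"[]')

# Empty strings cannot be parsed, so errors='coerce' turns them into NaN.
numbers = pd.to_numeric(cleaned, errors='coerce')

# Forward fill each BN number into the rows below it, then cast to int.
# The cast would raise if the very first row were NaN (nothing to fill from).
df['SysNum'] = numbers.ffill().astype(int)
print(df)
```

Calling pd.to_numeric on the whole Series, rather than through map with a lambda, also avoids one Python-level function call per row.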
CodePudding user response:
The previous two answers describe good approaches to this problem.
My solution just adds some performance, if needed.
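import numpy as np
df['result'] = 0
df.result = np.where(
    df.SysNum.str.len() > 0,       # per-row test; len(df.SysNum) > 0 would be one boolean for the whole column
    df.SysNum.str.strip('"[]'),    # '["1"]' -> '1'
    0                              # empty lists -> 0
)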
Test
import pandas as pd
import time
data = {'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp']*100000,
'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []]*100000
}
df = pd.DataFrame(data)
start = time.time()
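import numpy as np
df['result'] = 0
# Note: len(df.SysNum) > 0 is one boolean for the whole column (its row count),
# not a per-row test, so this times a single .str.strip over every row.
df.result = np.where(
len(df.SysNum)>0,
df.SysNum.str.strip('"[]'),
0
)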
print(time.time() - start)
# SeaBean
from ast import literal_eval
start = time.time()
df['res'] = df['SysNum'].astype(str).apply(literal_eval).explode().astype(float)
print(time.time() - start)
# It_is_Chris
start = time.time()
df['res1'] = df['SysNum'].astype('str').str.strip(r'\'"[]').map(lambda x: pd.to_numeric(x, errors='coerce')).ffill(downcast='infer')
print(time.time() - start)
On my MacBook Air M1 this gives the following timings in seconds, in the order the snippets were run (I hope these results are not specific to Arm, since np.where is normally the fastest solution):
0.3059520721435547
2.994331121444702
4.511949062347412
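If speed matters, another option is to avoid per-row Python calls altogether. The sketch below swaps in a different technique from the ones timed above: a regex with .str.extract, followed by one vectorized pd.to_numeric call. The column name SysNumInt is hypothetical, and I have not benchmarked this against the other snippets, so treat it as something to measure rather than a guaranteed win.

```python
import pandas as pd

data = {
    'Name': ['BN #1', 'HHC', 'A comp', 'B comp', 'BN #2', 'HHC', 'A comp', 'B comp'],
    'SysNum': ['["1"]', [], [], [], '["2"]', [], [], []],
}
df = pd.DataFrame(data)

# Pull the first run of digits out of each cell's string form.
# Cells without digits (the empty lists) become missing values.
digits = df['SysNum'].astype(str).str.extract(r'(\d+)', expand=False)

# One vectorized conversion; the nullable "Int64" dtype keeps integers next to <NA>.
df['SysNumInt'] = pd.to_numeric(digits).astype('Int64')

largest = int(df['SysNumInt'].max())
print(largest)
```

To compare it fairly, time it on the same 800,000-row frame as the other approaches and first check that each approach really produces the values you expect.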