I have below data
system_id software_versions
0 1001 ['VA100', 'VB101']
1 1002 ['VA100', 'VA101']
2 1003 ['VA100', 'VB102','VC101']
I need to return rows which has software_versions beginning with VA only.
In this example, only second row matches this criteria, as it contains versions VA100 and VA101.
But other rows have some other version which is not starting with VA
Expected output
system_id software_versions
1 1002 ['VA100', 'VA101']
I began with this code. I'm not able to proceed further as how I can achieve this.
import pandas as pd
df = pd.DataFrame([[1001,"['VA100', 'VB101']"], [1002,"['VA100', 'VA101']"], [1003,"['VA100', 'VB102','VC101']"]],columns=['system_id', 'software_versions'])
Note: Actual data will come from a csv file, it has more than 1000 entries. I have provided the exact sample data here.
CodePudding user response:
Idea is use generator for test if all values in list starting by VA
and filter in boolean indexing
:
import ast
df['software_versions'] = df['software_versions'].apply(ast.literal_eval)
df = df[df['software_versions'].map(lambda x: all(y.startswith('VA') for y in x))]
print (df)
system_id software_versions
1 1001 [VA100, VA101]