I have a pandas DataFrame and want to keep all "ID"s whose "TCK" entry (a one-element list holding a comma-separated string) contains differing values, i.e. the values are not all identical.
My DataFrame looks like this:
df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "TCK": [["AA, AA, AC"], ["LL, LL"], ["DD , DB, DF, DE"], ["LO , LO, LO, LO, LO, LO"]]})
The desired output should look like this:
df2 = pd.DataFrame({"ID": [1, 3],
                    "TCK": [["AA, AA, AC"], ["DD , DB, DF, DE"]]})
I know one way would be to first split the strings into new columns (at the commas) and then loop over them to identify differing tickers. However, since the IDs have different numbers of tickers, that would also introduce np.nan values and make for a rather complicated solution.
Does anyone know a speedy and elegant solution to this problem?
CodePudding user response:
You may first split the strings, then apply set to check whether there is more than one distinct element, i.e. the tickers differ:
# Split each one-element list's string on commas and strip whitespace
tck = df1.TCK.apply(lambda x: [item.strip() for item in x[0].split(',')])
# Keep rows with more than one distinct ticker
df1[tck.apply(lambda x: len(set(x)) > 1)]
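Putting the pieces together, a runnable sketch of this approach (using the asker's example data) would be:

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "TCK": [["AA, AA, AC"], ["LL, LL"],
                            ["DD , DB, DF, DE"], ["LO , LO, LO, LO, LO, LO"]]})

# Split each one-element list's string on commas and strip whitespace
tck = df1.TCK.apply(lambda x: [item.strip() for item in x[0].split(',')])

# Keep only the rows whose set of tickers has more than one distinct value
df2 = df1[tck.apply(lambda x: len(set(x)) > 1)]
print(df2["ID"].tolist())  # → [1, 3]
```

Stripping whitespace before building the set matters here: entries like "DD " and "LO " carry trailing spaces, and without `strip()` "LO " and "LO" would count as two distinct tickers.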
CodePudding user response:
Using str.extractall:
m = df1['TCK'].str[0].str.extractall(r'(\w+)')[0].groupby(level=0).nunique().gt(1)
df1[m]
output:
ID TCK
0 1 [AA, AA, AC]
2 3 [DD , DB, DF, DE]