Filtering pandas data frame based on different entries in a column (list of comma-separated strings)

Time:08-29

I have a pandas DataFrame and want to filter for all "ID"s whose "TCK" value (a list holding one comma-separated string) contains differing entries, i.e. entries that are not all the same.

My DataFrame looks like this:

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
        "TCK": [["AA, AA, AC"], ["LL, LL"], ["DD , DB, DF, DE"], ["LO , LO, LO, LO, LO, LO"]]})

The desired output should look like this:

df2 = pd.DataFrame({"ID": [1, 3],
        "TCK": [["AA, AA, AC"],["DD , DB, DF, DE"]]})

I know that one way would be to first split the strings into new columns (based on the commas) and then loop over them to identify the differing tickers. However, since there would also be np.nan values, this would be a rather complicated solution.

Does anyone know a speedy and elegant solution to this problem?

CodePudding user response:

You can first split the strings, then apply set and check whether the resulting set has more than one element, i.e. contains different values:

# Split each one-element list's string on commas and strip the whitespace
tck = df1.TCK.apply(lambda x: [item.strip() for item in x[0].split(',')])
# Keep rows whose tickers contain more than one distinct value
df1[tck.apply(lambda x: len(set(x)) > 1)]
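Put together as a self-contained sketch (reconstructing the example df1 from the question):

```python
import pandas as pd

# Example DataFrame from the question: each "TCK" cell is a
# one-element list holding a comma-separated string
df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "TCK": [["AA, AA, AC"], ["LL, LL"],
                            ["DD , DB, DF, DE"], ["LO , LO, LO, LO, LO, LO"]]})

# Split each string on commas and strip surrounding whitespace
tck = df1.TCK.apply(lambda x: [item.strip() for item in x[0].split(',')])

# Keep rows whose tickers contain more than one distinct value
df2 = df1[tck.apply(lambda x: len(set(x)) > 1)]
print(df2["ID"].tolist())  # -> [1, 3]
```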

CodePudding user response:

Using str.extractall:

m = df1['TCK'].str[0].str.extractall(r'(\w+)')[0].groupby(level=0).nunique().gt(1)

df1[m]

output:

   ID                TCK
0   1       [AA, AA, AC]
2   3  [DD , DB, DF, DE]
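As a runnable sketch of this extractall approach (again rebuilding df1 from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3, 4],
                    "TCK": [["AA, AA, AC"], ["LL, LL"],
                            ["DD , DB, DF, DE"], ["LO , LO, LO, LO, LO, LO"]]})

# Pull the string out of each one-element list, extract every word token,
# then count the distinct tokens per original row (index level 0)
m = df1['TCK'].str[0].str.extractall(r'(\w+)')[0].groupby(level=0).nunique().gt(1)

df2 = df1[m]
print(df2["ID"].tolist())  # -> [1, 3]
```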