Find the rows with more than 4 values in a list in a column

The dataframe I have, df:

      name        list

0    kfjh       [[a,b,c],[d,f,h],[g,k,l]]
1    jhkg       [[a,b,c],[d,f,h],[g,k,l],[f,k,j]]
2    khfg       [[a,b,c],[g,k,l]]
3    khkjgr     [[a,b,c],[d,f,h]]
4    kjrgjg     [[d,f,h]]
5    jkdgr      [[a,b,c],[d,f,h],[g,k,l, [g,j,l],[f,l,p]]
6    hgyr       [[a,b,c],[d,kf,h],[g,k,l, [g,j,l],[f,l,p]]
7    jkgtjd     [[f,l,p]]
8    nkjgrd     [t,t,i]

If the list in a row contains more than 4 sublists, I would like that row to go into df1. The desired output, df1:

    name              list


5    jkdgr      [[a,b,c],[d,f,h],[g,k,l, [g,j,l],[f,l,p]]
6    hgyr       [[a,b,c],[d,kf,h],[g,k,l, [g,j,l],[f,l,p]]

and df2:

     name        list

0    kfjh       [[a,b,c],[d,f,h],[g,k,l]]
1    jhkg       [[a,b,c],[d,f,h],[g,k,l],[f,k,j]]
2    khfg       [[a,b,c],[g,k,l]]
3    khkjgr     [[a,b,c],[d,f,h]]
4    kjrgjg     [[d,f,h]]
7    jkgtjd     [[f,l,p]]
8    nkjgrd     [t,t,i]

CodePudding user response：

You can do something like this if the list column holds strings. If it instead holds real lists of lists, skip the split and compare the length of each list to 4 directly.

import pandas as pd

data = {
    'name': ['kfjh', 'jhkg', 'khfg', 'khkjgr', 'kjrgjg', 'jkdgr', 'hgyr', 'jkgtjd', 'nkjgrd'],
    'list': ['[[a,b,c],[d,f,h],[g,k,l]]', '[[a,b,c],[d,f,h],[g,k,l],[f,k,j]]', '[[a,b,c],[g,k,l]]', '[[a,b,c],[d,f,h]]', '[[d,f,h]]', '[[a,b,c],[d,f,h],[g,k,l],[g,j,l],[f,l,p]]', '[[a,b,c],[d,f,h],[g,kf,l],[g,j,l],[f,l,p]]', '[[f,l,p]]', '[t,t,i]']
}

df = pd.DataFrame(data)

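# A string with n sublists contains n + 1 '[' characters, so split('[') yields n + 2 pieces;
# more than 6 pieces therefore means more than 4 sublists.
df['drop'] = df.apply(lambda row : 'no' if len(row['list'].split('[')) > 6 else 'yes', axis = 1)

# df1: rows with more than 4 sublists; df2: all other rows (as in the question)
df1 = df.loc[df['drop'] == 'no']
df2 = df.loc[df['drop'] == 'yes']

df1 = df1.drop(columns=['drop'])
df2 = df2.drop(columns=['drop'])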

print(df1)
print(df2)
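
For the string case, the same count can be obtained without a helper column. The following is only a sketch under the assumption that every cell is a bracketed string like the ones above; the three sample rows are made up for illustration. It counts the opening brackets with Series.str.count and subtracts one for the outermost bracket.

```python
import pandas as pd

# Hypothetical sample data: each cell is a string, not a real list
df = pd.DataFrame({
    "name": ["jhkg", "jkdgr", "nkjgrd"],
    "list": [
        "[[a,b,c],[d,f,h],[g,k,l],[f,k,j]]",
        "[[a,b,c],[d,f,h],[g,k,l],[g,j,l],[f,l,p]]",
        "[t,t,i]",
    ],
})

# Every '[' after the outermost one opens a sublist
n_sublists = df["list"].str.count(r"\[") - 1

more_than_4 = n_sublists > 4
df1 = df[more_than_4]    # rows with more than 4 sublists
df2 = df[~more_than_4]   # all other rows
```

Because the count is based on brackets alone, it counts sublists at any nesting depth, not only the top-level ones.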

CodePudding user response：

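If the list column holds strings, convert them to real lists first. This only works when each string is a valid Python literal, i.e. its elements are quoted:

from ast import literal_eval

df['list'] = df['list'].apply(literal_eval)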
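
The strings shown in the question contain bare tokens such as a and kf, which literal_eval would reject. One possible workaround, sketched here with a hypothetical helper named parse_cell and made-up sample rows, is to quote each token before parsing. It assumes tokens never contain brackets, commas, or whitespace.

```python
import re
from ast import literal_eval

import pandas as pd


def parse_cell(text):
    """Hypothetical helper: quote bare tokens, then parse the string into a list."""
    # Any run of characters that is not a bracket, comma, or whitespace is a token
    quoted = re.sub(r"[^\[\],\s]+", lambda m: repr(m.group(0)), text)
    return literal_eval(quoted)


# Hypothetical sample data in the same shape as the question's strings
df = pd.DataFrame({
    "name": ["khfg", "jkdgr"],
    "list": ["[[a,b,c],[g,k,l]]", "[[a,b,c],[d,f,h],[g,k,l],[g,j,l],[f,l,p]]"],
})

df["list"] = df["list"].apply(parse_cell)  # cells are now real Python lists
lengths = df["list"].map(len)              # number of top-level elements per cell
```

Once the cells are real lists, their lengths can be compared to 4 directly.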

CodePudding user response：

You can use map(len) to get the number of elements in each list in a column. So you could use:

df1 = df[df['list'].map(len) > 4]
df2 = df[df['list'].map(len) <= 4]

This gives the two sets of results you present, provided each cell holds a real list and all of its sublists sit at the top level.
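
Note that len only counts top-level elements. The sketch below uses made-up rows (the names flat5, nested, and plain are hypothetical) to show how that affects the split when a sublist is nested one level deeper or when a cell holds plain strings.

```python
import pandas as pd

# Hypothetical sample data with real Python lists in each cell
df = pd.DataFrame({
    "name": ["flat5", "nested", "plain"],
    "list": [
        [["a"], ["b"], ["c"], ["d"], ["e"]],   # 5 top-level sublists
        [["a"], ["b"], ["c", ["d"], ["e"]]],   # 3 top-level elements, 2 lists nested deeper
        ["t", "t", "i"],                       # 3 strings, no sublists
    ],
})

lengths = df["list"].map(len)  # top-level element count per cell
mask = lengths > 4

df1 = df[mask]    # only "flat5" qualifies
df2 = df[~mask]   # "nested" and "plain" both have length 3
```

If deeper nesting should count as well, a recursive count is needed instead of len.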

CodePudding user response：

Count the nested lists in each row with a recursive function, then split the rows into two dataframes according to that count:

import pandas as pd
def count_lists(l):
    # Count every list nested at any depth inside l (l itself is not counted)
    return sum(1 + count_lists(i) for i in l if isinstance(i, list))

data = {'name': ['kfjh', 'jhkg', 'khfg', 'khkjgr', 'kjrgjg', 'jkdgr', 'hgyr', 'jkgtjd', 'nkjgrd'],
        'list': [[['a','b','c'],['d','f','h'],['g','k','l']], [['a','b','c'],['d','f','h'],['g','k','l'],['f','k','j']],
                 [['a','b','c'],['g','k','l']], [['a','b','c'],['d','f','h']], [['d','f','h']],
                 [['a','b','c'],['d','f','h'],['g','k','l', ['g','j','l'],['f','l','p']]],
                 [['a','b','c'], ['d','kf','h'],['g','k','l', ['g','j','l'], ['f','l','p']]],[['f','l','p']],['t','t','i']]}
dframe = pd.DataFrame(data)
# DataFrame.append was removed in pandas 2.0, so select rows with a boolean mask instead
mask = dframe['list'].map(count_lists) > 4
dframe1 = dframe[~mask]  # rows with 4 or fewer nested lists
dframe2 = dframe[mask]   # rows with more than 4 nested lists
print("Dataframe1:\n", dframe1, "\n")
print("Dataframe2:\n", dframe2)

Result: