Python: list generated from a pandas DataFrame column is much longer than the column

This code should generate a list called 'cat_list' that holds, for each value in df['a'], the position of that value in the 'cat' list. If a value in df['a'] is not present in 'cat', 0 should be appended instead. 'cat_list' should have length 6, but its length is 18 and I am not sure why.

import pandas as pd

d = {'a': [0.1, 0.2,0.3,0.4,0.5,0.6], 'b': [0.6, 0.8,0.3,0.4,0.1,0.1],
     'c': [0.7, 0.3,0.9,0.4,1.0,0.2],'d': [1,0,0,1,0,1]}
df = pd.DataFrame(data=d)

cat=[0.6,0.3,0.1]
cat_list=[]
for i in df.a:
    for j in cat:
        if i == j:
            cat_list.append(cat.index(j))
        else:
            cat_list.append(0)

print(cat_list) # should print [2,0,1,0,0,0]
print(len(cat_list)) # should print 6, not 18

CodePudding user response：

Length-wise, you have a loop over 3 elements nested inside a loop over 6 elements, so the inner body runs 6*3=18 times and appends 18 elements.
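To see the count concretely, here is a minimal sketch that uses plain lists in place of the DataFrame column. The names `values` and `appended` are illustrative stand-ins, not part of the question's code:

```python
# Plain lists standing in for df['a'] and cat (illustrative stand-ins).
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
cat = [0.6, 0.3, 0.1]

appended = []
for i in values:
    for j in cat:
        # Both branches append, so every (i, j) pair adds exactly one element.
        if i == j:
            appended.append(cat.index(j))
        else:
            appended.append(0)

# One element per pair: len(values) * len(cat) = 6 * 3 = 18
print(len(appended))
print(len(appended) == len(values) * len(cat))
```

Because the if and the else each append, the result length depends only on the two loop lengths, not on how many values actually match.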

On every pass of the inner loop you append to cat_list, instead of appending the index only when the item is found and appending 0 once when it is not found at all. I believe this is what you are trying to do:

import pandas as pd

d = {'a': [0.1, 0.2,0.3,0.4,0.5,0.6], 'b': [0.6, 0.8,0.3,0.4,0.1,0.1],
     'c': [0.7, 0.3,0.9,0.4,1.0,0.2],'d': [1,0,0,1,0,1]}
df = pd.DataFrame(data=d)

cat=[0.6,0.3,0.1]
cat_list=[]
for i in df.a:
    found_in_cat=False
    for j in cat:
        if i == j:
            cat_list.append(cat.index(j))
            found_in_cat=True  # remember the match so 0 is not appended as well
            break
    if not found_in_cat:
        cat_list.append(0)

print(cat_list) # should print [2,0,1,0,0,0]
print(len(cat_list)) # should print 6, not 18

I would, however, write it like the following:

import pandas as pd

d = {'a': [0.1, 0.2,0.3,0.4,0.5,0.6], 'b': [0.6, 0.8,0.3,0.4,0.1,0.1],
     'c': [0.7, 0.3,0.9,0.4,1.0,0.2],'d': [1,0,0,1,0,1]}
df = pd.DataFrame(data=d)

cat=[0.6,0.3,0.1]
cat_list=[]
for i in df.a:
    if i in cat:
        cat_list.append(cat.index(i))
    else:
        cat_list.append(0)

print(cat_list) # should print [2,0,1,0,0,0]
print(len(cat_list)) # should print 6, not 18

CodePudding user response：

It is usually inefficient to use loops with dataframes.

You could use map on column "a" with a defaultdict, which ensures that 0 is returned when the value is not found:

from collections import defaultdict
val = defaultdict(lambda :0, zip(cat, range(len(cat))))
df['a'].map(val).tolist()

output: [2, 0, 1, 0, 0, 0]

Alternatively, you could use a list comprehension with a regular dictionary; its get method lets you set a default value when the key is missing:

val = dict(zip(cat, range(len(cat))))
[val.get(e, 0) for e in df['a'].values]

Format of the dictionary/defaultdict used:

>>> val
{0.6: 0, 0.3: 1, 0.1: 2}
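Since the two snippets above rely on `df` and `cat` from the question, here is a self-contained sketch that runs both approaches side by side. The names `positions`, `via_map`, and `via_get` are illustrative, and the DataFrame is reduced to column "a" for brevity:

```python
from collections import defaultdict

import pandas as pd

# Only column 'a' from the question is needed here.
df = pd.DataFrame({'a': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]})
cat = [0.6, 0.3, 0.1]

# Map each category value to its position in cat: {0.6: 0, 0.3: 1, 0.1: 2}
positions = {value: index for index, value in enumerate(cat)}

# Approach 1: Series.map with a defaultdict, so missing keys fall back to 0.
via_map = df['a'].map(defaultdict(lambda: 0, positions)).tolist()

# Approach 2: list comprehension with dict.get and a default of 0.
via_get = [positions.get(value, 0) for value in df['a']]

print(via_map)
print(via_get)
print(via_map == via_get)
```

Note that 0.6 sits at index 0 in `cat`, so it produces the same 0 as a value that is missing. That matches what the question asks for, but the two cases cannot be told apart in the result.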