Home > Software design >  In a column of a dataframe, count number of elements on list starting by "a"
In a column of a dataframe, count number of elements on list starting by "a"

Time:04-29

I have a dataset like

data = {'ID':  ['first_value', 'second_value', 'third_value',
                'fourth_value', 'fifth_value', 'sixth_value'],
        'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
                    ['a22'], ['azz'], ['aaa', 'abd']]
        }
df = pd.DataFrame(data)

And I want to create two columns:

  1. A column that counts the number of elements that start with "a" on 'list_id'
  2. A column that counts the number of elements that DO NOT start with "a" on "list_id"

I was thinking on doing something like:

data['list_id'].apply(lambda x: for entity in x if x.startswith("a") 

I thought on counting first the ones starting with “a” and after counting the ones not starting with “a”, so I did this:

sum(1 for w in data["list_id"] if w.startswith('a'))   

Moreover this does not really work and I cannot make it work.
Any ideas? :)

CodePudding user response:

Assuming this input:

             ID          list_id
0   first_value  [001, ab0, 44A]
1  second_value               []
2   third_value       [005, 006]
3  fourth_value            [a22]
4   fifth_value            [azz]
5   sixth_value       [aaa, abd]

you can use:

sum(1 for l in data['list_id'] for x in l if x.startswith('a'))

output: 5

If you rather want a count per row:

df['starts_with_a'] = [sum(x.startswith('a') for x in l) for l in df['list_id']]
df['starts_with_other'] = df['list_id'].str.len()-df['starts_with_a']

NB. using a list comprehension is faster than apply

output:

             ID          list_id  starts_with_a  starts_with_other
0   first_value  [001, ab0, 44A]              1                  2
1  second_value               []              0                  0
2   third_value       [005, 006]              0                  2
3  fourth_value            [a22]              1                  0
4   fifth_value            [azz]              1                  0
5   sixth_value       [aaa, abd]              2                  0

CodePudding user response:

Using pandas something quite similar to your proposal works:

data = {'ID':  ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
        'list_id': [['001', 'ab0', '44A'], [], ['005', '006'], ['a22'], ['azz'], ['aaa', 'abd']]
        }

df = pd.DataFrame(data)

df["len"] = df.list_id.apply(len)

df["num_a"] = df.list_id.apply(lambda s: sum(map(lambda x: x[0] == "a", s)))
df["num_not_a"] = df["len"] - df["num_a"]
  • Related