I have a dataset like
data = {'ID': ['first_value', 'second_value', 'third_value',
'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'],
['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
And I want to create two columns:
- A column that counts the number of elements that start with "a" on 'list_id'
- A column that counts the number of elements that DO NOT start with "a" on "list_id"
I was thinking on doing something like:
data['list_id'].apply(lambda x: for entity in x if x.startswith("a")
I thought on counting first the ones starting with “a” and after counting the ones not starting with “a”, so I did this:
sum(1 for w in data["list_id"] if w.startswith('a'))
Moreover this does not really work and I cannot make it work.
Any ideas? :)
CodePudding user response:
Assuming this input:
ID list_id
0 first_value [001, ab0, 44A]
1 second_value []
2 third_value [005, 006]
3 fourth_value [a22]
4 fifth_value [azz]
5 sixth_value [aaa, abd]
you can use:
sum(1 for l in data['list_id'] for x in l if x.startswith('a'))
output: 5
If you rather want a count per row:
df['starts_with_a'] = [sum(x.startswith('a') for x in l) for l in df['list_id']]
df['starts_with_other'] = df['list_id'].str.len()-df['starts_with_a']
NB. using a list comprehension is faster than apply
output:
ID list_id starts_with_a starts_with_other
0 first_value [001, ab0, 44A] 1 2
1 second_value [] 0 0
2 third_value [005, 006] 0 2
3 fourth_value [a22] 1 0
4 fifth_value [azz] 1 0
5 sixth_value [aaa, abd] 2 0
CodePudding user response:
Using pandas something quite similar to your proposal works:
data = {'ID': ['first_value', 'second_value', 'third_value', 'fourth_value', 'fifth_value', 'sixth_value'],
'list_id': [['001', 'ab0', '44A'], [], ['005', '006'], ['a22'], ['azz'], ['aaa', 'abd']]
}
df = pd.DataFrame(data)
df["len"] = df.list_id.apply(len)
df["num_a"] = df.list_id.apply(lambda s: sum(map(lambda x: x[0] == "a", s)))
df["num_not_a"] = df["len"] - df["num_a"]