I have a dataframe where the column col
contains an arbitrarily long list of tuples (each tuple is of length 3). Consider the following example.
import numpy as np
import pandas as pd
df = \
pd.DataFrame(columns=['date', 'col'],
data = [["2009-09-01", "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
["2009-09-01", "[(206, 1, 3)]"],
["2009-09-01", np.NaN],
["2009-09-01", "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])
I want to extract how many times the first number in each tuple occurs in each row. The first number can be only one of the values (values: list = [201, 202, 203, 204, 205, 206]
) (e.g. 301
does not occur), and the numbers at position 1
and 2
in the tuple are irrelevant.
I now want to create six new columns, one for each value in values
, containing the number of occurrences of this number in position 0 of the tuples, e.g. column 201
contains 1
in row 0, since 201
is at position 0 of one tuple in that row.
Expected output is this:
# date col a b c d e f
# 0 2009-09-01 [(205, 0, 2), (206, 4, 6), (206, 6, 8)] 1 0 0 0 0 2
# 1 2009-09-01 [(206, 1, 3)] 0 0 0 0 0 1
# 2 2009-09-01 NaN 0 0 0 0 0 0
# 3 2009-09-01 [(201, 0, 2), (201, 4, 6, ), (202, 7, 9)] 2 1 0 0 0 0
# 4 2009-09-01 [(202, 0, 2), (203, 4, 6, ), (204, 7, 9)] 0 1 1 1 0 0
# 5 2009-09-01 [(202, 0, 2), (202, 4, 6, ), (202, 7, 9)] 0 3 0 0 0 0
I am thinking of something like this:
for v in values():
df[str(v)] = # number of occurrences of k at position 0 in col
I have experimented with different combinations of apply(lambda x: ...)
and str.get
without success; can someone give me a hint?
CodePudding user response:
Use ast.literal_eval
to convert the string representations of lists to lists. Then use Series.apply
to apply the logic row-wise to every element of 'col' column, and create the extra columns. Finally, use DataFrame.join
to add those extra columns to the original DataFrame
import ast
import pandas as pd
from collections import Counter
df = \
pd.DataFrame(columns=['date', 'col'],
data = [["2009-09-01", "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
["2009-09-01", "[(206, 1, 3)]"],
["2009-09-01", np.NaN],
["2009-09-01", "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])
# create a set instead of a list for O(1) membership test
values = {201, 202, 203, 204, 205, 206}
# to rename columns as in the example
column_mapper = dict(zip(sorted(values), "abcdef"))
def count_values(lst_str):
# initialize the counts to zero for every element of values
counts = Counter({n: 0 for n in values})
if pd.isna(lst_str):
return pd.Series(counts) # if it's Nan return zero counts
# convert string to list
lst = ast.literal_eval(lst_str)
# update the counts based on first element of each tuple
counts.update(x for x, *_ in lst if x in values)
return pd.Series(counts)
df = (
df.join(df['col'].apply(count_values))
.rename(columns=column_mapper) # optional columns renaming
)
Output
>>> df
date col a b c d e f
0 2009-09-01 [(201, 0, 2), (206, 4, 6), (206, 6, 8)] 1 0 0 0 0 2
1 2009-09-01 [(206, 1, 3)] 0 0 0 0 0 1
2 2009-09-01 NaN 0 0 0 0 0 0
3 2009-09-01 [(201, 0, 2), (201, 4, 6, ), (202, 7, 9)] 2 1 0 0 0 0
4 2009-09-01 [(202, 0, 2), (203, 4, 6, ), (204, 7, 9)] 0 1 1 1 0 0
5 2009-09-01 [(202, 0, 2), (202, 4, 6, ), (202, 7, 9)] 0 3 0 0 0 0
CodePudding user response:
Instead of iterating over all rows you could use apply
, which should be significantly faster:
import pandas as pd
import numpy as np
import ast
df = pd.DataFrame(columns=['date', 'col'],
data = [["2009-09-01", "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
["2009-09-01", "[(206, 1, 3)]"],
["2009-09-01", np.NaN],
["2009-09-01", "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
["2009-09-01", "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])
values = [201, 202, 203, 204, 205, 206]
# converting strings to lists
df["col"] = df["col"].apply(lambda x: x if type(x)!= str else ast.literal_eval(x))
def get_index_from_list_of_tuples(l: list, v: int):
try:
return list(map(lambda t: t[0], l)).index(v) 1
except:
return 0
for v in values:
df[v] = df["col"].apply(lambda x: get_index_from_list_of_tuples(x, v))
Output:
date col 201 202 203 204 205 206
0 2009-09-01 [(201, 0, 2), (206, 4, 6), (206, 6, 8)] 1 0 0 0 0 2
1 2009-09-01 [(206, 1, 3)] 0 0 0 0 0 1
2 2009-09-01 NaN 0 0 0 0 0 0
3 2009-09-01 [(201, 0, 2), (201, 4, 6), (202, 7, 9)] 1 3 0 0 0 0
4 2009-09-01 [(202, 0, 2), (203, 4, 6), (204, 7, 9)] 0 1 2 3 0 0
5 2009-09-01 [(202, 0, 2), (202, 4, 6), (202, 7, 9)] 0 1 0 0 0 0
Note: you can move literal_eval
inside the function for more efficiency. It's not clear why you'd keep your lists as strings though.