extract each first item from list of tuples in pandas column-CodePudding

I have a dataframe where the column col contains an arbitrarily long list of tuples (each tuple is of length 3). Consider the following example.

import numpy as np
import pandas as pd

df = \
    pd.DataFrame(columns=['date',         'col'],
                  data = [["2009-09-01",  "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
                          ["2009-09-01",  "[(206, 1, 3)]"],
                          ["2009-09-01",  np.NaN],
                          ["2009-09-01",  "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])

I want to extract how many times the first number in each tuple occurs in each row. The first number can be only one of the values (values: list = [201, 202, 203, 204, 205, 206]) (e.g. 301 does not occur), and the numbers at position 1 and 2 in the tuple are irrelevant.

I now want to create six new columns, one for each value in values, containing the number of occurrences of this number in position 0 of the tuples, e.g. column 201 contains 1 in row 0, since 201 is at position 0 of one tuple in that row.

Expected output is this:

#          date                                        col   a  b  c  d  e  f
# 0  2009-09-01    [(205, 0, 2), (206, 4, 6), (206, 6, 8)]   1  0  0  0  0  2
# 1  2009-09-01                              [(206, 1, 3)]   0  0  0  0  0  1
# 2  2009-09-01                                        NaN   0  0  0  0  0  0
# 3  2009-09-01  [(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]   2  1  0  0  0  0
# 4  2009-09-01  [(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]   0  1  1  1  0  0
# 5  2009-09-01  [(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]   0  3  0  0  0  0

I am thinking of something like this:

for v in values():
    df[str(v)] = # number of occurrences of k at position 0 in col

I have experimented with different combinations of apply(lambda x: ...) and str.get without success; can someone give me a hint?

CodePudding user response：

Use ast.literal_eval to convert the string representations of lists to lists. Then use Series.apply to apply the logic row-wise to every element of 'col' column, and create the extra columns. Finally, use DataFrame.join to add those extra columns to the original DataFrame

import ast 
import pandas as pd
from collections import Counter

df = \
    pd.DataFrame(columns=['date',         'col'],
                  data = [["2009-09-01",  "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
                          ["2009-09-01",  "[(206, 1, 3)]"],
                          ["2009-09-01",  np.NaN],
                          ["2009-09-01",  "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])

# create a set instead of a list for O(1) membership test
values = {201, 202, 203, 204, 205, 206}
# to rename columns as in the example
column_mapper = dict(zip(sorted(values), "abcdef"))

def count_values(lst_str):
    # initialize the counts to zero for every element of values
    counts = Counter({n: 0 for n in values})  

    if pd.isna(lst_str):
        return pd.Series(counts)  # if it's Nan return zero counts
    # convert string to list 
    lst = ast.literal_eval(lst_str)  
    # update the counts based on first element of each tuple 
    counts.update(x for x, *_ in lst if x in values)
    return pd.Series(counts)


df = (
    df.join(df['col'].apply(count_values))
      .rename(columns=column_mapper)  # optional columns renaming
)

Output

>>> df

         date                                        col  a  b  c  d  e  f
0  2009-09-01    [(201, 0, 2), (206, 4, 6), (206, 6, 8)]  1  0  0  0  0  2
1  2009-09-01                              [(206, 1, 3)]  0  0  0  0  0  1
2  2009-09-01                                        NaN  0  0  0  0  0  0
3  2009-09-01  [(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]  2  1  0  0  0  0
4  2009-09-01  [(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]  0  1  1  1  0  0
5  2009-09-01  [(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]  0  3  0  0  0  0

CodePudding user response：

Instead of iterating over all rows you could use apply, which should be significantly faster:

import pandas as pd
import numpy as np
import ast

df = pd.DataFrame(columns=['date',         'col'],
                  data = [["2009-09-01",  "[(201, 0, 2), (206, 4, 6), (206, 6, 8)]"],
                          ["2009-09-01",  "[(206, 1, 3)]"],
                          ["2009-09-01",  np.NaN],
                          ["2009-09-01",  "[(201, 0, 2), (201, 4, 6, ), (202, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (203, 4, 6, ), (204, 7, 9)]"],
                          ["2009-09-01",  "[(202, 0, 2), (202, 4, 6, ), (202, 7, 9)]"]])

values = [201, 202, 203, 204, 205, 206]
# converting strings to lists
df["col"] = df["col"].apply(lambda x: x if type(x)!= str else ast.literal_eval(x))

def get_index_from_list_of_tuples(l: list, v: int):
    try:
        return list(map(lambda t: t[0], l)).index(v)   1
    except:
        return 0

for v in values:
    df[v] = df["col"].apply(lambda x: get_index_from_list_of_tuples(x, v))

Output:

         date                                      col  201  202  203  204  205  206
0  2009-09-01  [(201, 0, 2), (206, 4, 6), (206, 6, 8)]    1    0    0    0    0    2
1  2009-09-01                            [(206, 1, 3)]    0    0    0    0    0    1
2  2009-09-01                                      NaN    0    0    0    0    0    0
3  2009-09-01  [(201, 0, 2), (201, 4, 6), (202, 7, 9)]    1    3    0    0    0    0
4  2009-09-01  [(202, 0, 2), (203, 4, 6), (204, 7, 9)]    0    1    2    3    0    0
5  2009-09-01  [(202, 0, 2), (202, 4, 6), (202, 7, 9)]    0    1    0    0    0    0

Note: you can move literal_eval inside the function for more efficiency. It's not clear why you'd keep your lists as strings though.