Home > Enterprise >  Get a 'history' of items from JSON column of list of dictionaries
Get a 'history' of items from JSON column of list of dictionaries

Time:04-18

I have a dataframe that has the following format (simplified):

userid is a string, purchased_items is a JSON string (in the format of a list of dictionaries).

userid purchased_items
U123 [{"itemid":"I456","t":1649722121709}, {"itemid":"I789","t":1649722118064}, {itemid:"I123", "t": 1649722034610}]
U456 [{"itemid":"I123","t":1649721903646}]

where itemid is what the user purchased, 't' is the timestamp at which it occurred (the dictionaries are sorted according to descending order of timestamp, so the earliest timestamps are first.).

What I want: For each user and timestamp, get a list of all the items that was purchased BEFORE and DURING the current timestamp

userid timestamp previously purchased items
U123 1649722121709 I456
U123 1649722118064 I456 I789
U123 1649722034610 I456 I789 I123
U456 1649721903646 I123

Edit-What I've Tried So I've tried to broken down the problem by row first, and plan to create a function from the code below and apply it to each row.

visits = df[df['userid']=='U123']['purchased_items'].values[0] #visits is of type list
dict_of_rows = {}
history = []
for i in visits:
  if i is not None:
    obj = json.loads(i)
    dict_of_rows['timestamp'] = obj['t']
    dict_of_rows['history'] = history
    history.append(obj['itemid'])

What I get:

dict_of_rows
{'history': ['I456', 'I789', 'I123'],
 'timestamp': 1649722034610}

This gives me only one row of what I want (row3 of the output table), I want to be able to get the first 3 rows of output table.

And my plan is to use this dictionary of rows to build another dataframe that gives the output table.

CodePudding user response:

import json
import pandas as pd

df = pd.read_csv(StringIO('''userid|purchased_items
U123|[{"itemid":"I456","t":1649722121709}, {"itemid":"I789","t":1649722118064}, {"itemid":"I123", "t": 1649722034610}]
U456|[{"itemid":"I123","t":1649721903646}]'''), delimiter='|')


df['purchased_items'] = df['purchased_items'].apply(lambda x: [list(i.values()) for i in json.loads(x)])
df = df.explode(column='purchased_items')

df['timestamp'] = df['purchased_items'].str[1]
df['previously purchased items'] = df['purchased_items'].str[0]


df['previously purchased items'] = df.groupby('userid')['previously purchased items']\
                .transform(lambda x: pd.concat((x.shift(n) for n in range(x.shape[0])), axis=1, ignore_index=True).values.tolist())

df.drop(columns=['purchased_items'], inplace=True)

df
userid timestamp previously purchased items
0 U123 1649722121709 ['I456', nan, nan]
0 U123 1649722118064 ['I789', 'I456', nan]
0 U123 1649722034610 ['I123', 'I789', 'I456']
1 U456 1649721903646 ['I123']

CodePudding user response:

You can convert original list of dictionary purchased_items column to multiple columns with explode and pd.json_normalize.

df = df.explode('purchased_items').reset_index()
df = pd.concat([df['userid'], pd.json_normalize(df['purchased_items'])], axis=1).sort_values(['userid', 't'], ascending=[True, False])
print(df)

  userid itemid              t
0   U123   I456  1649722121709
1   U123   I789  1649722118064
2   U123   I123  1649722034610
3   U456   I123  1649721903646

Then you can group by userid column then join previous row of current row in each group by apply on group rows.

out = df.groupby('userid').apply(lambda group: group.apply(lambda row: ' '.join(group.reset_index().loc[:row.name, 'itemid']).strip(), axis=1)).reset_index(drop=True)
print(out)

0              I456
1         I456 I789
2    I456 I789 I123
3              I123

Finally, add the returned Series to original dataframe

df['previously purchased items'] = out
df = df.drop(columns=['itemid']).rename(columns={'t': 'timestamp'})
print(df_)

  userid      timestamp previously purchased items
0   U123  1649722121709                       I456
1   U123  1649722118064                  I456 I789
2   U123  1649722034610             I456 I789 I123
3   U456  1649721903646                       I123

If you want to exclude current purchased item, you can change the groupby to following

out = df.groupby('userid').apply(lambda group: group.assign(temp=group['itemid'].shift(fill_value='')).pipe(lambda g: g.reset_index().apply(lambda row: ' '.join(g.loc[:row.name, 'temp']).strip(), axis=1))).reset_index(drop=True)
print(df)

  userid      timestamp previously purchased items
0   U123  1649722121709
1   U123  1649722118064                       I456
2   U123  1649722034610                  I456 I789
3   U456  1649721903646
  • Related