I have a dataframe that has the following format (simplified):
userid
is a string, purchased_items
is a JSON string (in the format of a list of dictionaries).
userid | purchased_items |
---|---|
U123 | [{"itemid":"I456","t":1649722121709}, {"itemid":"I789","t":1649722118064}, {itemid:"I123", "t": 1649722034610}] |
U456 | [{"itemid":"I123","t":1649721903646}] |
where itemid is what the user purchased, 't' is the timestamp at which it occurred (the dictionaries are sorted according to descending order of timestamp, so the earliest timestamps are first.).
What I want: For each user and timestamp, get a list of all the items that was purchased BEFORE and DURING the current timestamp
userid | timestamp | previously purchased items |
---|---|---|
U123 | 1649722121709 | I456 |
U123 | 1649722118064 | I456 I789 |
U123 | 1649722034610 | I456 I789 I123 |
U456 | 1649721903646 | I123 |
Edit-What I've Tried So I've tried to broken down the problem by row first, and plan to create a function from the code below and apply it to each row.
visits = df[df['userid']=='U123']['purchased_items'].values[0] #visits is of type list
dict_of_rows = {}
history = []
for i in visits:
if i is not None:
obj = json.loads(i)
dict_of_rows['timestamp'] = obj['t']
dict_of_rows['history'] = history
history.append(obj['itemid'])
What I get:
dict_of_rows
{'history': ['I456', 'I789', 'I123'],
'timestamp': 1649722034610}
This gives me only one row of what I want (row3 of the output table), I want to be able to get the first 3 rows of output table.
And my plan is to use this dictionary of rows to build another dataframe that gives the output table.
CodePudding user response:
import json
import pandas as pd
df = pd.read_csv(StringIO('''userid|purchased_items
U123|[{"itemid":"I456","t":1649722121709}, {"itemid":"I789","t":1649722118064}, {"itemid":"I123", "t": 1649722034610}]
U456|[{"itemid":"I123","t":1649721903646}]'''), delimiter='|')
df['purchased_items'] = df['purchased_items'].apply(lambda x: [list(i.values()) for i in json.loads(x)])
df = df.explode(column='purchased_items')
df['timestamp'] = df['purchased_items'].str[1]
df['previously purchased items'] = df['purchased_items'].str[0]
df['previously purchased items'] = df.groupby('userid')['previously purchased items']\
.transform(lambda x: pd.concat((x.shift(n) for n in range(x.shape[0])), axis=1, ignore_index=True).values.tolist())
df.drop(columns=['purchased_items'], inplace=True)
df
userid | timestamp | previously purchased items | |
---|---|---|---|
0 | U123 | 1649722121709 | ['I456', nan, nan] |
0 | U123 | 1649722118064 | ['I789', 'I456', nan] |
0 | U123 | 1649722034610 | ['I123', 'I789', 'I456'] |
1 | U456 | 1649721903646 | ['I123'] |
CodePudding user response:
You can convert original list of dictionary purchased_items
column to multiple columns with explode
and pd.json_normalize
.
df = df.explode('purchased_items').reset_index()
df = pd.concat([df['userid'], pd.json_normalize(df['purchased_items'])], axis=1).sort_values(['userid', 't'], ascending=[True, False])
print(df)
userid itemid t
0 U123 I456 1649722121709
1 U123 I789 1649722118064
2 U123 I123 1649722034610
3 U456 I123 1649721903646
Then you can group by userid
column then join previous row of current row in each group by apply
on group rows.
out = df.groupby('userid').apply(lambda group: group.apply(lambda row: ' '.join(group.reset_index().loc[:row.name, 'itemid']).strip(), axis=1)).reset_index(drop=True)
print(out)
0 I456
1 I456 I789
2 I456 I789 I123
3 I123
Finally, add the returned Series to original dataframe
df['previously purchased items'] = out
df = df.drop(columns=['itemid']).rename(columns={'t': 'timestamp'})
print(df_)
userid timestamp previously purchased items
0 U123 1649722121709 I456
1 U123 1649722118064 I456 I789
2 U123 1649722034610 I456 I789 I123
3 U456 1649721903646 I123
If you want to exclude current purchased item, you can change the groupby
to following
out = df.groupby('userid').apply(lambda group: group.assign(temp=group['itemid'].shift(fill_value='')).pipe(lambda g: g.reset_index().apply(lambda row: ' '.join(g.loc[:row.name, 'temp']).strip(), axis=1))).reset_index(drop=True)
print(df)
userid timestamp previously purchased items
0 U123 1649722121709
1 U123 1649722118064 I456
2 U123 1649722034610 I456 I789
3 U456 1649721903646