I have a pandas timeseries such as:
import pandas as pd

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "A", "C"],
            ["A", "B", "A", "D", "C", "D"],
        ],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=[
        pd.to_datetime("2022-01-01 12:00:00"),
        pd.to_datetime("2022-01-01 12:00:01"),
        pd.to_datetime("2022-01-01 12:00:02"),
    ],
)
This produces:
                                    IDs     distance
2022-01-01 12:00:00              [A, B]          [1]
2022-01-01 12:00:01        [A, B, A, C]   [1.1, 2.8]
2022-01-01 12:00:02  [A, B, A, D, C, D]  [1, 3, 0.5]
I would like to split the rows so that each row contains only the distance for a single pair, such as:
                        IDs  distance
2022-01-01 12:00:00  [A, B]       1.0
2022-01-01 12:00:01  [A, B]       1.1
2022-01-01 12:00:01  [A, C]       2.8
2022-01-01 12:00:02  [A, B]       1.0
2022-01-01 12:00:02  [A, D]       3.0
2022-01-01 12:00:02  [C, D]       0.5
What would be the best way to do this?
For information, each entry in the IDs column is always a list of 2n elements, where n is the number of elements in the corresponding distance entry.
CodePudding user response:
IIUC, group the IDs by chunks of 2 using a list comprehension, then explode
the two IDs/distance columns:
df['IDs'] = [[l[i:i+2] for i in range(0, len(l), 2)] for l in df['IDs']]
df = df.explode(['IDs', 'distance'])
NB. this requires len(IDs) to be 2 times len(distance) for each row!
output:
                        IDs  distance
2022-01-01 12:00:00  [A, B]         1
2022-01-01 12:00:01  [A, B]       1.1
2022-01-01 12:00:01  [A, C]       2.8
2022-01-01 12:00:02  [A, B]         1
2022-01-01 12:00:02  [A, D]         3
2022-01-01 12:00:02  [C, D]       0.5
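For reference, here is a self-contained sketch of the chunk-then-explode approach that also checks the length requirement up front. The helper name pair_up and the final cast to float are my own additions for illustration, not part of the answer above, and exploding several columns at once assumes pandas 1.3 or later.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "A", "C"],
            ["A", "B", "A", "D", "C", "D"],
        ],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)

# Assumption from the question: each IDs list has exactly twice as many
# elements as the matching distance list. Fail early if that does not hold.
lengths_ok = df["IDs"].map(len) == 2 * df["distance"].map(len)
if not lengths_ok.all():
    raise ValueError("each IDs list must be twice as long as its distance list")


def pair_up(ids):
    """Hypothetical helper: split a flat list into consecutive chunks of 2."""
    return [ids[i:i + 2] for i in range(0, len(ids), 2)]


# Nest the IDs into pairs, then explode both columns in lockstep.
out = df.assign(IDs=df["IDs"].map(pair_up)).explode(["IDs", "distance"])

# explode leaves an object column; cast to float to match the desired output.
out["distance"] = out["distance"].astype(float)
print(out)
```

Using assign instead of overwriting df['IDs'] leaves the original frame untouched, which is handy if you want to re-run the step while experimenting.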
CodePudding user response:
Similar to mozway's answer, but using NumPy's reshape to split up the values in the IDs lists:
import numpy as np

df = df.assign(IDs=[np.array(l).reshape(-1, 2) for l in df['IDs'].tolist()]).explode(['IDs', 'distance'])
Output:
>>> df
                        IDs  distance
2022-01-01 12:00:00  [A, B]         1
2022-01-01 12:00:01  [A, B]       1.1
2022-01-01 12:00:01  [A, C]       2.8
2022-01-01 12:00:02  [A, B]         1
2022-01-01 12:00:02  [A, D]         3
2022-01-01 12:00:02  [C, D]       0.5
CodePudding user response:
Similar to mozway's solution but letting NumPy do the reshaping:
df.assign(IDs=df['IDs'].apply(lambda l: np.array(l).reshape(-1, 2))).explode(['IDs', 'distance'])
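A runnable sketch of the NumPy variant, in case it helps. The only deviation from the one-liners above is the extra .tolist() call, which I added so the exploded cells are plain Python lists rather than NumPy arrays; drop it if arrays are fine for you. As before, multi-column explode assumes pandas 1.3 or later.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "IDs": [
            ["A", "B"],
            ["A", "B", "A", "C"],
            ["A", "B", "A", "D", "C", "D"],
        ],
        "distance": [[1], [1.1, 2.8], [1, 3, 0.5]],
    },
    index=pd.to_datetime(
        ["2022-01-01 12:00:00", "2022-01-01 12:00:01", "2022-01-01 12:00:02"]
    ),
)

# reshape(-1, 2) turns a flat list of 2n IDs into n rows of 2.
# It raises ValueError if a list has an odd number of elements.
pairs = [np.array(ids).reshape(-1, 2).tolist() for ids in df["IDs"]]

# Explode the nested pairs and the distances together, one pair per row.
out = df.assign(IDs=pairs).explode(["IDs", "distance"])
print(out)
```

Note that reshape only guards against odd-length IDs lists; a mismatch between the number of pairs and the number of distances is caught later by explode itself.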