I have a dataset similar to the below that is spatial. Has an "ID", "Assay" that comes from an analytical machine, From interval and To interval. I want to have it look by ID and Assay as it goes down from top to bottom, and find the repeated Assay values and lump them if they are repeating (right after the other). I tried using groupby and aggregate but ended up with it lumping any Assay value that was similar together, I only want it to put together if its back to back. Hopefully the example below makes sense. Thanks in advance! The result is what I want, but the code won't get me that.
import pandas as pd
df = pd.DataFrame({
"ID": [ 1, 1, 1, 1, 2, 2, 3, 3, 5, 5, 5, 5],
"Assay": [ 3, 3, 4, 3, 3, 6, 4, 4, 1, 1, 2, 2],
"From": [ 7, 8, 9,10, 0, 8,12,15, 0, 5,10,15],
"To": [13,14,15,16,17,18,13,100,5,10,15,25]
})
result = df.groupby(["ID", "Assay"]).agg({"From":['first'], "To":['last']})
Expected output:
From To
first last
ID Assay
1 3 7 14
4 9 15
3 10 16
2 3 0 17
6 8 18
3 4 12 100
5 1 0 5
1 5 10
2 10 15
2 15 25
CodePudding user response:
We could use diff
ne
cumsum
to create groups out of consecutive Assays; then filter the Assays that are greater than 3 and use groupby.agg
for aggregation.
Then concatenate this result with the rows that were filtered for the final output:
df['groups'] = df['Assay'].diff().ne(0).cumsum()
msk = df['Assay'].ge(3)
tmp = (df[msk].groupby(['ID','Assay', 'groups'], sort=False)
.agg({'From':'first', 'To':'last'}).reset_index())
out = pd.concat((tmp, df[~msk])).sort_values('groups').drop(columns='groups').reset_index(drop=True)
Output:
ID Assay From To
0 1 3 7 14
1 1 4 9 15
2 1 3 10 16
3 2 3 0 17
4 2 6 8 18
5 3 4 12 100
6 5 1 0 5
7 5 1 5 10
8 5 2 10 15
9 5 2 15 25
CodePudding user response:
itertools.groupby can help you with it or you can take a look at convtools based solution:
from convtools.contrib.tables import Table
from convtools import conversion as c
iter_rows = Table.from_csv("input.csv", header=True).into_iter_rows(dict)
# store the converter in a variable for further reuse;
# this is a normal ad hoc function
converter = (
c.chunk_by(c.item("ID"), c.item("Assay"))
.aggregate(
{
"ID": c.ReduceFuncs.First(c.item("ID")),
"Assay": c.ReduceFuncs.First(c.item("Assay")),
"From first": c.ReduceFuncs.First(c.item("From")),
"To last": c.ReduceFuncs.Last(c.item("To")),
}
)
.gen_converter()
)
iter_new_rows = converter(iter_rows)
assert list(iter_new_rows) == [
{'ID': '1', 'Assay': '3', 'From first': '7', 'To last': '14'},
{'ID': '1', 'Assay': '4', 'From first': '9', 'To last': '15'},
{'ID': '1', 'Assay': '3', 'From first': '10', 'To last': '16'},
{'ID': '2', 'Assay': '3', 'From first': '0', 'To last': '17'},
{'ID': '2', 'Assay': '6', 'From first': '8', 'To last': '18'},
{'ID': '3', 'Assay': '4', 'From first': '12', 'To last': '100'},
{'ID': '5', 'Assay': '1', 'From first': '0', 'To last': '10'},
{'ID': '5', 'Assay': '2', 'From first': '10', 'To last': '25'}]
# # or if a csv file is needed
# Table.from_rows(iter_new_rows).into_csv("output.csv")