I am writing a python program that will parse a large dataframe (tens of thousands of lines) into smaller dataframes based on a column value, and it needs to be fairly efficient, because the user can change the ways they break up the dataframe, and I would like the output to update dynamically.
Example input:
id | Column_1 | Column_2 |
---|---|---|
1 | Oct | 10000$ |
1 | Dec | 9000$ |
2 | Oct | 3400$ |
3 | Dec | 20000$ |
2 | Nov | 9000$ |
1 | Nov | 15000$ |
Example Output:
id | Column_1 | Column_2 |
---|---|---|
1 | Oct | 10000$ |
1 | Nov | 15000$ |
1 | Dec | 9000$ |
id | Column_1 | Column_2 |
---|---|---|
2 | Oct | 3400$ |
2 | Nov | 9000$ |
id | Column_1 | Column_2 |
---|---|---|
3 | Dec | 20000$ |
The naïve way, in my mind, is to do something like this:
for id in list(df['id'].unique()):
filtered_df = df[df['id'] == id]
But I believe this would be looping over the same data more times than is necessary, which is inefficient. Is there a fast way of doing this?
Update
Did a little software drag racing. Here are the results:
%%timeit
[df.loc[df.id.eq(i)] for i in df.id.unique()]
9.96 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
dflist=[]
dflist2=[]
for k,v in df.groupby(['id']):
var='id' str(k)
dflist.append(var)
globals()[var] = v
dflist2.append(v)
1.28 ms ± 92.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
d = {id:df[df.id==id] for id in df.id.unique()}
9.19 ms ± 885 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Does anyone know why the second solution would be so much faster?
CodePudding user response:
here is one way to do it
# create lists to hold name and the trimmed data from df
dflist=[]
dflist2=[]
for k,v in df.groupby(['id']):
var='id' str(k)
# append dataframe name in a list
dflist.append(var)
# create a DF with a variable name
globals()[var] = v
# append dataframe in a list
dflist2.append(v)
# to list all the dataframe created
>>> dflist
['id1', 'id2', 'id3']
# accessing a dataframe
>>> id1
id Column_1 Column_2
0 1 Oct 10000$
1 1 Dec 9000$
5 1 Nov 15000$
>> dflist2
[ id Column_1 Column_2
0 1 Oct 10000$
1 1 Dec 9000$
5 1 Nov 15000$,
id Column_1 Column_2
2 2 Oct 3400$
4 2 Nov 9000$,
id Column_1 Column_2
3 3 Dec 20000$]
CodePudding user response:
A possible solution:
[df.loc[df.id.eq(i)] for i in df.id.unique()]
Output:
[
id Column_1 Column_2
0 1 Oct 10000$
1 1 Dec 9000$
5 1 Nov 15000$,
id Column_1 Column_2
2 2 Oct 3400$
4 2 Nov 9000$,
id Column_1 Column_2
3 3 Dec 20000$
]
CodePudding user response:
You can use dictionary comprehension for that:
Input dataframe df
:
d = {id:df[df.id==id] for id in df.id.unique()}
Use d[1]
to get the (sub)dataframe for ID 1
.
Use d[2]
to get the (sub)dataframe for ID 2
.
Use d[3]
to get the (sub)dataframe for ID 3
.
d[2]
Result:
id Column_1 Column_2
2 2 Oct 3400
4 2 Nov 9000