First timer posting here and new to Python, so apologies in advance if I am missing any key information below.
Essentially, I have a large CSV file that I was able to clean up a bit on scripts that contains various numerical values over ~150 miles of data with each data line being one foot. After I clean the file up a bit, tables would typically look like something below:
ABC Mile Ft Param1
A 1 1000 0.1234
A 1 1001 0.1111
A 1 1002 0.1221
A 1 1003 0.1511
B 1 1004 0.1999
B 1 1005 0.2011
B 1 1006 0.1878
B 1 1007 0.1999
C 1 1008 0.5321
C 1 1009 0.5333
C 1 1010 0.5445
C 1 1011 0.5655
C 1 1012 0.5852
A 1 1013 0.2788
A 1 1014 0.2899
A 1 1015 0.2901
A 1 1016 0.2921
A 1 1017 0.2877
A 1 1018 0.2896
For this file, the 'ABC' column will always only equal A, B, or C.
What I am trying to do is average the Param1 numbers for each set of A, B, and C. Thus in the example above, I would be looking to get the average of Param1 when it equals A from Ft 1000 to 1003, when it equals B from Ft 1004 to 1007, when it equals C from Ft 1008 to 1012, when it equals A from 1013 to 1018 and so on for the rest of the file.
Edit I should also mention that in these files, ABC will equal the same value typically for several hundred rows until it equals another value that will again repeat for several hundred rows, and so on. So the 'ABC' column could values could be something like this:
AAA...AAA BBB...BBB CCC...CCC BBB...BBB AAA...AAA
I have been looking at use of a for loop as below, but the problem is that I get all the averages of Param1 when equals A over a full mile, not each grouping. This is what I have thus far:
for i in range(1,df['Mile'].max()):
avg_p1 = df.loc[(df['Mile'] == i) & (df['ABC'] =='A'), 'Param1'].mean()
print(avg_p1)
But in this case, I get the average of Param1 when ABC = A over the full mile. In the table example above, I want the average of Param1 when ABC = A from Ft 1000 to 1003 and 1013 to 1018, as separate averages repeated through the whole document.
Would there need to be a second for loop or some kind of if/else condition added to the existing loop above? Any help for this novice programmer would be much appreciated :)
CodePudding user response:
Thank you for this interesting question.
The idea is to create a group for each continuous value 'A', 'B', or 'C' until it changes. I also assume that your data is already sorted by mile
df['change'] = np.where(df['ABC']!=df['ABC'].shift(1),1.0,0.0)
Now you simply cumsum
to create new group indicator
df['gr'] = df['change'].cumsum()
Everything should be fine now and you can use groupby
to get what you want
df.groupby('gr')['Param1'].mean().reset_index()
CodePudding user response:
df.groupby('ABC')['Ft'].mean()
output:
ABC
A 1009.9
B 1005.5
C 1010.0
Name: Ft, dtype: float64
CodePudding user response:
First, get a list of the bins for each category, then you can do the average by category and bin... something like this:
results = {}
for cat in df['ABC'].unique():
results[cat] = []
category_index = df[df['ABC'] == cat].index.to_series()
# Get list of continuous indexes
bins = category_index.groupby(
category_index.diff().ne(1).cumsum()
).agg(['first','last']).apply(tuple,1).tolist()
# Average by category and bin
for bin in bins:
bin_low, bin_high = bin
df_cut = df.iloc[bin_low:bin_high]
low_ft, high_ft = df.iloc[bin_low]['Ft'], df.iloc[bin_high]['Ft']
average_value = df_cut.groupby('ABC').mean()['Param1'][cat]
results[cat].append(((low_ft, high_ft), average_value))
results
Output:
{
'A': [
((1000, 1003), 0.11886666666666668),
((1013, 1018), 0.28772000000000003)
],
'B': [
((1004, 1007), 0.19626666666666667)
],
'C': [
((1008, 1012), 0.54385)
]
}