I have the following list of numbers, which are random:
numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
For each number (1, 2, 3, 4, 5) I want to know the mean of the numbers that follow it.
Here is an example:
1 appears two times, at positions 0 and 6 in the list.
At position 0, it is immediately followed by the number 3 (at position 1) and at position 6 it is followed by the number 5 (at position 7).
So 1 appears two times and is immediately followed by 3 and 5.
The mean of 3 and 5 is (3 + 5)/2 = 4.0
So the result for 1 is 4.
Using the same method for 2:
2 is found at positions 4, 10 and 11, and is followed by 4 and 2. The final 2 at the end of the list is discarded because it is followed by nothing.
So the result for 2 is (4 + 2)/2 = 3.0
If I go on with this method and present the results as a dictionary, I obtain this:
results = {
    1: 4.0,
    2: 3.0,
    3: 5.0,   # 5/1
    4: 3.0,   # (1 + 5)/2
    5: 3.25,  # (5 + 2 + 4 + 2)/4
}
I need to automate this procedure in an efficient way because it is supposed to run on very long lists.
I want to solve this using pandas or numpy but I am a total beginner with these packages.
I am of course reading the documentation, but it is so long that I feel like I will find a solution in two years :D
Any help, shortcut or link to the right parts of the docs would be appreciated.
The results don't have to be a dictionary. It can be anything, like for instance a new dataframe, as long as the computation is efficient, and elegant if possible.
Thanks for your time !
CodePudding user response:
How about using collections.defaultdict and zip (or itertools.pairwise on Python 3.10+):
from collections import defaultdict
numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
dct = defaultdict(list)
for x, y in zip(numbers, numbers[1:]):
    # (Alternatively, on Python 3.10+) for x, y in itertools.pairwise(numbers):
    dct[x].append(y)
dct = {k: sum(lst) / len(lst) for k, lst in dct.items()}
print(dct)
# {1: 4.0, 3: 5.0, 5: 3.25, 2: 3.0, 4: 3.0}
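If the list is very long, keeping every follower in a list per key may use more memory than necessary. Here is a minimal sketch of the same pairing idea that only keeps a running sum and a count per value. The function name mean_of_followers is just something I made up for illustration, and islice is used only to avoid copying the list with numbers[1:].

```python
from collections import defaultdict
from itertools import islice


def mean_of_followers(numbers):
    """Map each value to the mean of the values that immediately follow it."""
    totals = defaultdict(float)  # running sum of followers, per value
    counts = defaultdict(int)    # number of followers seen, per value
    # Pair every element with its successor; the last element has no
    # successor, so zip() drops it automatically.
    for current, following in zip(numbers, islice(numbers, 1, None)):
        totals[current] += following
        counts[current] += 1
    return {value: totals[value] / counts[value] for value in counts}


numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
print(mean_of_followers(numbers))
# {1: 4.0, 3: 5.0, 5: 3.25, 2: 3.0, 4: 3.0}
```

A value that only ever appears as the last element never gets a follower, so it is simply absent from the result.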
CodePudding user response:
Pandas Approach
import pandas as pd
s = pd.Series(numbers)  # numbers is the list from the question
s.shift(-1).groupby(s).mean().to_dict()
# {1: 4.0, 2: 3.0, 3: 5.0, 4: 3.0, 5: 3.25}
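In case the one-liner is hard to follow, here is the same computation split into steps, as a sketch. The intermediate names followers and result are only for illustration.

```python
import pandas as pd

numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
s = pd.Series(numbers)

# shift(-1) moves every value up by one position, so row i now holds
# numbers[i + 1]. The last row has no follower and becomes NaN.
followers = s.shift(-1)

# Group the followers by the original value at the same position.
# mean() skips NaN by default, which discards the final element
# that is followed by nothing.
result = followers.groupby(s).mean()

print(result.to_dict())
# {1: 4.0, 2: 3.0, 3: 5.0, 4: 3.0, 5: 3.25}
```

One thing to be aware of: a value that occurs only as the very last element would still show up as a key here, with NaN as its mean, because its only follower is NaN.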
CodePudding user response:
You could create a dataframe using the numbers list zipped with itself offset by 1, then use groupby to generate means for each number:
import pandas as pd
numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
df = pd.DataFrame(zip(numbers, numbers[1:]), columns=['num', 'next'])
df.groupby('num').mean().reset_index().rename(columns={'next':'mean'})
Output:
   num  mean
0    1  4.00
1    2  3.00
2    3  5.00
3    4  3.00
4    5  3.25
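Since the question also mentions numpy, the same offset-by-one pairing could probably be done without pandas. This is only a sketch, assuming the values can be sorted by np.unique (integers, as in the question); I have not benchmarked it against the groupby versions.

```python
import numpy as np

numbers = [1, 3, 5, 5, 2, 4, 1, 5, 4, 5, 2, 2]
arr = np.asarray(numbers)

# Each element paired with its successor; the last element is dropped
# from `current` because nothing follows it.
current, following = arr[:-1], arr[1:]

# np.unique returns the sorted distinct values and, with return_inverse,
# an integer code (0..k-1) telling which distinct value each element is.
values, codes = np.unique(current, return_inverse=True)

# bincount with weights sums the followers per code; without weights it
# counts how many followers each code has.
sums = np.bincount(codes, weights=following)
counts = np.bincount(codes)
means = sums / counts

result = dict(zip(values.tolist(), means.tolist()))
print(result)
# {1: 4.0, 2: 3.0, 3: 5.0, 4: 3.0, 5: 3.25}
```

Every distinct value in current has at least one follower, so counts never contains a zero and the division is safe.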