I have data in a file that takes the form of a list of arrays: each line corresponds to an array of integers, and the first element of each array is an index (the data is a time series). Here is an example:
1 101 103 238 156 48 78
2 238 420 156 103 26
3 220 103 154 48 101 238 156 26 420
4 26 54 43 103 156 238 48
The lines do not all have the same number of elements, and some elements appear in more than one line while others do not.
Using Python, I would like to transform the data into two columns: the first lists all the integers appearing in the original dataset, and the second gives the number of occurrences of each. For the example given:
26 3
43 1
48 3
54 1
78 1
101 2
103 4
154 1
156 4
220 1
238 4
420 2
Could anyone please let me know how I could do that? Is there a straightforward way to do this using Pandas or NumPy, for example? Many thanks in advance!
CodePudding user response:
import pandas as pd
array1 = [1, 101, 103, 238, 156, 48, 78]
array2 = [2, 238, 420, 156, 103, 26]
array3 = [3, 220, 103, 154, 48, 101, 238, 156, 26, 420]
array4 = [4, 26, 54, 43, 103, 156, 238, 48]
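# Drop the leading index of each array, join the lists, then count each value
values = array1[1:] + array2[1:] + array3[1:] + array4[1:]
pd.Series(values).value_counts().sort_index()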
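The snippet above hard-codes the four arrays. Since the question starts from a file, here is a minimal sketch of the same count done line by line with the standard library's collections.Counter. It assumes the values are separated by whitespace; the sample text stands in for the file, and the file name "data.txt" mentioned in the comment and the helper count_values are hypothetical names used only for illustration.

```python
import io
from collections import Counter

# Sample data from the question. With a real file, pass open("data.txt")
# instead ("data.txt" is a hypothetical file name).
sample = io.StringIO(
    "1 101 103 238 156 48 78\n"
    "2 238 420 156 103 26\n"
    "3 220 103 154 48 101 238 156 26 420\n"
    "4 26 54 43 103 156 238 48\n"
)

def count_values(lines):
    """Count every integer on each line, ignoring the leading index."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        counts.update(int(x) for x in fields[1:])  # fields[0] is the index
    return counts

counts = count_values(sample)

# Print two columns: value and number of occurrences, sorted by value.
for value in sorted(counts):
    print(value, counts[value])
```

Reading line by line avoids the problem of rows having different lengths, which is what makes loading this kind of file straight into a rectangular table awkward.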
CodePudding user response:
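What you are asking is how to create a frequency distribution from multiple arrays. There are many solutions to this problem; one of them uses numpy. Let's say you have the following ragged list of lists (recent NumPy versions raise a ValueError if numpy.array is given rows of different lengths without dtype=object, so a plain list is used here):
import numpy
time_series = [[0,1,2],[3,4],[5,6,7,8]]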
Then you can concatenate the multi-dimensional list into a one-dimensional array and use numpy.unique to find the frequency distribution. With return_counts=True, numpy.unique returns two arrays, unique and counts, which are then stacked using vstack.
temp = numpy.concatenate(time_series)
# Row 0 holds the unique values, row 1 holds their counts
distribution = numpy.vstack(numpy.unique(temp, return_counts=True))
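Applied to the data from the question, the same idea can be sketched as follows. This assumes the rows have already been read into Python lists; the variable names rows and table are illustrative only. The leading index of each row is dropped before counting, and column_stack is used instead of vstack so that the result has one value/count pair per row, as in the requested output.

```python
import numpy as np

# Rows from the question; the first element of each row is the index.
rows = [
    [1, 101, 103, 238, 156, 48, 78],
    [2, 238, 420, 156, 103, 26],
    [3, 220, 103, 154, 48, 101, 238, 156, 26, 420],
    [4, 26, 54, 43, 103, 156, 238, 48],
]

# Drop the index of each row, then flatten everything into one 1-D array.
values = np.concatenate([row[1:] for row in rows])

# np.unique returns the sorted unique values and, with return_counts=True,
# how many times each one occurs.
unique, counts = np.unique(values, return_counts=True)

# Two columns: value, count.
table = np.column_stack((unique, counts))
print(table)
```

If the table needs to go back to a text file, numpy.savetxt with fmt="%d" should write it as two whitespace-separated integer columns.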