How to transform a time series into a two-column dataframe showing the count for each element of the

I have data in a file that takes the form of a list of arrays: each line corresponds to an array of integers (it is a time series), and the first element of each array is an index. Here is an example:

1 101 103 238 156 48 78
2 238 420 156 103 26
3 220 103 154 48 101 238 156 26 420
4 26 54 43 103 156 238 48

The lines do not all have the same number of elements, and some elements appear in more than one line while others do not.

I would like, using Python, to transform the data so that I have two columns: the first lists all the integers appearing in the original dataset, and the second gives the number of occurrences of each. i.e. in the example given:

Could anyone please let me know how I could do that? Is there a straightforward way to do this using pandas or NumPy, for example? Many thanks in advance!

CodePudding user response：

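import pandas as pd

# Each list is one line of the file; its first element is the line index.
array1 = [1, 101, 103, 238, 156, 48, 78]
array2 = [2, 238, 420, 156, 103, 26]
array3 = [3, 220, 103, 154, 48, 101, 238, 156, 26, 420]
array4 = [4, 26, 54, 43, 103, 156, 238, 48]

# Join the lists with + and count how often each value occurs.
# Note: the leading indices (1 to 4) are counted too; slice with [1:] to exclude them.
pd.Series(array1 + array2 + array3 + array4).value_counts()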
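If the goal is the two-column frame described in the question, starting from the file rather than from hand-written lists, something along these lines might work. This is only a sketch: the `io.StringIO` object stands in for the real file, the file name `data.txt` mentioned in the comment is hypothetical, the column names `value` and `count` are my own choice, and I am assuming the first number on each line is an index that should not be counted.

```python
import io

import pandas as pd

# Stand-in for the real file; with a file on disk you could use
# open("data.txt") instead ("data.txt" is a hypothetical name).
raw = io.StringIO(
    "1 101 103 238 156 48 78\n"
    "2 238 420 156 103 26\n"
    "3 220 103 154 48 101 238 156 26 420\n"
    "4 26 54 43 103 156 238 48\n"
)

# Split each non-empty line, skip the leading index, and flatten everything.
values = [int(tok) for line in raw if line.strip() for tok in line.split()[1:]]

# value_counts() returns a Series indexed by value; sort it by value and
# turn it into a frame with two named columns.
counts = pd.Series(values).value_counts().sort_index()
result = pd.DataFrame({"value": counts.index, "count": counts.to_numpy()})
print(result)
```

Dropping `.sort_index()` keeps the default ordering of `value_counts()`, which is by descending count.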

CodePudding user response：

What you are asking is how to build a frequency distribution from multiple arrays. There are many solutions to this problem; one is to use NumPy. Let's say you have the following nested list (the rows have different lengths, so it cannot be turned into a regular 2-D array):

import numpy

time_series = [[0, 1, 2], [3, 4], [5, 6, 7, 8]]

Then you can concatenate the nested list into a one-dimensional array and use numpy.unique to find the frequency distribution. With return_counts=True, numpy.unique returns two arrays, the sorted unique values and their counts, which are then stacked into a single array with vstack.

temp = numpy.concatenate(time_series)  # flatten the rows into one 1-D array
# Row 0 of the result holds the unique values, row 1 their counts.
distribution = numpy.vstack(numpy.unique(temp, return_counts=True))
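Applied to the data from the question, the same idea could look roughly like the sketch below. I am assuming here that the first entry of each row is an index to be skipped, and I use `numpy.column_stack` instead of `vstack` so that the result has two columns (value, count) rather than two rows; the variable names are only illustrative.

```python
import numpy as np

# Rows from the question; the first entry of each row is the index, not data.
rows = [
    [1, 101, 103, 238, 156, 48, 78],
    [2, 238, 420, 156, 103, 26],
    [3, 220, 103, 154, 48, 101, 238, 156, 26, 420],
    [4, 26, 54, 43, 103, 156, 238, 48],
]

# Drop the index from each row, then flatten the ragged rows into one 1-D array.
flat = np.concatenate([row[1:] for row in rows])

# Sorted unique values and how many times each one occurs.
values, counts = np.unique(flat, return_counts=True)

# column_stack puts the two 1-D arrays side by side: shape (n_unique, 2).
table = np.column_stack((values, counts))
print(table)
```

If a DataFrame is needed afterwards, `table` can be passed to `pandas.DataFrame` together with a `columns` argument.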