Split string in columns in Python

I have a list like this:

[[{'contributionScore': 0.841473400592804, 'variable': 'series_2'},
  {'contributionScore': 0.6113986968994141, 'variable': 'series_3'},
  {'contributionScore': 0.5985525250434875, 'variable': 'series_1'},
  {'contributionScore': 0.5641148686408997, 'variable': 'series_4'},
  {'contributionScore': 0.138543963432312, 'variable': 'series_0'}],

 [{'contributionScore': 1.1316605806350708, 'variable': 'series_1'},
  {'contributionScore': 0.5188271403312683, 'variable': 'series_4'},
  {'contributionScore': 0.38711458444595337, 'variable': 'series_3'},
  {'contributionScore': 0.35055238008499146, 'variable': 'series_0'},
  {'contributionScore': 0.06044715642929077, 'variable': 'series_2'}]]

How can I obtain a dataframe with a column for each series?

I'd like to get a dataframe with contributionScore for each series.

Thanks!

CodePudding user response：

You should be able to create a dataframe using pd.DataFrame(). Since each element of the list can itself be turned into a dataframe, you can build them in a list comprehension and concatenate the results.

Let's say the list is called "raw_list":

import pandas as pd

df = pd.concat([pd.DataFrame(x) for x in raw_list])

This would output (first five rows shown):

   contributionScore  variable
0           0.841473  series_2
1           0.611399  series_3
2           0.598553  series_1
3           0.564115  series_4
4           0.138544  series_0

EDIT:

Given OP's comment, we should pivot each table first, so that the values of the "variable" column become the column names:

df = pd.concat([pd.DataFrame(x).pivot_table(columns='variable') for x in raw_list])

Outputting:

variable           series_0  series_1  series_2  series_3  series_4
contributionScore  0.138544  0.598553  0.841473  0.611399  0.564115
contributionScore  0.350552  1.131661  0.060447  0.387115  0.518827
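
For anyone who wants to run this end to end, here is a minimal self-contained sketch. Note that it swaps in a different technique from the pivot above: it builds one plain dict per inner list and lets pd.DataFrame turn each dict into a row. It assumes each series appears at most once per inner list; the names "rows" and "wide" are only illustrative.

```python
import pandas as pd

raw_list = [
    [
        {"contributionScore": 0.841473400592804, "variable": "series_2"},
        {"contributionScore": 0.6113986968994141, "variable": "series_3"},
        {"contributionScore": 0.5985525250434875, "variable": "series_1"},
        {"contributionScore": 0.5641148686408997, "variable": "series_4"},
        {"contributionScore": 0.138543963432312, "variable": "series_0"},
    ],
    [
        {"contributionScore": 1.1316605806350708, "variable": "series_1"},
        {"contributionScore": 0.5188271403312683, "variable": "series_4"},
        {"contributionScore": 0.38711458444595337, "variable": "series_3"},
        {"contributionScore": 0.35055238008499146, "variable": "series_0"},
        {"contributionScore": 0.06044715642929077, "variable": "series_2"},
    ],
]

# One plain dict per inner list: {series name: contributionScore}.
# Assumption: a series name occurs at most once in each inner list,
# otherwise later entries would silently overwrite earlier ones.
rows = [
    {item["variable"]: item["contributionScore"] for item in inner}
    for inner in raw_list
]

# Each dict becomes one row; sort the columns so they read series_0..series_4.
wide = pd.DataFrame(rows).sort_index(axis=1)
print(wide)
```

Unlike the pivoted version, the rows here are indexed 0 and 1 rather than both being labelled contributionScore, which may be easier to work with afterwards.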

CodePudding user response：

I am a bit confused by the statement

How can I obtain a dataframe with a column for each series?

If you meant a single "variable" column holding all the series names, then Celius Stingher's answer should be good enough.

If you meant each series as its own individual column, I will extend Celius's answer as follows:

## As already stated above
df = pd.concat([pd.DataFrame(x) for x in raw_list])
## Get a sorted list of the unique series names
series_list = sorted(df['variable'].unique())
## Build a dictionary mapping each series name to the list of its contributionScore values,
## then turn that dictionary into a DataFrame
series_df = pd.DataFrame({series: list(df[df['variable'] == series]["contributionScore"]) for series in series_list})

The output will look like

    series_0    series_1    series_2    series_3    series_4
0   0.138544    0.598553    0.841473    0.611399    0.564115
1   0.350552    1.131661    0.060447    0.387115    0.518827

Note that this only works when every series has the same number of contribution scores (above, each series has two).
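
To see why the equal-length requirement matters, here is a small sketch with made-up toy values (the dictionary "scores" and the extra 1.2 are hypothetical, not from the question). Passing lists of different lengths to a single pd.DataFrame() call raises a ValueError.

```python
import pandas as pd

# Hypothetical toy data: series_3 has three scores, series_0 only two.
scores = {
    "series_0": [0.138544, 0.350552],
    "series_3": [0.611399, 0.387115, 1.2],
}

try:
    pd.DataFrame(scores)
    failed = False
except ValueError as exc:
    # pandas refuses to build a frame from columns of unequal length.
    failed = True
    print(f"ValueError: {exc}")
```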

If the series have different numbers of contribution scores, replace the third statement with the following:

## We turn each "series" value and their contribution score as dataframe and concatenate them to accommodate for the varying array lengths of each "series" column.
series_df = pd.concat([pd.DataFrame({series : list(df[df['variable'] == series]["contributionScore"])}) for series in series_list], axis = 1)

Example: if series_3 had 3 contribution scores, the output would look like this:

    series_0    series_1    series_2    series_3    series_4
0   0.138544    0.598553    0.841473    0.611399    0.564115
1   0.350552    1.131661    0.060447    0.387115    0.518827
2   NaN         NaN         NaN         1.200000    NaN

Here pd.concat lets us join pandas DataFrames with different numbers of rows, filling the gaps with NaN. That was not possible when passing lists of unequal length to a single pd.DataFrame() call. The "axis = 1" parameter tells the function to concatenate the DataFrames in the list side by side, along the columns.
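
A minimal sketch of that behaviour, using two hypothetical one-column frames of different lengths (the names "short", "long" and "combined" are only for illustration). With axis=1, pd.concat aligns the frames on their index labels, so the shorter frame gets NaN for index 2.

```python
import pandas as pd

# Two single-column frames with different numbers of rows (toy values).
short = pd.DataFrame({"series_0": [0.138544, 0.350552]})
long = pd.DataFrame({"series_3": [0.611399, 0.387115, 1.2]})

# axis=1 places the frames side by side, aligned on the row index;
# rows missing from one frame are filled with NaN.
combined = pd.concat([short, long], axis=1)
print(combined)
```

Because the alignment is done on index labels, this relies on each frame having the default 0, 1, 2, ... index, which is what pd.DataFrame() produces when built from a list.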