Pivot pandas dataframe from single row to one row per item-CodePudding

I recently applied a transformation to unnest a nested json, in order to have a flat dataset to work with, and while the transformation works, the final format is not the one I am looking for. It compressed all the data into a single row and added suffixes to column names, instead of separating into different columns for each id_prop.

My dataset in JSON format to replicate with Pandas:

import pandas as pd
json = {"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')

My result:

	id_prop.0	id_prop.1	id_prop.2	prop_number.0	prop_number.1	prop_number.2	prop_value.0	prop_value.1	prop_value.2
0	1	2	3	123	325	754	1	1	1

The result I expect:

	id_prop	prop_number	prop_value
0	1	123	1
1	2	325	1
2	3	754	1

Is there any way to pivot the dataframe into the format I need, where each row represents the values of a single id_prop?

Attemps

I have already extracted the names of the columns I need without suffixes:

def extract_cols(columns):
    myset = set()
    myset_add = myset.add
    return [x for x in columns if not (x in myset or myset_add(x))]

cols = extract_cols(df.columns.str.replace("\.[0-9]", "", regex=True))

And also "verticalized" the results I need using stack():

df_stacked = df.stack().reset_index(level=1, drop=True)

But I haven't figured out how to combine that info yet. Any help would be highly appreciated.

Extra:

If there is also a way to apply this using pyspark, then much better!

CodePudding user response：

You can use pd.wide_to_long:

print (pd.wide_to_long(df.assign(dummy=0),
                       stubnames=["id_prop", "prop_number", "prop_value"],
                       i="dummy", j="index", sep=".", suffix=".*")
         .reset_index(level=0, drop=True))

       id_prop  prop_number  prop_value
index
0            1          123           1
1            2          325           1
2            3          754           1

CodePudding user response：

You can split the columns to a multiindex and then stack it:

df.set_axis(
    pd.MultiIndex.from_tuples([tuple(i) for i in df.columns.str.split(".")]), axis=1
).stack().droplevel(0)

CodePudding user response：

Here's a way:

df = df.T
df.index = pd.MultiIndex.from_arrays([[x[x.find('.') 1:] for x in df.index], [x[:x.find('.')] for x in df.index]])
df = df.unstack()

Input:

   id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0          1          2          3            123            325            754             1             1             1

Output:

  id_prop prop_number prop_value
0       1         123          1
1       2         325          1
2       3         754          1

Explanation:

transpose so we can work with the index instead of the columns
parse each label into the desired label (prefix) and the result number (suffix) split by the . character
update the df's index to be a MultiIndex with two levels: an list of result numbers and a list of desired labels
call unstack to pivot a level of the MultiIndex (the desired labels) to be column headings

UPDATE: To handle labels where result number is the second . separated token with additional tokens to its right (as described in OP's comment), we can do this:

import pandas as pd
json = {
    "building.0.description.bedrooms":{"0":"qrs"}, 
    "building.1.description.bedrooms":{"0":"tuv"}, 
    "building.2.description.bedrooms":{"0":"xyz"}, 

    "id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},
    "prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},
    "prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
print(df.to_string())

df = df.T
df.index = pd.MultiIndex.from_arrays([[x.split('.')[1] for x in df.index], ['.'.join(x.split('.')[0:1]   x.split('.')[2:]) for x in df.index]])
df = df.unstack()
df.columns = df.columns.get_level_values(1)
print(df)

Input:

  building.0.description.bedrooms building.1.description.bedrooms building.2.description.bedrooms  id_prop.0  id_prop.1  id_prop.2  prop_number.0  prop_number.1  prop_number.2  prop_value.0  prop_value.1  prop_value.2
0                             qrs                             tuv                             xyz          1          2          3            123            325            754             1             1             1

Output:

  building.description.bedrooms id_prop prop_number prop_value
0                           qrs       1         123          1
1                           tuv       2         325          1
2                           xyz       3         754          1