I recently applied a transformation to unnest a nested json, in order to have a flat dataset to work with, and while the transformation works, the final format is not the one I am looking for. It compressed all the data into a single row and added suffixes to column names, instead of separating into different columns for each id_prop
.
My dataset in JSON format to replicate with Pandas:
import pandas as pd
json = {"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
My result:
id_prop.0 | id_prop.1 | id_prop.2 | prop_number.0 | prop_number.1 | prop_number.2 | prop_value.0 | prop_value.1 | prop_value.2 | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 123 | 325 | 754 | 1 | 1 | 1 |
The result I expect:
id_prop | prop_number | prop_value | |
---|---|---|---|
0 | 1 | 123 | 1 |
1 | 2 | 325 | 1 |
2 | 3 | 754 | 1 |
Is there any way to pivot the dataframe into the format I need, where each row represents the values of a single id_prop
?
Attemps
I have already extracted the names of the columns I need without suffixes:
def extract_cols(columns):
myset = set()
myset_add = myset.add
return [x for x in columns if not (x in myset or myset_add(x))]
cols = extract_cols(df.columns.str.replace("\.[0-9]", "", regex=True))
And also "verticalized" the results I need using stack()
:
df_stacked = df.stack().reset_index(level=1, drop=True)
But I haven't figured out how to combine that info yet. Any help would be highly appreciated.
Extra:
If there is also a way to apply this using pyspark, then much better!
CodePudding user response:
You can use pd.wide_to_long
:
print (pd.wide_to_long(df.assign(dummy=0),
stubnames=["id_prop", "prop_number", "prop_value"],
i="dummy", j="index", sep=".", suffix=".*")
.reset_index(level=0, drop=True))
id_prop prop_number prop_value
index
0 1 123 1
1 2 325 1
2 3 754 1
CodePudding user response:
You can split the columns to a multiindex and then stack it:
df.set_axis(
pd.MultiIndex.from_tuples([tuple(i) for i in df.columns.str.split(".")]), axis=1
).stack().droplevel(0)
CodePudding user response:
Here's a way:
df = df.T
df.index = pd.MultiIndex.from_arrays([[x[x.find('.') 1:] for x in df.index], [x[:x.find('.')] for x in df.index]])
df = df.unstack()
Input:
id_prop.0 id_prop.1 id_prop.2 prop_number.0 prop_number.1 prop_number.2 prop_value.0 prop_value.1 prop_value.2
0 1 2 3 123 325 754 1 1 1
Output:
id_prop prop_number prop_value
0 1 123 1
1 2 325 1
2 3 754 1
Explanation:
- transpose so we can work with the index instead of the columns
- parse each label into the desired label (prefix) and the result number (suffix) split by the
.
character - update the df's index to be a MultiIndex with two levels: an list of result numbers and a list of desired labels
- call
unstack
to pivot a level of the MultiIndex (the desired labels) to be column headings
UPDATE: To handle labels where result number is the second .
separated token with additional tokens to its right (as described in OP's comment), we can do this:
import pandas as pd
json = {
"building.0.description.bedrooms":{"0":"qrs"},
"building.1.description.bedrooms":{"0":"tuv"},
"building.2.description.bedrooms":{"0":"xyz"},
"id_prop.0":{"0":1},"id_prop.1":{"0":2},"id_prop.2":{"0":3},
"prop_number.0":{"0":123},"prop_number.1":{"0":325},"prop_number.2":{"0":754},
"prop_value.0":{"0":1},"prop_value.1":{"0":1},"prop_value.2":{"0":1}}
df = pd.DataFrame.from_dict(json, orient='columns')
print(df.to_string())
df = df.T
df.index = pd.MultiIndex.from_arrays([[x.split('.')[1] for x in df.index], ['.'.join(x.split('.')[0:1] x.split('.')[2:]) for x in df.index]])
df = df.unstack()
df.columns = df.columns.get_level_values(1)
print(df)
Input:
building.0.description.bedrooms building.1.description.bedrooms building.2.description.bedrooms id_prop.0 id_prop.1 id_prop.2 prop_number.0 prop_number.1 prop_number.2 prop_value.0 prop_value.1 prop_value.2
0 qrs tuv xyz 1 2 3 123 325 754 1 1 1
Output:
building.description.bedrooms id_prop prop_number prop_value
0 qrs 1 123 1
1 tuv 2 325 1
2 xyz 3 754 1