I have a data frame with a column that holds lists of dictionaries (of unequal length). I want to create new columns from it, using each dictionary key as the column name and the dictionary value as the column value.
      criteria
0     [{'Seniority level': 'Entry level'}, {'Employm...
1     [{'Employment type': 'Full-time'}]
2     [{'Seniority level': 'Associate'}, {'Employmen...
3     [{'Employment type': 'Part-time'}]
4     [{'Seniority level': 'Mid-Senior level'}, {'Em...
...   ...
2768  [{'Seniority level': 'Entry level'}, {'Employm...
2769  [{'Seniority level': 'Entry level'}, {'Employm...
2770  [{'Seniority level': 'Entry level'}, {'Employm...
2771  [{'Seniority level': 'Mid-Senior level'}, {'Em...
2772  [{'Seniority level': 'Entry level'}, {'Employm...
I want to create the new column like this
CodePudding user response:
I have a function that does something along those lines:
import pandas as pd


def reformat_json_column(dataframe: pd.DataFrame, column_name: str) -> pd.DataFrame:
    """
    Split a list of JSON data with one line per element of the JSON
    Each key of the JSON data is then used to construct a column and store the
    related data
    """
    data = dataframe.explode(column_name).reset_index(drop=True)
    data = pd.concat(
        [
            data.drop(column_name, axis=1),
            pd.json_normalize(data[column_name]),  # type: ignore
        ],
        axis=1,
    )
    return data
Here is a working example:
test_df = pd.DataFrame(
    {
        "a": [1, 2, 3],
        "b": [
            [{"c": 4, "d": 5}],
            [{"c": 6, "d": 7}],
            [{"c": 8, "d": 9}, {"c": 10, "d": 11}],
        ],
    }
)
assert_df = pd.DataFrame(
    {"a": [1, 2, 3, 3], "c": [4, 6, 8, 10], "d": [5, 7, 9, 11]}
)
pd.testing.assert_frame_equal(reformat_json_column(test_df, "b"), assert_df)
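For data shaped like the question's criteria column, where each list holds single-key dictionaries, exploding yields one row per dictionary. If one row per original record is wanted instead, a possible variation is to merge each list into a single dictionary before building the columns. This is only a sketch: the helper name merge_dicts and the sample values are made up for illustration, and it assumes a key does not repeat within one list.

```python
import pandas as pd

# Hypothetical sample mirroring the question's "criteria" column
df = pd.DataFrame(
    {
        "criteria": [
            [{"Seniority level": "Entry level"}, {"Employment type": "Full-time"}],
            [{"Employment type": "Part-time"}],
        ]
    }
)


def merge_dicts(dicts: list) -> dict:
    """Collapse a list of dictionaries into one (later keys overwrite earlier ones)."""
    merged = {}
    for d in dicts:
        merged.update(d)
    return merged


# One merged dictionary per row, then one column per key;
# keys absent from a row end up as NaN in that row
expanded = pd.DataFrame(df["criteria"].map(merge_dicts).tolist(), index=df.index)
result = df.drop(columns="criteria").join(expanded)
print(result)
```

Passing index=df.index keeps the new columns aligned with the original rows even when the data frame does not have a default integer index.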
CodePudding user response:
To create a new column in a pandas DataFrame from values computed per row, you can use the DataFrame.apply() method. It applies a function to each row or column of the DataFrame, and the result can be assigned as a new column.
Here is an example that builds a DataFrame from a list of dictionaries with differing keys and then derives a new column from it:
import pandas as pd

# Create a DataFrame from a list of dictionaries; keys missing from a
# dictionary are filled with NaN
df = pd.DataFrame([{'col1': 1, 'col2': 2, 'col3': 3},
                   {'col1': 4, 'col3': 5},
                   {'col2': 6}])

# Define a function that reads the 'col3' entry of a row
def get_col3_value(row):
    # row is a Series indexed by column name; Series.get returns None
    # only if the DataFrame has no 'col3' column at all
    return row.get('col3')

# Apply the function to each row of the DataFrame and add the result as a new column
df['col4'] = df.apply(get_col3_value, axis=1)

# Print the resulting DataFrame
print(df)
# Output:
#    col1  col2  col3  col4
# 0   1.0   2.0   3.0   3.0
# 1   4.0   NaN   5.0   5.0
# 2   NaN   6.0   NaN   NaN

In this code, the DataFrame.apply() method applies the get_col3_value() function to each row of the DataFrame (axis=1). Each row is passed as a Series, and the function reads its col3 entry. pandas has already filled the keys that were missing from a source dictionary with NaN, so rows whose dictionary had no col3 key get NaN in the new col4 column. The columns containing NaN are stored as floats, which is why the integers print as 1.0, 2.0, and so on.
You can modify this approach to use a different key and function to create the new column in the DataFrame. Just make sure to adjust the function accordingly to extract the correct value from the dictionary.
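As a sketch of that adjustment for a column that holds lists of dictionaries, as in the question, the function can scan each list for a given key and Series.apply() can run it once per key. The helper name get_value and the sample data below are assumptions for illustration, not part of the original question.

```python
import pandas as pd

# Hypothetical sample mirroring the question's "criteria" column
df = pd.DataFrame(
    {
        "criteria": [
            [{"Seniority level": "Entry level"}, {"Employment type": "Full-time"}],
            [{"Employment type": "Part-time"}],
        ]
    }
)


def get_value(dicts, key):
    """Return the value under `key` from the first dictionary that has it, else None."""
    for d in dicts:
        if key in d:
            return d[key]
    return None


# Create one new column per key of interest
for key in ("Seniority level", "Employment type"):
    df[key] = df["criteria"].apply(get_value, args=(key,))

print(df[["Seniority level", "Employment type"]])
```

This requires knowing the keys in advance; when the keys are not known up front, normalizing the dictionaries into columns is usually the more convenient route.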