I have a JSON file coming in, which I am doing some operations/trimming on.
The result looks like this:
print("User:", user)
> User: {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
When applying df = pd.DataFrame(user, index=[0])
I get the following Dataframe:
id label position confidence
0 1 female NaN 0.8
When applying df = pd.DataFrame(user)
I get:
id label position confidence
lat 1 female 47.72485566 0.8
lon 1 female 10.32219439 0.8
I understand why that happens, but neither result is what I want.
I'd like the following:
id label lat lon confidence
0 1 female 47.72485566 10.32219439 0.8
However I am not sure what the best way is to split the position parameter.
CodePudding user response:
You can use pandas.json_normalize, then rename the columns:
>>> df = pd.json_normalize({'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8})
>>> df = df.rename(columns={'position.lat': 'latitude', 'position.lon': 'longitude'})
OUTPUT
id label confidence latitude longitude
0 1 female 0.8 47.724856 10.322194
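To get exactly the column names and order asked for in the question, the same approach can be extended a little. This is only a sketch: the second user record is made up for illustration, and stripping the prefix with a lambda assumes the nested keys do not clash with any top-level keys.

```python
import pandas as pd

users = [
    {"id": 1, "label": "female",
     "position": {"lat": 47.72485566, "lon": 10.32219439}, "confidence": 0.8},
    # Hypothetical second record, only here to show that a list works too
    {"id": 2, "label": "male",
     "position": {"lat": 47.5, "lon": 10.25}, "confidence": 0.6},
]

# Nested keys are flattened into 'position.lat' and 'position.lon'
df = pd.json_normalize(users)

# Keep only the part after the last dot: 'position.lat' -> 'lat'
df = df.rename(columns=lambda name: name.split(".")[-1])

# Select the columns explicitly to get the requested order
df = df[["id", "label", "lat", "lon", "confidence"]]
print(df)
```

Selecting the columns explicitly at the end avoids relying on whatever order json_normalize happens to produce.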
CodePudding user response:
If I understand you correctly, you want to remove the nested dict at 'position' and merge its entries into the original dict:
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position')) # happens in-place
pd.DataFrame(user, index=[0])
>>> id label confidence lat lon
>>> 0 1 female 0.8 47.724856 10.322194
However, a single record like this seems to fit a pandas Series better:
pd.Series(user)
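If the original dict is still needed afterwards, the in-place update can be avoided by building a flattened copy first. A minimal sketch, assuming 'position' is the only nested key; the name flat is just for illustration:

```python
import pandas as pd

user = {"id": 1, "label": "female",
        "position": {"lat": 47.72485566, "lon": 10.32219439}, "confidence": 0.8}

# Copy every top-level entry except the nested dict, then merge its entries in
flat = {key: value for key, value in user.items() if key != "position"}
flat.update(user["position"])

s = pd.Series(flat)        # one record as a Series, the keys become the index
df = pd.DataFrame([flat])  # one-row DataFrame, no index=[0] needed

print(s)
print(df)
```

Wrapping the dict in a list tells pandas to treat it as one row, which is why index=[0] can be dropped here.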
EDIT: The solution from ThePyGuy seems more general, at the cost of execution time. Whether that matters depends on the situation.
%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
df = pd.json_normalize(user)
df = df.rename(columns={'position.lat': 'latitude', 'position.lon': 'longitude'})
>>> 811 µs ± 121 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position')) # happens in-place
df = pd.DataFrame(user, index=[0])
>>> 424 µs ± 45.6 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position')) # happens in-place
df = pd.Series(user)
>>> 167 µs ± 20.2 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
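The %%timeit cells only work in IPython/Jupyter. Outside a notebook, a rough equivalent can be sketched with the standard timeit module. The function names below are made up for this example, and the numbers will differ by machine and pandas version, so treat them as relative only.

```python
import timeit

import pandas as pd


def make_user():
    # Fresh dict per call, because two of the variants mutate it
    return {"id": 1, "label": "female",
            "position": {"lat": 47.72485566, "lon": 10.32219439}, "confidence": 0.8}


def with_json_normalize():
    df = pd.json_normalize(make_user())
    return df.rename(columns={"position.lat": "lat", "position.lon": "lon"})


def with_update_dataframe():
    user = make_user()
    user.update(user.pop("position"))
    return pd.DataFrame(user, index=[0])


def with_update_series():
    user = make_user()
    user.update(user.pop("position"))
    return pd.Series(user)


results = {}
for func in (with_json_normalize, with_update_dataframe, with_update_series):
    # 10 runs of 10 loops each, mirroring -n 10 -r 10; keep the best run
    runs = timeit.repeat(func, number=10, repeat=10)
    results[func.__name__] = min(runs) / 10
    print(f"{func.__name__}: {results[func.__name__] * 1e6:.0f} µs per loop")
```

Note that this reports the best of the 10 runs rather than mean ± std. dev., so the figures are not directly comparable to the %%timeit output above.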