Pandas split list upon DataFrame creation

Time:07-06

I have a JSON file coming in, which I am doing some operations/trimming on.

The result looks like this:

print("User:", user)
> User: {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}

When applying df = pd.DataFrame(user, index=[0]) I get the following DataFrame:

     id   label    position  confidence
0    1    female   NaN       0.8

When applying df = pd.DataFrame(user) I get:

      id   label    position     confidence
lat   1    female   47.72485566  0.8
lon   1    female   10.32219439  0.8

I understand why both of these happen, but neither is what I want.

I'd like the following:

     id   label    lat          lon           confidence
0    1    female   47.72485566  10.32219439   0.8

However, I am not sure what the best way is to split the position field into separate columns.

CodePudding user response:

You can use pandas.json_normalize, then rename the columns:

>>> df = pd.json_normalize({'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8})
>>> df = df.rename(columns={'position.lat': 'latitude', 'position.lon': 'longitude'})

OUTPUT

   id   label  confidence   latitude  longitude
0   1  female         0.8  47.724856  10.322194
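If you want exactly the column names and order from the question (lat and lon rather than renamed columns), one possible variation is to strip the "position." prefix that json_normalize adds and then reorder. This is only a sketch: it assumes no two nested keys collide once the prefix is removed.

```python
import pandas as pd

user = {'id': 1, 'label': 'female',
        'position': {'lat': 47.72485566, 'lon': 10.32219439},
        'confidence': 0.8}

# json_normalize flattens nested dicts into dotted column names
# such as 'position.lat' and 'position.lon'.
df = pd.json_normalize(user)

# Keep only the part after the last dot (assumes the short names are unique).
df.columns = [name.split('.')[-1] for name in df.columns]

# Reorder to match the layout requested in the question.
df = df[['id', 'label', 'lat', 'lon', 'confidence']]
print(df)
```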

CodePudding user response:

If I understand you correctly, you want to remove the nested dict at 'position' and merge its keys into the outer dict:

user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position'))  # happens in-place

pd.DataFrame(user, index=[0])


>>>    id   label   confidence  lat         lon
>>> 0   1   female  0.8         47.724856   10.322194
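Note that user.pop('position') changes the original dict. If the input needs to stay intact, a non-mutating variant could look like the sketch below; the name flat is just an illustrative choice.

```python
import pandas as pd

user = {'id': 1, 'label': 'female',
        'position': {'lat': 47.72485566, 'lon': 10.32219439},
        'confidence': 0.8}

# Copy every top-level key except 'position' into a new dict ...
flat = {key: value for key, value in user.items() if key != 'position'}
# ... then merge the nested lat/lon keys into that copy.
flat.update(user['position'])

# A list with one dict yields a one-row DataFrame with index 0.
df = pd.DataFrame([flat])
print(df)
```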

However, a single record like this may fit a pandas Series better than a one-row DataFrame:

pd.Series(user)
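As a rough illustration of the Series route: the flattened keys become the index labels, and because the values mix int, str and float, the Series ends up with object dtype. If a one-row DataFrame is needed later, to_frame().T is one way to get it.

```python
import pandas as pd

user = {'id': 1, 'label': 'female',
        'position': {'lat': 47.72485566, 'lon': 10.32219439},
        'confidence': 0.8}
user.update(user.pop('position'))  # flatten in place, as above

s = pd.Series(user)
print(s['lat'])   # values are looked up by key
print(s.dtype)    # mixed value types are stored as object

# Turn the Series into a single-row DataFrame if required.
df = s.to_frame().T
```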

EDIT: The solution from ThePyGuy is more general, at the cost of execution time. Whether that matters depends on the situation.

%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
df = pd.json_normalize(user)
df = df.rename(columns={'position.lat': 'latitude', 'position.lon': 'longitude'})
>>> 811 µs ± 121 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position'))  # happens in-place
df = pd.DataFrame(user, index=[0])
>>> 424 µs ± 45.6 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)

%%timeit -n 10 -r 10
user = {'id': 1, 'label': 'female', 'position': {'lat': 47.72485566, 'lon': 10.32219439}, 'confidence': 0.8}
user.update(user.pop('position'))  # happens in-place
df = pd.Series(user)
>>> 167 µs ± 20.2 µs per loop (mean ± std. dev. of 10 runs, 10 loops each)
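The %%timeit cells only work in IPython or Jupyter. To repeat the comparison in a plain script, something like the following sketch with the standard-library timeit module should do; the function names and the loop count of 100 are arbitrary choices, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

USER = {'id': 1, 'label': 'female',
        'position': {'lat': 47.72485566, 'lon': 10.32219439},
        'confidence': 0.8}


def with_json_normalize():
    df = pd.json_normalize(USER)
    return df.rename(columns={'position.lat': 'latitude',
                              'position.lon': 'longitude'})


def with_flatten_dataframe():
    user = dict(USER)  # shallow copy so pop() leaves USER untouched
    user.update(user.pop('position'))
    return pd.DataFrame(user, index=[0])


def with_flatten_series():
    user = dict(USER)  # shallow copy so pop() leaves USER untouched
    user.update(user.pop('position'))
    return pd.Series(user)


loops = 100
timings = {}
for func in (with_json_normalize, with_flatten_dataframe, with_flatten_series):
    total_seconds = timeit.timeit(func, number=loops)
    timings[func.__name__] = total_seconds / loops
    print(f"{func.__name__}: {timings[func.__name__] * 1e6:.0f} µs per call")
```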