Home > OS >  How to extract some values from string in column in DataFrame in Pandas Python?
How to extract some values from string in column in DataFrame in Pandas Python?

Time:08-25

I have DataFrame in Python Pandas like below ("col1" is "object" data type):

col1
------------
{"CITY":"MADRID","A_DT":"2022-05-10T00:00","PLACE":"XYT"}
{"CITY":"BERLIN","A_DT":"2022-07-11T00:00","PLACE":"TR"}
{"CITY":"BOSTON","A_DT":null,"PLACE":"abbnpK"}
...

And I need to create new column "col2" (string data type) where I need to have value which is between "A_DT":" and ","PLACE" in appropriate form for date, so as a result I need something like below:

col1                                                      | col2
----------------------------------------------------------|-------------
{"CITY":"MADRID","A_DT":"2022-05-10T00:00","PLACE":"XYT"} | 2022-05-10
{"CITY":"BERLIN","A_DT":"2022-07-11T00:00","PLACE":"TR"}  | 2022-07-11
{"CITY":"BOSTON","A_DT":null,"PLACE":"abbnpK"}            | null
...                                                       | ...

I tried to use code like below, but I have error and it does not work probably because sometimes there is value null sometimes value like "2022-07-11T00:00" with quote:

df.col1.str.split('A_DT":"').apply(lambda x: x[1].split('","PLACE"')

Can you suggest better solution appropriate for my DataFrame in Python Pandas ?

CodePudding user response:

That looks like json data. It's probably a good idea to parse it into a dataframe and work with that.

import json

df2 = pd.DataFrame.from_records(df.col1.apply(json.loads))

gives

     CITY              A_DT   PLACE
0  MADRID  2022-05-10T00:00     XYT
1  BERLIN  2022-07-11T00:00      TR
2  BOSTON              None  abbnpK

Now you can simply select the data via df2.A_DT.

CodePudding user response:

You can use this regex to get the string in between your 2 strings

df["date"] = df.col1.astype(str).str.extract("'A_DT': '(.*?)', 'PLACE")

Output:

print(df)
col1 date
0 {'CITY': 'MADRID', 'A_DT': '2022-05-10T00:00', 'PLACE': 'XYT'} 2022-05-10T00:00
1 {'CITY': 'BERLIN', 'A_DT': '2022-07-11T00:00', 'PLACE': 'TR'} 2022-07-11T00:00
  • Related