Extracting Key Value pairs from a String using Regex

Time:11-14

I have a web-scraped string containing key-value pairs, e.g. firstName:"Quaran", lastName:"McPherson":

st = '{"accountId":405266,"firstName":"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null},{"accountId":375964,"firstName":"Micole","lastName":"Cayton","accountIdentifier":"StudentAthlete","profilePicUrl":"https://opendorsepr.blob.core.windows.net/media/375964/20220622223838_46dbe3fd-a683-436b-84d4-90c84a5af35f.jpg","networkProfileCode":"micole-cayton","hasDeals":true,"activityMin":16,"sports":["Basketball","Women\'s Basketball"],"currentTeams":["Minnesota Golden Gophers"],"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0,"twitterReach":1273,"instagramReach":5700,"linkedInReach":null}'

I am trying to extract first_name, last_name and a few other parameters from this string as lists, so that, for example, I end up with a first_name list holding every first name in the string.

I tried re.findall('"firstName":'"(.*)\S$",st) to get the text "Quaran", but the result comes back in the following format:

'"Quaran","lastName":"McPherson","accountIdentifier":"StudentAthlete","profilePicUrl":"https://pbs.twimg.com/profile_images/1331475329014181888/4z19KrCf.jpg","networkProfileCode":"quaran-mcpherson","hasDeals":true,"activityMin":11,"sports":["Men\'s Basketball","Basketball"],"currentTeams":["Nebraska Cornhuskers"],"previousTeams":[],"facebookReach":null,"twitterReach":619,"instagramReach":0,"linkedInReach":null}

How do I specify within the regex that the match should end at the closing quote of the name?

TIA

CodePudding user response:

Try this regex: (?<=\"firstName\":\").*?(?=\"). The ? after .* makes the match lazy, so it stops as soon as it reaches the next " character.
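
In Python, this pattern could be applied with re.findall, which returns every match as a list. A minimal sketch, assuming a shortened stand-in for the scraped string (the real st is much longer), and reusing the same pattern for lastName as an illustration:

```python
import re

# Shortened stand-in for the scraped string `st` from the question (assumption).
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson"},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton"}')

# Lookbehind anchors the match right after "firstName":" and the lazy .*?
# stops at the first closing quote found by the lookahead.
first_names = re.findall(r'(?<="firstName":").*?(?=")', st)
last_names = re.findall(r'(?<="lastName":").*?(?=")', st)

print(first_names)
print(last_names)
```

The backslashes before the quotes in the pattern above are not needed inside a Python raw string, so they are dropped here.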

CodePudding user response:

Your string looks like the contents of a JSON array, and valid JSON is easy to parse in any language. To make the string valid, add '[' at the start and ']' at the end, then parse it with your language's JSON parser. For example:

JavaScript:

JSON.parse("[" + st + "]")

Python:

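import json
# Wrap the fragment in brackets so it is a valid JSON array, then parse it.
# (Avoid naming the result `dict`, which would shadow the built-in.)
records = json.loads("[" + st + "]")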
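
Once the string is parsed, the per-field lists the question asks for can be built with list comprehensions. A minimal sketch, assuming a shortened stand-in for the scraped string; the names records, first_names and last_names are only illustrative:

```python
import json

# Shortened stand-in for the scraped string `st` from the question (assumption).
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson",'
      '"previousTeams":[],"facebookReach":null},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton",'
      '"previousTeams":["Cal Berkeley Golden Bears"],"facebookReach":0}')

# The fragment is a comma-separated run of objects, so wrapping it in
# brackets turns it into a valid JSON array.
records = json.loads("[" + st + "]")

# One list per field, in the order the records appear.
first_names = [record["firstName"] for record in records]
last_names = [record["lastName"] for record in records]

print(first_names)
print(last_names)
```

Unlike a regex, this also handles nested arrays, null values and URLs in the data without extra work.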

Regular expression:

If you strictly want to parse with a regular expression, use:

/(?:\"|\')(?<key>[\w\d]+)(?:\"|\')(?:\:\s*)(?:\"|\')?(?<value>[\w\s-]*)(?:\"|\')?/gm
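
A rough Python adaptation of this key/value pattern, as a sketch rather than a drop-in: Python's re module spells named groups (?P<name>...), and the string below is a shortened stand-in for the scraped data. The grouping into by_key is an illustrative addition.

```python
import re

# Shortened stand-in for the scraped string `st` from the question (assumption).
st = ('{"accountId":405266,"firstName":"Quaran","lastName":"McPherson"},'
      '{"accountId":375964,"firstName":"Micole","lastName":"Cayton"}')

# Same structure as the pattern above, with Python-style named groups.
pattern = re.compile(
    r'''(?:"|')(?P<key>[\w\d]+)(?:"|')(?::\s*)(?:"|')?(?P<value>[\w\s-]*)(?:"|')?'''
)

# Collect every captured value under its key, in order of appearance.
by_key = {}
for match in pattern.finditer(st):
    by_key.setdefault(match.group("key"), []).append(match.group("value"))

print(by_key["firstName"])
print(by_key["lastName"])
```

Note that the value class [\w\s-]* only covers word characters, whitespace and hyphens, so values such as URLs, arrays, or names containing an apostrophe would be cut short on the full string.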

CodePudding user response:

Try this:

(?<="firstName":")[^"\r\n]+

(?<="firstName":") moves to the point where "firstName":" appears in the string,

[^"\r\n]+ then matches one or more characters other than ", \r and \n, so the match does not run past the closing double quote of the firstName value or across a newline.
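
To see why the negated character class helps, it can be compared with a greedy .* on the same input. A minimal sketch, assuming a shortened stand-in for the scraped string and a simplified greedy pattern (not the exact one from the question):

```python
import re

# Shortened stand-in for the scraped string `st` from the question (assumption).
st = ('{"firstName":"Quaran","lastName":"McPherson"},'
      '{"firstName":"Micole","lastName":"Cayton"}')

# Greedy: .* runs to the end of the line and backtracks only to the LAST quote,
# so everything after the first name is swallowed into a single match.
greedy = re.findall(r'"firstName":"(.*)"', st)

# Negated class: matching cannot cross a double quote, so each match
# ends at the closing quote of its own value.
bounded = re.findall(r'(?<="firstName":")[^"\r\n]+', st)

print(greedy)
print(bounded)
```

The greedy version yields one long match starting at Quaran, while the negated class yields one entry per record.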

