I am fetching data from the YouTube API V3 using the 'googleapiclient' Python library.
The response is large, and all I want is to parse a few keys and append them to a CSV file.
Each page has 50 items and I estimate around 50 pages, so if I loop over every item to extract the keys I am looking at roughly 2,500 iterations, maybe more.
So I decided to use pandas to speed up the process, but I don't know how yet.
Could you give me an idea of how to speed up the parsing, preferably using pandas?
Here is one page of the response, showing a single item:
{
  "kind": "youtube#searchListResponse",
  "etag": "9C4YPSA6KJ2_ZQe6k0khyWyZw4U",
  "nextPageToken": "CDIQAA",
  "regionCode": "DE",
  "pageInfo": { "totalResults": 569, "resultsPerPage": 50 },
  "items": [
    {
      "kind": "youtube#searchResult",
      "etag": "-tjutsrDQfkNJkMufUBxwHakEkE",
      "id": { "kind": "youtube#video", "videoId": "wnnKjI1m2Ug" },
      "snippet": {
        "publishedAt": "2019-11-14T10:00:11Z",
        "channelId": "UCVdfgrCLfJQfO5EgPlzaYAQ",
        "title": "Was ist XML? Einfach und schnell erkl\u00e4rt!",
        "description": "Werbung: Jetzt Premium Mitgliedschaft sichern ...",
        "thumbnails": {
          "default": {
            "url": "https://i.ytimg.com/vi/wnnKjI1m2Ug/default.jpg",
            "width": 120,
            "height": 90
          },
          "medium": {
            "url": "https://i.ytimg.com/vi/wnnKjI1m2Ug/mqdefault.jpg",
            "width": 320,
            "height": 180
          },
          "high": {
            "url": "https://i.ytimg.com/vi/wnnKjI1m2Ug/hqdefault.jpg",
            "width": 480,
            "height": 360
          }
        },
        "channelTitle": "Programmieren Starten",
        "liveBroadcastContent": "none",
        "publishTime": "2019-11-14T10:00:11Z"
      }
    }
  ]
}
I would like to extract from each item:
['id']['videoId']
['snippet']['title']
['snippet']['channelTitle']
Thank you.
CodePudding user response:
I would be surprised if you could use pandas to help speed this up. Pandas is a library for manipulating and processing dataframes. Perhaps you could use pandas to construct a dataframe of this data, or to save it as a CSV, but I don't think it will help in the basic processing.
To process this data, I think you just need to apply the function you want, i.e. gather the three data points you are looking for from the data you have. Your response comes back as JSON, so parse it as JSON, take the items list, and for each item in that list extract the data you want.
import json

# Parse the raw response and grab the list of search results
item_list = json.loads(YOUR_RESPONSE)["items"]

def extract(item):
    # Pull videoId, title and channelTitle out of a single search result
    return [item["id"]["videoId"], item["snippet"]["title"], item["snippet"]["channelTitle"]]

for item in item_list:
    print(extract(item))
I'm not sure what you want to do with the extracted information once you have it, but this approach will let you get the values you care about out of the items.
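Since the question mentions appending the results to a CSV file, here is a minimal sketch of that step using the standard csv module (the file name results.csv is just an assumption; item_list and extract are the objects defined above):

import csv

# Append one row per item: videoId, title, channelTitle.
# "results.csv" is an assumed output file name.
with open("results.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    for item in item_list:
        writer.writerow(extract(item))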
CodePudding user response:
You could use pandas json_normalize to flatten the nested data. Combined with selecting the desired columns, this would give (assuming the input is a dict called data):
import pandas as pd
df = pd.json_normalize(data['items'])[['id.videoId', 'snippet.title', 'snippet.channelTitle']]
result:
|   | id.videoId  | snippet.title                             | snippet.channelTitle  |
|---|-------------|-------------------------------------------|-----------------------|
| 0 | wnnKjI1m2Ug | Was ist XML? Einfach und schnell erklärt! | Programmieren Starten |
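Because the question mentions many pages and appending to a CSV, one possible sketch is to normalize each page, concatenate the results, and append them to the file. Here, pages (a list of the per-page response dicts) and the file name videos.csv are assumptions for illustration:

import os
import pandas as pd

# pages: assumed list of response dicts, one per API call
frames = [
    pd.json_normalize(page['items'])[['id.videoId', 'snippet.title', 'snippet.channelTitle']]
    for page in pages
]
df = pd.concat(frames, ignore_index=True)

# Append to the CSV, writing the header only if the file does not exist yet
df.to_csv('videos.csv', mode='a', index=False, header=not os.path.exists('videos.csv'))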