Using Google Colab, how do I list more than 1,000 files from Google Drive with drive.files().list?


About once a month I get a Google Drive folder with lots of videos in it (usually around 700-800) and a spreadsheet whose column A gets populated with the names of all of the video files, in order of the timestamp in each video file name. I already have code that does this (posted below), but this time there are about 8,400 video files in the folder, and this algorithm has a pageSize limit of 1,000 (it was originally 100; I changed it to 1,000, but that's the highest value it will accept). How do I change this code to handle more than 1,000 files?

This is the part that initializes everything

!pip install gspread_formatting

import time
import gspread
from gspread import urls
from google.colab import auth
from datetime import datetime
from datetime import timedelta
from gspread_formatting import *
from googleapiclient.discovery import build
from oauth2client.client import GoogleCredentials
from google.auth import default


folder_id = '************************' # change to whatever folder the required videos are in

base_dir = '/Example/drive/videofolder' # change this to whatever folder path you want to grab videos from same as above

file_name_qry_filter = "name contains 'mp4' and name contains 'cam'"

file_pattern="cam*.mp4"

spreadSheetUrl = 'https://docs.google.com/spreadsheets/d/SpreadsheetIDExample/edit#gid=0'
data_drive_id = '***********' # This is the ID of the shared Drive


auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)
#gc = gspread.authorize(GoogleCredentials.get_application_default())
wb = gc.open_by_url(spreadSheetUrl)
sheet = wb.worksheet('Sheet1')

And this is the main part of the code

prevTimeStamp = None
prevHour = None

def dateChecker(fileName, prevHour):
  strippedFileName = fileName.strip(".mp4")             # get rid of the .mp4 from the end of the file name
  parsedFileName = strippedFileName.split("_")          # split the file name into an array of (0 = Cam#, 1 = yyyy-mm-dd, 2 = hh-mm-ss)
  timeStamp = parsedFileName[2]                         # Grabbed specifically the hh-mm-ss time section from the original file name
  parsedTimeStamp = timeStamp.split("-")                # split the time stamp into an array of (0 = hour, 1 = minute, 2 = second)
  hour = int(parsedTimeStamp[0])                              
  minute = int(parsedTimeStamp[1])
  second = int(parsedTimeStamp[2])                           # set hour, minute, and second to its own variable
  commentCell = "Reset"

  if prevHour is None:
    commentCell = " "

    prevHour = hour

  else:
  
    if 0 <= hour < 24:

      if hour == 0:
        if prevHour == 23:
          commentCell = " "
        else:
          commentCell = "Missing Video1"

      else:
        if hour - prevHour == 1:
          commentCell = " "
        else:
          commentCell = "Missing Video2"

    else:
      commentCell = "Error hour is not between 0 and 23"

    if minute != 0 or 1 < second < 60:
      commentCell = "Check Length"

  prevHour = hour

  return commentCell, prevHour
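
# A quick illustrative sanity check (not part of the original script); it assumes
# filenames shaped like cam<N>_<yyyy-mm-dd>_<hh-mm-ss>.mp4, matching the parsing above:
#   dateChecker("cam1_2023-01-05_13-00-00.mp4", 12)   # -> (" ", 13): hour follows prevHour
#   dateChecker("cam1_2023-01-05_15-00-00.mp4", 13)   # -> ("Missing Video2", 15): hour 14 was skipped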




# Drive query variables
parent_folder_qry_filter = "'" + folder_id + "' in parents"  # you shouldn't ever need to change this
query = file_name_qry_filter + " and " + parent_folder_qry_filter
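# the combined query ends up looking like:
# "name contains 'mp4' and name contains 'cam' and '<folder_id>' in parents"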
drive_service = build('drive', 'v3')

# Build request and call Drive API
page_token = None
response = drive_service.files().list(q=query,
                                      corpora='drive',
                                      supportsAllDrives='true',
                                      includeItemsFromAllDrives='true',
                                      driveId=data_drive_id,
                                      pageSize=1000,
                                      fields='nextPageToken, files(id, name, webViewLink)',  # you can add extra fields in the files() if you need more information about the files you're grabbing
                                      pageToken=page_token).execute()
i = 1
array = [[],[]]  
# Parse/print results
for file in response.get('files', []):
    array.insert(i-1, [file.get('name'), file.get('webViewLink')]) # If you add extra fields above, this is where you will have to start changing the code to accommodate them
    i = i + 1


array.sort()
array_sorted = [x for x in array if x]  # removes the empty placeholder lists left over from initializing array as [[],[]]; empty lists are falsy, so "if x" drops them
arrayLength = len(array_sorted)
print(arrayLength)
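
# Quick illustration (not part of the original script) of what the "if x" filter
# does: empty lists are falsy in Python, so the comprehension drops them.
#   [x for x in [['a'], [], ['b']] if x]   ->   [['a'], ['b']]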

commentCell = 'Error'

# for file_name in array_sorted:
#   date_gap, start_date, end_date = date_checker(file_name[0])
#   if prev_end_date == None:
#     print('hello')
#   elif start_date != prev_end_date:
#     date_gap = 'Missing Video'

for file_name in array_sorted:
  commentCell, prevHour = dateChecker(file_name[0], prevHour)

  time.sleep(0.3)
  #insertRow = [file_name[0], "Not Processed",  " ", date_gap, " ", " ", " ", " ", base_dir + '/' + file_name[0], " ", file_name[1], " ", " ", " "]
  insertRow = [file_name[0], "Not Processed",  " ", commentCell, " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " ", " "]
  sheet.append_row(insertRow, value_input_option='USER_ENTERED')

Now I know the problem has to do with the

page_token = None
response = drive_service.files().list(q=query,
                                      corpora='drive',
                                      supportsAllDrives='true',
                                      includeItemsFromAllDrives='true',
                                      driveId=data_drive_id,
                                      pageSize=1000,
                                      fields='nextPageToken, files(id, name, webViewLink)',  # you can add extra fields in the files() if you need more information about the files you're grabbing
                                      pageToken=page_token).execute()

in the middle of the main part of the code. I've already tried just changing the pageSize limit to 10,000, but I knew that wouldn't work, and I was right; it came back with:

HttpError: <HttpError 400 when requesting https://www.googleapis.com/drive/v3/files?q=name contains 'mp4' and name contains 'cam' and '1ANmLGlNr-Cu0BvH2aRrAh_GXEDk1nWvf' in parents&corpora=drive&supportsAllDrives=true&includeItemsFromAllDrives=true&driveId=0AF92uuRq-00KUk9PVA&pageSize=10000&fields=nextPageToken, files(id, name, webViewLink)&alt=json returned "Invalid value '10000'. Values must be within the range: [1, 1000]". Details: "Invalid value '10000'. Values must be within the range: [1, 1000]">

The one idea I have is to have multiple pages with 1,000 files each and then iterate through them, but I barely understood how this part of the code worked a year ago when I set it up, and since then I haven't touched Google Colab except to run this algorithm. Every time I try to google this, or look up the Google Drive API, everything comes back with how to download or upload a couple of files, when what I need is just a list of the names of all the files.

CodePudding user response:

The documentation explains how to use the pageToken for pagination (the page is for the Calendar API, but it works the same way in Drive):

In order to retrieve the next page, perform the exact same request as previously and append a pageToken field with the value of nextPageToken from the previous page. A new nextPageToken is provided on the following pages until all the results are retrieved.

Essentially you want a loop where you run files.list(), retrieve the pageToken, and run it again while feeding it the previous token until you stop getting tokens.

For your specific scenario you can try to replace the "problem" snippet with the following:

    page_token = ""
    filelist = {}
    while True:
        response = drive_service.files().list(q=query,
                                    corpora='drive',
                                    supportsAllDrives='true',
                                    includeItemsFromAllDrives='true',
                                    driveId=data_drive_id,
                                    pageSize=1000,
                                    fields='nextPageToken, files(id, name, webViewLink)',
                                    pageToken=page_token).execute()

        page_token = response.get('nextPageToken', None)
        filelist.setdefault("files",[]).extend(response.get('files'))

        if (not page_token):
            break

    response = filelist

This does what I described: it loops files.list(), adds each page's results to the filelist variable, and breaks the loop when the API stops returning page tokens. At the end I assign filelist to the response variable, since that's what the rest of your code uses; it should parse the same way, just with the full list of results this time.
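
As an aside, the Python client also provides a files().list_next(previous_request, previous_response) helper that builds the follow-up request for you, so a sketch of the same loop using it (same parameters as above) looks like this:

    # Build the first request (not executed yet)
    request = drive_service.files().list(q=query,
                                corpora='drive',
                                supportsAllDrives='true',
                                includeItemsFromAllDrives='true',
                                driveId=data_drive_id,
                                pageSize=1000,
                                fields='nextPageToken, files(id, name, webViewLink)')
    filelist = {"files": []}
    while request is not None:
        response = request.execute()
        filelist["files"].extend(response.get('files', []))
        # list_next() returns a request for the next page, or None when done
        request = drive_service.files().list_next(request, response)

    response = filelist

Either way the result ends up in the same shape, so the rest of your script works unchanged.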
