Unable to preprocess data and create DataFrame

Time:07-28

I am receiving this dictionary of strings as an API response, which will be further used as an input for another task.

{' cAVDNIIB ': ' pattern not matched: "[2022-07-25 06:40:51.147] [Information] LW () <RDP> Bot execution completed - JOBNAME:\\"TEST_NAME_1\\" JOBID:\\"2022072564027\\" TYPE:\\"Desktop\\" startTime:\\"1658731228\\" endTime:\\"1658731251\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"completed\\" status:\\"successful\\" Action:\\"run\\"" ',
 ' cQVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.711] [Information] LW () <RDP> Bot consumer listening for messages." ',
 ' cgVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.714] [Information] LW () <RDP> Message received from queue - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" Action:\\"run\\"" ',
 ' cwVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.717] [Information] LW () <RDP> Bot action:\\"run\\"" ',
 ' dAVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.717] [Information] LW () <RDP>  [x] \\"{\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"rdpRequired\\\\\\":false,\\\\\\"state\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"status\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"requestorId\\\\\\":null,\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"startTime\\\\\\":null,\\\\\\"endTime\\\\\\":null,\\\\\\"botstatus\\\\\\":null,\\\\\\"action\\\\\\":\\\\\\"run\\\\\\"}\\"" ',
 ' dQVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.718] [Information] LW () <RDP> Creating background process \'\\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\Simulator.exe\\"\'" ',
 ' gQVENIIB ': ' pattern not matched: "[2022-07-25 07:30:11.743] [Information] LW () <RDP> Bot execution started - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"Running\\" status:\\"Running\\" Action:\\"run\\"" ',
 ' ggVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.368] [Information] LW () <RDP> Published message to response queue: ExchangeName:\\"liteportal.exchange\\", Message:\\"{\\\\\\"requestorId\\\\\\":null,\\\\\\"rdpRequired\\\\\\":false,\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"startTime\\\\\\":\\\\\\"1658734211\\\\\\",\\\\\\"endTime\\\\\\":\\\\\\"1658734283\\\\\\",\\\\\\"status\\\\\\":\\\\\\"successful\\\\\\",\\\\\\"action\\\\\\":\\\\\\"run\\\\\\",\\\\\\"state\\\\\\":\\\\\\"completed\\\\\\",\\\\\\"botMachineName\\\\\\":\\\\\\"LW\\\\\\",\\\\\\"botMachineOS\\\\\\":\\\\\\"WindowsXP 6.2.9200.0\\\\\\",\\\\\\"botMachineIP\\\\\\":\\\\\\"SOME_IP\\\\\\",\\\\\\"Message\\\\\\":null}\\" " ',
 ' gwVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.373] [Information] LW () <RDP> Log file path: \\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\logs\\\\Logs-2022_07_25.log\\"" ',
 ' hAVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.374] [Information] LW () <RDP> Log file exists" ',
 ' hgVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.384] [Information] LW () <RDP> Bot log file deleted: \\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\logs\\\\Logs-2022_07_25.log\\"" '}

I need to convert this input into a DataFrame, keeping only the entries that contain all of these values:

jobid          jobname      type     startTime   endTime     state      status
2022072564027  TEST_NAME_1  Desktop  1658731228  1658731251  completed  successful
2022072573011  TEST_NAME    Desktop  1658734211  1658734283  completed  successful

I tried parsing the values to remove the special characters and backslashes, and I also tried regex to extract the desired matches, but I could not iterate over every entry and get the correct matches.

Please share your thoughts and suggest the best possible solution in Python. Thank you for your help!

CodePudding user response:

Definitely regex. Use one regex search to get close to the item you need and a second one to extract its value. For example, to extract JOBID, I import re, Python's regular expression module, and call re.search to locate the JOBID label in the string value. The match's span() method gives the index where the label ends, so I slice the string from that index and run a second search for the digits that follow. I then read the value with group() and append it to a list. Note that not every entry contains a JOBID; for those I append "NaN" to the list instead.

Code Example:

import re

data = {
 ' cAVDNIIB ': ' pattern not matched: "[2022-07-25 06:40:51.147] [Information] LW () <RDP> Bot execution completed - JOBNAME:\\"TEST_NAME_1\\" JOBID:\\"2022072564027\\" TYPE:\\"Desktop\\" startTime:\\"1658731228\\" endTime:\\"1658731251\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"completed\\" status:\\"successful\\" Action:\\"run\\"" ',
 ' cQVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.711] [Information] LW () <RDP> Bot consumer listening for messages." ',
 ' cgVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.714] [Information] LW () <RDP> Message received from queue - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" Action:\\"run\\"" ',
 ' cwVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.717] [Information] LW () <RDP> Bot action:\\"run\\"" ',
 ' dAVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.717] [Information] LW () <RDP>  [x] \\"{\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"rdpRequired\\\\\\":false,\\\\\\"state\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"status\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"requestorId\\\\\\":null,\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"startTime\\\\\\":null,\\\\\\"endTime\\\\\\":null,\\\\\\"botstatus\\\\\\":null,\\\\\\"action\\\\\\":\\\\\\"run\\\\\\"}\\"" ',
 ' dQVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.718] [Information] LW () <RDP> Creating background process \'\\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\Simulator.exe\\"\'" ',
 ' gQVENIIB ': ' pattern not matched: "[2022-07-25 07:30:11.743] [Information] LW () <RDP> Bot execution started - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"Running\\" status:\\"Running\\" Action:\\"run\\"" ',
 ' ggVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.368] [Information] LW () <RDP> Published message to response queue: ExchangeName:\\"liteportal.exchange\\", Message:\\"{\\\\\\"requestorId\\\\\\":null,\\\\\\"rdpRequired\\\\\\":false,\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"startTime\\\\\\":\\\\\\"1658734211\\\\\\",\\\\\\"endTime\\\\\\":\\\\\\"1658734283\\\\\\",\\\\\\"status\\\\\\":\\\\\\"successful\\\\\\",\\\\\\"action\\\\\\":\\\\\\"run\\\\\\",\\\\\\"state\\\\\\":\\\\\\"completed\\\\\\",\\\\\\"botMachineName\\\\\\":\\\\\\"LW\\\\\\",\\\\\\"botMachineOS\\\\\\":\\\\\\"WindowsXP 6.2.9200.0\\\\\\",\\\\\\"botMachineIP\\\\\\":\\\\\\"SOME_IP\\\\\\",\\\\\\"Message\\\\\\":null}\\" " ',
 ' gwVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.373] [Information] LW () <RDP> Log file path: \\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\logs\\\\Logs-2022_07_25.log\\"" ',
 ' hAVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.374] [Information] LW () <RDP> Log file exists" ',
 ' hgVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.384] [Information] LW () <RDP> Bot log file deleted: \\"C:\\\\Tools\\\\Flows\\\\Desktop\\\\TEST_NAME\\\\logs\\\\Logs-2022_07_25.log\\"" '}

## to collect the values you need
jobid_temp = []

for k,v in data.items():
    ## locate the JOBID label in the string
    search0 = re.search('JOBID', v)

    ## re.search returns None when the pattern is not found
    if search0 is None:
        jobid_temp.append("NaN") ## append NaN value to the list

        ## for illustration purpose
        print(k, 'Nothing Exist')

    else:
        ## use index to make the string smaller
        string1 = v[search0.span()[1]:]
        ## extract the value
        jobid = re.search(r'\d+', string1).group(0)
        jobid_temp.append(jobid)

        ## for illustration purpose
        print(k, jobid)

Output from print statement:

 cAVDNIIB  2022072564027
 cQVDNIIB  Nothing Exist
 cgVDNIIB  2022072573011
 cwVDNIIB  Nothing Exist
 dAVDNIIB  2022072573011
 dQVDNIIB  Nothing Exist
 gQVENIIB  2022072573011
 ggVENIIB  2022072573011
 gwVENIIB  Nothing Exist
 hAVENIIB  Nothing Exist
 hgVENIIB  Nothing Exist
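
The same idea can be extended to all seven requested fields. The following is only a sketch, not a definitive solution: the names FIELDS, extract_fields, sample, and rows are hypothetical, sample is a four-entry subset copied from the question, and the pattern assumes every value is a single token with no spaces, which holds for the fields asked for here. An entry is skipped as soon as one field is missing or is the literal null.

```python
import re

# Fields requested in the question, spelled as they appear in the log text.
FIELDS = ["JOBID", "JOBNAME", "TYPE", "startTime", "endTime", "state", "status"]

def extract_fields(message):
    """Return a dict with all FIELDS, or None if any is missing or null."""
    row = {}
    for field in FIELDS:
        # Field name, optional backslashes/quotes, a colon, optional
        # backslashes/quotes, then the value up to the next delimiter.
        match = re.search(rf'\b{field}[\\"]*:[\\"]*([^\\",}}\s]+)', message)
        if match is None or match.group(1) == "null":
            return None
        row[field] = match.group(1)
    return row

# Subset of the dictionary from the question (assumption: the real input has the same shape).
sample = {
 ' cAVDNIIB ': ' pattern not matched: "[2022-07-25 06:40:51.147] [Information] LW () <RDP> Bot execution completed - JOBNAME:\\"TEST_NAME_1\\" JOBID:\\"2022072564027\\" TYPE:\\"Desktop\\" startTime:\\"1658731228\\" endTime:\\"1658731251\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"completed\\" status:\\"successful\\" Action:\\"run\\"" ',
 ' cgVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.714] [Information] LW () <RDP> Message received from queue - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" Action:\\"run\\"" ',
 ' dAVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.717] [Information] LW () <RDP>  [x] \\"{\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"rdpRequired\\\\\\":false,\\\\\\"state\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"status\\\\\\":\\\\\\"Running\\\\\\",\\\\\\"requestorId\\\\\\":null,\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"startTime\\\\\\":null,\\\\\\"endTime\\\\\\":null,\\\\\\"botstatus\\\\\\":null,\\\\\\"action\\\\\\":\\\\\\"run\\\\\\"}\\"" ',
 ' ggVENIIB ': ' pattern not matched: "[2022-07-25 07:31:23.368] [Information] LW () <RDP> Published message to response queue: ExchangeName:\\"liteportal.exchange\\", Message:\\"{\\\\\\"requestorId\\\\\\":null,\\\\\\"rdpRequired\\\\\\":false,\\\\\\"JOBNAME\\\\\\":\\\\\\"TEST_NAME\\\\\\",\\\\\\"JOBID\\\\\\":\\\\\\"2022072573011\\\\\\",\\\\\\"TYPE\\\\\\":\\\\\\"Desktop\\\\\\",\\\\\\"startTime\\\\\\":\\\\\\"1658734211\\\\\\",\\\\\\"endTime\\\\\\":\\\\\\"1658734283\\\\\\",\\\\\\"status\\\\\\":\\\\\\"successful\\\\\\",\\\\\\"action\\\\\\":\\\\\\"run\\\\\\",\\\\\\"state\\\\\\":\\\\\\"completed\\\\\\",\\\\\\"botMachineName\\\\\\":\\\\\\"LW\\\\\\",\\\\\\"botMachineOS\\\\\\":\\\\\\"WindowsXP 6.2.9200.0\\\\\\",\\\\\\"botMachineIP\\\\\\":\\\\\\"SOME_IP\\\\\\",\\\\\\"Message\\\\\\":null}\\" " ',
}

# Keep only the entries for which every field was found.
rows = []
for message in sample.values():
    row = extract_fields(message)
    if row is not None:
        rows.append(row)

for row in rows:
    print(row)
```

Each element of rows is a plain dict keyed by field name, so with pandas installed, pandas.DataFrame(rows) should turn the list into a frame with one column per field. The \b in front of the field name keeps status from matching inside botstatus.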

CodePudding user response:

I think you can loop through it entry by entry and use a substring check, as described in the question "Does Python have a string 'contains' substring method?" (in short, the in operator).

Otherwise, you would have to be a little more specific about what the problem is you're running into.
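
A minimal sketch of that substring check, assuming the goal is to pre-filter entries that mention every requested field name. The names REQUIRED, has_all_fields, sample, and candidates are hypothetical, and sample is a three-entry subset copied from the question.

```python
# Field names that must all occur in a message for it to be a candidate.
REQUIRED = ("JOBID", "JOBNAME", "TYPE", "startTime", "endTime", "state", "status")

def has_all_fields(message):
    """Return True when every required field name occurs somewhere in the message."""
    return all(name in message for name in REQUIRED)

# Subset of the dictionary from the question.
sample = {
 ' cAVDNIIB ': ' pattern not matched: "[2022-07-25 06:40:51.147] [Information] LW () <RDP> Bot execution completed - JOBNAME:\\"TEST_NAME_1\\" JOBID:\\"2022072564027\\" TYPE:\\"Desktop\\" startTime:\\"1658731228\\" endTime:\\"1658731251\\" botMachineName:\\"LW\\" botMachineIP:\\"SOME_IP\\" state:\\"completed\\" status:\\"successful\\" Action:\\"run\\"" ',
 ' cQVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.711] [Information] LW () <RDP> Bot consumer listening for messages." ',
 ' cgVDNIIB ': ' pattern not matched: "[2022-07-25 07:30:11.714] [Information] LW () <RDP> Message received from queue - JOBNAME:\\"TEST_NAME\\" JOBID:\\"2022072573011\\" TYPE:\\"Desktop\\" startTime:\\"1658734211\\" Action:\\"run\\"" ',
}

# Keep the entries whose message mentions every required field name.
candidates = {key: msg for key, msg in sample.items() if has_all_fields(msg)}
print(list(candidates))
```

A substring check only confirms that the field names are present. An entry such as dAVDNIIB in the question mentions startTime and endTime but gives them as null, so the values still have to be extracted and checked afterwards.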
