Home > Software design >  Why does .decode() result in NaN values for byte strings that resulted from the pd.read_sql_query fu
Why does .decode() result in NaN values for byte strings that resulted from the pd.read_sql_query fu

Time:07-01

I have a MS SQL database that I want to query with Python. I use the following snippet:

cnxn = pypyodbc.connect("Driver={SQL Server Native Client 11.0};"
                        "Server=xxxx;"
                        "Database=xxxx;"
                        "Trusted_Connection=yes;")

query = 'select * from orders'
df = pd.read_sql_query(query)

This returns the following dataframe:

                      id    ordernumber
0  b'xxxx-xxxx-xxxx-xxxx'   123456789
1  b'xxxx-xxxx-xxxx-xxxx'   123456789
2  b'xxxx-xxxx-xxxx-xxxx'   123456789

The columns are both of type object. The id column is supposed to be a string, but it is a byte string in this output. How do I transform this column to a regular string object? I tried the following (from here):

df['id'].str.decode('utf-8')

... but that transforms the entire id column into np.nan values. How do I properly transform my DataFrame id column into a regular string object?

Dict to recreate this DataFrame:

{'id': {0: "b'DE91EBA3-313D-463C-B948-00010AA26136'",
  1: "b'316E587F-7FDD-4FBA-8778-0001E7783025'",
  2: "b'F6E50A95-A3E8-45D3-8E79-000210CCA14A'",
  3: "b'EE20A958-0CD6-4144-9743-00024D3E703F'",
  4: "b'58AFA9B9-4B10-47D3-8840-000AAD9CBF00'"},
 'ordernumber': {0: '500358152',
  1: '500489603',
  2: '500454759',
  3: '500512969',
  4: '500517135'}}

CodePudding user response:

You don't get bytes b'DE91EBA3-313D-463C-B948-00010AA26136'
but string "b'DE91EBA3-313D-463C-B948-00010AA26136'" and this makes difference.

You have to only remove b' ' from string

df['id'].str.replace("b'", "").str.replace("'", "")

or simpler

df['id'].str[2:-1]

Minimal working code:

import pandas as pd

df = pd.DataFrame({
  'id': {
    0: "b'DE91EBA3-313D-463C-B948-00010AA26136'",
    1: "b'316E587F-7FDD-4FBA-8778-0001E7783025'",
    2: "b'F6E50A95-A3E8-45D3-8E79-000210CCA14A'",
    3: "b'EE20A958-0CD6-4144-9743-00024D3E703F'",
    4: "b'58AFA9B9-4B10-47D3-8840-000AAD9CBF00'"
  },
  'ordernumber': {
    0: '500358152',
    1: '500489603',
    2: '500454759',
    3: '500512969',
    4: '500517135'
  }
})

#df['id'] = df['id'].str[2:-1]
df['id'] = df['id'].str.replace("b'", "").str.replace("'", "")

print(df)

Result:

                                     id ordernumber
0  DE91EBA3-313D-463C-B948-00010AA26136   500358152
1  316E587F-7FDD-4FBA-8778-0001E7783025   500489603
2  F6E50A95-A3E8-45D3-8E79-000210CCA14A   500454759
3  EE20A958-0CD6-4144-9743-00024D3E703F   500512969
4  58AFA9B9-4B10-47D3-8840-000AAD9CBF00   500517135
  • Related