How do I convert bytes to utf-8 without turning regular strings into NaNs?

Time:12-15

I have a process that runs on multiple pandas dataframes. Sometimes the data comes in the form of bytes, such as:

>>> df[['x']]
        x
0  b'123'
1  b'111'
2  b'110'

At other times it arrives as regular integers:

>>> df[['x']]
     x
0   80
1  123
2  491

I want to decode the bytes as UTF-8 and leave the regular integers untouched. So far I have tried df['x'].str.decode('utf-8'). It works when the column holds bytes, but it turns every value into NaN when the column holds integers.

I want the solution to be vectorized because speed is important. I can't use list comprehension, for example.

CodePudding user response:

One way to do what you've asked is to infer the dtype for the column and only attempt to convert it from bytes if it's non-numeric:

if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')

Test code:

import pandas as pd

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

df = pd.DataFrame({'x':[80,123,491]})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

Output:

before
        x
0  b'123'
1  b'111'
2  b'110'

after
     x
0  123
1  111
2  110

before
     x
0   80
1  123
2  491

after
     x
0   80
1  123
2  491
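Since the question mentions running the same process on several dataframes, the dtype check above could be wrapped in a small helper. This is only a sketch: the name decode_columns_if_bytes and its parameters are made up for illustration, and it assumes each listed column is either entirely bytes or entirely numeric.

```python
import pandas as pd


def decode_columns_if_bytes(df, columns, encoding="utf-8"):
    """Return a copy of df with the non-numeric columns in `columns` decoded.

    Hypothetical helper: assumes each listed column is either all bytes
    or all numeric, as in the two examples above.
    """
    out = df.copy()
    for name in columns:
        col = out[name].infer_objects()
        # Numeric columns are skipped, so they never go through .str.decode.
        if not pd.api.types.is_numeric_dtype(col.dtype):
            out[name] = col.str.decode(encoding)
    return out


frames = [
    pd.DataFrame({"x": [b"123", b"111", b"110"]}),
    pd.DataFrame({"x": [80, 123, 491]}),
]

cleaned = []
for frame in frames:
    cleaned.append(decode_columns_if_bytes(frame, ["x"]))

for frame in cleaned:
    print(frame["x"].tolist())
```

The helper works on a copy, so the input dataframes are left unchanged.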

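UPDATE: If only some of the values in the column are bytes (for example b'80', 123, 491), this will work. Note that every value, including the original integers, ends up as a string: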

import pandas as pd
import numpy as np

df = pd.DataFrame({'x':[b'80',123,491]})
print('','before',df,sep='\n')
df.x = np.where(df.x.astype(np.int64) == df.x, df.x.astype(str).str.encode('utf-8'), df.x)
df.x = df.x.str.decode('utf-8')
print('','after',df,sep='\n')

Output:

before
       x
0  b'80'
1    123
2    491

after
     x
0   80
1  123
2  491
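The np.where approach above round-trips the integers through bytes, so they come out as strings. If the integers should stay integers, one possible alternative is to decode the column and then fall back to the original value wherever decoding produced NaN. The sketch below assumes that .str.decode returns NaN for non-bytes entries of an object column (the behaviour described in the question); the function name decode_bytes_keep_rest is made up for illustration.

```python
import pandas as pd


def decode_bytes_keep_rest(col, encoding="utf-8"):
    """Decode bytes values in `col` and leave every other value as it is.

    Hypothetical helper, not part of pandas.
    """
    col = col.infer_objects()
    if not pd.api.types.is_object_dtype(col):
        # A non-object column (for example int64) holds no bytes objects.
        return col
    # Entries that are not bytes are expected to come back as NaN here...
    decoded = col.str.decode(encoding)
    # ...so keep the original value wherever decoding produced NaN.
    return col.where(decoded.isna(), decoded)


mixed = pd.Series([b"80", 123, 491])
result = decode_bytes_keep_rest(mixed)
print(result.tolist())
```

Both the decode and the where call operate on the whole column, so there is no explicit Python-level loop or list comprehension in user code.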

CodePudding user response:

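You can define a function that checks the type of each value before decoding it. Note that apply calls the function once per element, so this is not a vectorized solution. Something like: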

import pandas as pd

# Define the decode_if_bytes function
def decode_if_bytes(input_str):
    if isinstance(input_str, bytes):
        return input_str.decode('utf-8')
    return input_str

Decode df

# Apply the function to the dataframe
df = pd.DataFrame({'x':[b'80',123,491]})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0   80
1  123
2  491

Decode another df

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0  123
1  111
2  110