How do I convert bytes to utf-8 without turning regular strings into NaNs?

I have a process that runs on multiple pandas dataframes. Sometimes the data comes in the form of bytes, such as:

>>> df[['x']]
        x
0  b'123'
1  b'111'
2  b'110'

And other times it comes in the form of regular integers:

>>> df[['x']]
     x
0   80
1  123
2  491

I want to decode the bytes as UTF-8 and leave the regular integers untouched. So far I have tried df['x'].str.decode('utf-8'). It works when the column holds bytes, but it turns every value into NaN when the column holds integers.

I want the solution to be vectorized because speed is important. I can't use a list comprehension, for example.

CodePudding user response：

One way to do what you've asked is to infer the dtype for the column and only attempt to convert it from bytes if it's non-numeric:

if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')

Test code:

import pandas as pd

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

df = pd.DataFrame({'x':[80,123,491]})
print('','before',df,sep='\n')
if not pd.api.types.is_numeric_dtype(df['x'].infer_objects().dtypes):
    df['x'] = df['x'].str.decode('utf-8')
print('','after',df,sep='\n')

Output:

before
        x
0  b'123'
1  b'111'
2  b'110'

after
     x
0  123
1  111
2  110

before
     x
0   80
1  123
2  491

after
     x
0   80
1  123
2  491
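
If the same check has to run over several dataframes, it could be wrapped in a small helper. This is only a sketch: the name decode_bytes_columns is made up for illustration, and it tests for all-bytes columns with pd.api.types.infer_dtype rather than the is_numeric_dtype check above, so columns of ordinary strings are skipped as well.

```python
import pandas as pd

def decode_bytes_columns(df, encoding="utf-8"):
    """Return a copy of df with every all-bytes column decoded to str."""
    out = df.copy()
    for col in out.columns:
        # infer_dtype reports "bytes" only when every non-missing value is bytes
        if pd.api.types.infer_dtype(out[col], skipna=True) == "bytes":
            out[col] = out[col].str.decode(encoding)
    return out

bytes_df = decode_bytes_columns(pd.DataFrame({"x": [b"123", b"111", b"110"]}))
ints_df = decode_bytes_columns(pd.DataFrame({"x": [80, 123, 491]}))
print(bytes_df["x"].tolist())  # ['123', '111', '110']
print(ints_df["x"].tolist())   # [80, 123, 491]
```

Columns that mix bytes with other values are left alone by this helper, because infer_dtype does not report them as "bytes".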

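UPDATE: If only some of the values in the column are bytes, for example x = [b'80', 123, 491], this will work. Note that every value ends up as a string, including the former integers: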

import pandas as pd
import numpy as np

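df = pd.DataFrame({'x':[b'80',123,491]})
print('','before',df,sep='\n')
# Values that are already integers are converted to bytes, so the whole column becomes bytes
df.x = np.where(df.x.astype(np.int64) == df.x, df.x.astype(str).str.encode('utf-8'), df.x)
# Every value is bytes now, so decoding no longer produces NaN
df.x = df.x.str.decode('utf-8')
print('','after',df,sep='\n')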

Output:

before
       x
0  b'80'
1    123
2    491

after
     x
0   80
1  123
2  491
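
Another option for the mixed case, if the integers should stay integers, is to lean on the behaviour described in the question: .str.decode returns NaN for anything that isn't bytes, so those positions can be filled back in from the original column. The following is a sketch under that assumption, not a benchmarked solution, and decode_bytes_keep_rest is a hypothetical name.

```python
import pandas as pd

def decode_bytes_keep_rest(s, encoding="utf-8"):
    """Decode bytes values in s; leave every other value as it is."""
    try:
        # Entries that are not bytes come back as NaN here
        decoded = s.str.decode(encoding)
    except AttributeError:
        # pandas refuses .str on some columns, e.g. a plain int64 column
        return s
    # Keep the original value wherever decoding produced NaN
    return s.where(decoded.isna(), decoded)

mixed = decode_bytes_keep_rest(pd.Series([b"80", 123, 491]))
print(mixed.tolist())  # ['80', 123, 491]
```

The result stays an object column, with '80' as a str and 123 and 491 as the original integers.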

CodePudding user response：

You can define a function to first check before decoding. Something like:

import pandas as pd

# Define the decode_if_bytes function
def decode_if_bytes(input_str):
    if isinstance(input_str, bytes):
        return input_str.decode('utf-8')
    return input_str

Decode a dataframe with mixed values:

# Apply the function to the dataframe
df = pd.DataFrame({'x':[b'80',123,491]})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0   80
1  123
2  491

Decode another dataframe, this time with all values as bytes:

df = pd.DataFrame({'x':[b'123',b'111',b'110']})
df['x'] = df['x'].apply(decode_if_bytes)

print(df)

Output:

     x
0  123
1  111
2  110