How to create an array of NA or Null values in Python?-CodePudding

This is easy to do in R and I am wondering if it is straight forward in Python and I am just missing something, but how do you create a vector of NaN values and Null values in Python? I am trying to do this using the np.full function.

R Code:

vec <- vector("character", 15)
vec[1:15] <- NA
vec

Python Code

unknowns = np.full(shape = 5, fill_value = ???, dtype = 'str')

'''test if fill value worked or not'''

random.seed(1177)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])

example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

This should lead to 5 counts of unknown in the value counts total. Ideally I would like to know how to write this fill_value for NaN and Null and know whether it differs for variable types. I have tried np.nan with and without the string data type. I have tried None and Null with and without quotes. I cannot think of anything else to try and I am starting to wonder if it is possible. Thank you in advance and I apologize if this question is already addressed and for my lack of knowledge in this area.

CodePudding user response：

you could use either None or np.nan to create an array of just missing values in Python like so:

np.full(shape=5, fill_value=None)
np.full(shape=5, fill_value=np.nan)

back to your example, this works just fine:

import numpy as np
import pandas as pd

unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

Lastly, this line is inefficient. example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

You do want to avoid loops & list comprehensions when using pandas

on large data, this is going to run much faster: example['transformed'] = example.categories.apply(lambda s: s if s else 'unknown')

CodePudding user response：

There is a typing problem here.

If you're working in numpy, vectors are typed after being initialized. Assigning a np.nan value to a vector initialized with strings will try to coalesce back into a string:

import numpy as np

v1 = np.array(['a', 'b', 'c'])
v1[0] = np.nan
# v1 = array(['n', 'b', 'c'], dtype='<U1')

v2 = np.array(['ab', 'cd', 'ef'])
v2[0] = np.nan
# v2 = array(['na', 'cd', 'ef'], dtype='<U2')

v3 = np.array(['abc', 'def', 'ghi'])
v3[0] = np.nan
# v3 = array(['nan', 'def', 'ghi'], dtype='<U3')

However, if you're working with pandas in the second half of the question, there's a separate way for handling missing data:

import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})

CodePudding user response：

A simple way to create an empty Series in pandas:

s = pd.Series(index=range(15))

Output:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
dtype: float64

Or, with a string dtype:

s = pd.Series(index=range(15), dtype='string')

Output:

0     <NA>
1     <NA>
2     <NA>
3     <NA>
4     <NA>
5     <NA>
6     <NA>
7     <NA>
8     <NA>
9     <NA>
10    <NA>
11    <NA>
12    <NA>
13    <NA>
14    <NA>
dtype: string