How to create an array of NA or Null values in Python?


This is easy to do in R, and I am wondering if it is equally straightforward in Python and I am just missing something: how do you create a vector of NaN or Null values in Python? I am trying to do this with the np.full function.

R Code:

vec <- vector("character", 15)
vec[1:15] <- NA
vec

Python Code:

import numpy as np
import pandas as pd

unknowns = np.full(shape = 5, fill_value = ???, dtype = 'str')

# test if fill value worked or not

np.random.seed(1177)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])

example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

This should lead to 5 counts of 'unknown' in the value counts total. Ideally I would like to know what to use as the fill_value for NaN and Null, and whether it differs by variable type. I have tried np.nan with and without the string data type, and I have tried None and Null with and without quotes. I cannot think of anything else to try, and I am starting to wonder whether it is possible at all. Thank you in advance, and I apologize if this question has already been addressed and for my lack of knowledge in this area.

CodePudding user response:

You can use either None or np.nan to create an array of missing values in Python, like so:

np.full(shape=5, fill_value=None)
np.full(shape=5, fill_value=np.nan)
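
One thing to keep in mind (a quick check, not part of the snippet above): the two fill values produce different dtypes, since None gives an object array while np.nan gives a float array.

import numpy as np

a = np.full(shape=5, fill_value=None)    # dtype=object, elements are Python None
b = np.full(shape=5, fill_value=np.nan)  # dtype=float64, elements are NaN
print(a.dtype, b.dtype)                  # object float64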

Back to your example, this works just fine:

import numpy as np
import pandas as pd

unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

print(example['transformed'].value_counts())

Lastly, this line is inefficient:

example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]

You do want to avoid loops and list comprehensions when using pandas.

On large data, this is going to run much faster:

example['transformed'] = example.categories.apply(lambda s: s if s else 'unknown')
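
Here is a minimal end-to-end sketch of that replacement; the fillna column is an extra, fully vectorized alternative that is not part of the answer above and is shown only for comparison:

import numpy as np
import pandas as pd

unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size=15, replace=True)
example = pd.DataFrame({'categories': np.concatenate([categories, unknowns])})

# replacement via apply, as suggested above (None is falsy, so it becomes 'unknown')
example['transformed'] = example.categories.apply(lambda s: s if s else 'unknown')

# vectorized alternative using fillna (an assumption beyond the original answer)
example['transformed_alt'] = example['categories'].fillna('unknown')

print(example['transformed'].value_counts())   # 'unknown' should appear 5 times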

CodePudding user response:

There is a typing problem here.

If you're working in numpy, arrays are typed when they are initialized. Assigning a np.nan value to an array initialized with strings will coerce the NaN back into a string, truncated to the array's string width:

import numpy as np

v1 = np.array(['a', 'b', 'c'])
v1[0] = np.nan
# v1 = array(['n', 'b', 'c'], dtype='<U1')

v2 = np.array(['ab', 'cd', 'ef'])
v2[0] = np.nan
# v2 = array(['na', 'cd', 'ef'], dtype='<U2')

v3 = np.array(['abc', 'def', 'ghi'])
v3[0] = np.nan
# v3 = array(['nan', 'def', 'ghi'], dtype='<U3')
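
If you do need a numpy array of strings that can also hold a real missing value, one workaround (a sketch, not part of the answer above) is to use dtype=object, which avoids the string coercion:

import numpy as np

# object dtype stores arbitrary Python objects, so np.nan (or None) survives assignment
v4 = np.array(['abc', 'def', 'ghi'], dtype=object)
v4[0] = np.nan
# v4 = array([nan, 'def', 'ghi'], dtype=object)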

However, if you're working with pandas, as in the second half of the question, there is a separate way of handling missing data:

import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})
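
A quick check (mirroring the pd.isna test from the question; the fillna call is an assumption, not part of this answer) that pandas treats pd.NA as missing:

import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})
print(df["x"].isna())              # True, False, False
print(df["x"].fillna("unknown"))   # pd.NA is replaced, the strings are untouched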

CodePudding user response:

A simple way to create an empty Series in pandas:

s = pd.Series(index=range(15))

Output:

0     NaN
1     NaN
2     NaN
3     NaN
4     NaN
5     NaN
6     NaN
7     NaN
8     NaN
9     NaN
10    NaN
11    NaN
12    NaN
13    NaN
14    NaN
dtype: float64

Or, with a string dtype:

s = pd.Series(index=range(15), dtype='string')

Output:

0     <NA>
1     <NA>
2     <NA>
3     <NA>
4     <NA>
5     <NA>
6     <NA>
7     <NA>
8     <NA>
9     <NA>
10    <NA>
11    <NA>
12    <NA>
13    <NA>
14    <NA>
dtype: string
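
Tying this back to the question: such a Series can be concatenated onto the categories and filled without a loop. A sketch under that assumption (the fillna step is not shown in the answer above):

import numpy as np
import pandas as pd

categories = pd.Series(np.random.choice(['web', 'software', 'hardware', 'biotech'], size=15))
unknowns = pd.Series(index=range(5), dtype='string')            # 5 missing (<NA>) values
example = pd.concat([categories, unknowns], ignore_index=True)  # strings plus <NA>

print(example.fillna('unknown').value_counts())                 # 'unknown' appears 5 times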