This is easy to do in R, and I am wondering whether it is straightforward in Python and I am just missing something: how do you create a vector of NaN or null values in Python? I am trying to do this with the np.full function.
R Code:
vec <- vector("character", 15)
vec[1:15] <- NA
vec
Python Code:
unknowns = np.full(shape = 5, fill_value = ???, dtype = 'str')
# test whether the fill value worked
np.random.seed(1177)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]
print(example['transformed'].value_counts())
This should lead to 5 counts of unknown in the value counts total. Ideally I would like to know how to write this fill_value for NaN and Null and know whether it differs for variable types. I have tried np.nan with and without the string data type. I have tried None and Null with and without quotes. I cannot think of anything else to try and I am starting to wonder if it is possible. Thank you in advance and I apologize if this question is already addressed and for my lack of knowledge in this area.
CodePudding user response:
You could use either None or np.nan to create an array of just missing values in Python, like so:
np.full(shape=5, fill_value=None)
np.full(shape=5, fill_value=np.nan)
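The two calls above do not produce the same kind of array, which bears on the question of whether the fill value differs by variable type. A minimal sketch to inspect the resulting dtypes (the variable names none_vec and nan_vec are just illustrative):

```python
import numpy as np

# None cannot be stored in a numeric array, so NumPy falls back to object dtype
none_vec = np.full(shape=5, fill_value=None)

# np.nan is a float, so the result is a float64 array
nan_vec = np.full(shape=5, fill_value=np.nan)

print(none_vec.dtype)  # object
print(nan_vec.dtype)   # float64
```

An object array can later be concatenated with strings without losing the missing markers, which is why None is the convenient choice for the example in the question.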
Back to your example, this works just fine:
import numpy as np
import pandas as pd
unknowns = np.full(shape=5, fill_value=None)
categories = np.random.choice(['web', 'software', 'hardware', 'biotech'], size = 15, replace = True)
categories = np.concatenate([categories, unknowns])
example = pd.DataFrame(data = {'categories': categories})
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]
print(example['transformed'].value_counts())
Lastly, this line is inefficient.
example['transformed'] = [ x if pd.isna(x) == False else 'unknown' for x in example['categories']]
You want to avoid loops and list comprehensions when using pandas. On large data, a vectorized method such as fillna is going to run much faster:
example['transformed'] = example['categories'].fillna('unknown')
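To check the replacement end to end, here is a small self-contained sketch that builds the same data and fills the missing entries with Series.fillna. The seeded generator (np.random.default_rng) is my own assumption for reproducibility and is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded Generator, used only to make the sample repeatable
rng = np.random.default_rng(1177)

unknowns = np.full(shape=5, fill_value=None)  # object array of None
categories = rng.choice(['web', 'software', 'hardware', 'biotech'], size=15, replace=True)
categories = np.concatenate([categories, unknowns])  # promoted to object dtype

example = pd.DataFrame({'categories': categories})
# fillna replaces None/NaN in one vectorized call, with no Python-level loop
example['transformed'] = example['categories'].fillna('unknown')

counts = example['transformed'].value_counts()
print(counts)
```

Unlike a truthiness test such as `s if s else 'unknown'`, fillna only touches values that pandas considers missing, so an empty string would be left alone.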
CodePudding user response:
There is a typing problem here.
If you're working in numpy, arrays have a fixed dtype once they are created. Assigning np.nan to an array initialized with strings coerces the value to a string, truncated to the array's string width:
import numpy as np
v1 = np.array(['a', 'b', 'c'])
v1[0] = np.nan
# v1 = array(['n', 'b', 'c'], dtype='<U1')
v2 = np.array(['ab', 'cd', 'ef'])
v2[0] = np.nan
# v2 = array(['na', 'cd', 'ef'], dtype='<U2')
v3 = np.array(['abc', 'def', 'ghi'])
v3[0] = np.nan
# v3 = array(['nan', 'def', 'ghi'], dtype='<U3')
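One way around the coercion shown above, as a sketch rather than the only option, is to create the array with dtype=object so that np.nan is stored as a float instead of being converted to text:

```python
import numpy as np
import pandas as pd

# With object dtype, each element keeps its own Python type
v = np.array(['abc', 'def', 'ghi'], dtype=object)
v[0] = np.nan

# pd.isna recognizes the stored NaN as missing
mask = pd.isna(v)
print(mask)
```

The trade-off is that object arrays give up NumPy's compact fixed-width string storage.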
However, if you're working with pandas, as in the second half of the question, there's a separate way of handling missing data:
import pandas as pd
df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})
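Continuing from that DataFrame, a brief sketch of how the pd.NA entry can be detected and replaced; the 'unknown' label is taken from the question, and the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({"x": [pd.NA, "Hello", "World"]})

# isna flags pd.NA (as well as None and NaN) as missing
missing = df["x"].isna()

# fillna swaps the missing entries for a placeholder label
filled = df["x"].fillna("unknown")
print(filled.tolist())
```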
CodePudding user response:
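A simple way to create a Series of missing values in pandas (the dtype is passed explicitly because the default dtype for a Series created without data changed from float64 to object in pandas 2.0):
s = pd.Series(index=range(15), dtype='float64')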
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
dtype: float64
Or, with a string dtype:
s = pd.Series(index=range(15), dtype='string')
Output:
0 <NA>
1 <NA>
2 <NA>
3 <NA>
4 <NA>
5 <NA>
6 <NA>
7 <NA>
8 <NA>
9 <NA>
10 <NA>
11 <NA>
12 <NA>
13 <NA>
14 <NA>
dtype: string
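To tie this back to the question, a string-dtype Series of missing values can be appended to the known categories entirely in pandas. This is a sketch under my own assumptions: the seeded generator and the use of pd.concat are not from the answer above.

```python
import numpy as np
import pandas as pd

# Assumption: a seeded Generator, used only to make the sample repeatable
rng = np.random.default_rng(1177)

known = pd.Series(
    rng.choice(['web', 'software', 'hardware', 'biotech'], size=15),
    dtype='string',
)
# Five missing entries, stored as <NA> under the string dtype
unknowns = pd.Series([pd.NA] * 5, dtype='string')

categories = pd.concat([known, unknowns], ignore_index=True)
counts = categories.fillna('unknown').value_counts()
print(counts)
```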