I have a CSV file with 3408 rows and 46 columns, and I want to fill each column randomly with 0s and 1s while limiting how often 1 appears. For example, column A has 3408 records, but 1 should appear in only 15% of the rows. I need to be able to set the percentage of 1s myself for each column.
What I have done so far is create a CSV file with 3408 rows and 46 columns, filled randomly with 0s and 1s but without the percentage constraint.
Any help or suggestion would be great!
SPEEDING = 15.49%
Driver inattention or decreased alertness in neighborhoods = 13.81%
Pedestrian carelessness when crossing the road = 6.73%
...
import numpy as np
import pandas as pd
# Create 2D Numpy array of 3408 rows and 46 columns,
# filled with random values 0 and 1
random_data = np.random.randint(0,2,size=(3408,46))
# Create a Dataframe with random values
# using 2D numpy Array
df = pd.DataFrame(random_data, columns=['SPEEDING', 'Driver inattention or decreased alertness in neighborhoods',
'Pedestrian carelessness when crossing the road' , 'Unsafe overtaking', 'Loss of vehicle control' ,
'Refusal of priority' , 'Failure to maintain a safe distance' , 'Non-use of crosswalks' ,
'Playing on the road or walking on the side of the road' ,
'Dangerous maneuvers' , 'Drowsiness' , 'Driving without a license' ,
'Driver inattention when passing a motorcycle' , 'Non respect of the direction imposed to the traffic' ,
'Lane change without signalling' , 'Driving under the influence of alcohol or drugs' , 'Non respect of the road signs' ,
'Driver inattention when leaving the parking area' , 'Driver carelessness when reversing' , 'Traffic in the wrong direction' ,
'Non-respect of the stop sign' , 'Unsafe parking or stopping' , 'Dazzle from lights' ,
'Manual use of mobile phone/ Wearing a headset' , 'Pedestrian crossing the railroad track without precaution' ,
'Other Human Factors' , 'Defective tires (burst)' , 'Defective brakes' , 'Mechanical failures' , 'Defective steering system' ,
'Lack of lighting device' , 'Non-regulation lighting device' , 'Overload' , 'Other Vehicle condition Factors' , 'Weather' ,
'Defective road' , 'Animal crossing' , 'Lack of public lighting' , 'Slippery road surface' , 'Bad road design' , 'Potholes' ,
'Glare of the sun' , 'Obstacle on the road' , 'Deformed roadway' ,
'Other State of the road infrastructure and atmospheric conditions' , 'Fatality'])
# Display the Dataframe
print(df)
# Save the Dataframe to a csv file
df.to_csv('test.csv')
CodePudding user response:
You could set everything to 0 and use random.sample (see the Python documentation) to get the row indexes that should be set to 1.
So to get 15% of 3408 rows as 1s (511.2, rounded to 511), you could get the list of indexes with:
from random import sample
# range(3408) covers every row index from 0 to 3407
sample(range(3408), 511)
Edit: you can find alternatives under this question
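To turn the sampled indexes into an actual column, something like the following minimal sketch could work. The helper name `column_with_ones` and its `seed` argument are assumptions made for illustration; they are not part of the answer above.

```python
import random


def column_with_ones(n_rows, share, seed=None):
    """Return a list of n_rows zeros with round(n_rows * share) ones at random positions."""
    rng = random.Random(seed)        # local generator, so a seed makes runs repeatable
    n_ones = round(n_rows * share)   # 15% of 3408 is 511.2, rounded to 511
    column = [0] * n_rows
    # sample() returns distinct indexes, so exactly n_ones cells are set to 1
    for idx in rng.sample(range(n_rows), n_ones):
        column[idx] = 1
    return column


speeding = column_with_ones(3408, 0.15, seed=0)
print(sum(speeding), len(speeding))  # prints: 511 3408
```

Calling the helper once per column, each with its own share, would give every column an exact count of 1s.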
CodePudding user response:
IIUC, try something like this:
1. How I've generated my df:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.zeros(shape=(100, 1)), columns=['speeding'])
2. Then:
df['speeding'] = df['speeding'].apply(lambda x: np.random.choice([0,1], p=[0.85,0.15]))
3. Check:
df.value_counts()
4. Result:
speeding
0 85
1 15
dtype: int64
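Note that `np.random.choice` is called independently for each row here, so the share of 1s matches the target only on average; an exact 85/15 split is one possible outcome, not a guarantee. If an approximate share is acceptable, the same idea can be applied to several columns at once. The sketch below is one way to do that, assuming a hypothetical `probs` dictionary with made-up column names and probabilities.

```python
import numpy as np
import pandas as pd

# Hypothetical per-column probabilities of drawing a 1
probs = {"speeding": 0.15, "drowsiness": 0.05}
n_rows = 3408

rng = np.random.default_rng(seed=0)
# Compare uniform draws in [0, 1) against each column's probability;
# broadcasting applies one threshold per column.
draws = rng.random((n_rows, len(probs))) < np.array(list(probs.values()))
df = pd.DataFrame(draws.astype(int), columns=list(probs))

# Observed shares are close to the targets, but usually not exactly equal
print(df.mean())
```

This avoids the per-row `apply` call, which tends to be slow on larger frames.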
CodePudding user response:
import numpy as np
import sys


def get_percentages_of_ones(row_count):
    """ Create dummy percentage array (fraction of ones per row). """
    percentage_of_ones = np.zeros(row_count)
    percentage_of_ones[0] = 0.15
    percentage_of_ones[1] = 0.50
    percentage_of_ones[2] = 1.0
    return percentage_of_ones


def create_array(percentage_of_ones, row_count, col_count):
    """ Build a row_count x col_count array; each row gets its exact share of ones. """
    arr = np.empty([row_count, col_count], dtype="int8")
    for row_id, po1 in enumerate(percentage_of_ones):
        nb_ones = int(round(po1 * col_count))
        nb_zeros = col_count - nb_ones
        # Concatenate the zeros and ones, then shuffle them into random order
        row = np.append(
            np.zeros(nb_zeros, dtype="int8"),
            np.ones(nb_ones, dtype="int8")
        )
        np.random.shuffle(row)
        arr[row_id] = row
    return arr


def display_array(arr, col_count):
    """ Print the transposed array in full, without truncation. """
    np.set_printoptions(threshold=sys.maxsize,
                        edgeitems=col_count,
                        linewidth=95)
    print(np.transpose(arr))


def save_to_csv(fname, data):
    """ Write the data to a comma-separated file as integers. """
    np.savetxt(fname, data, delimiter=",", fmt="%d")


def main():
    """ Generate, display, and save the array. """
    # One array row per factor; the array is transposed for display and saving
    row_count = 46
    col_count = 3408
    percentage_of_ones = get_percentages_of_ones(row_count)
    arr = create_array(percentage_of_ones, row_count, col_count)
    display_array(arr, col_count)
    # Transpose so the file has 3408 rows and 46 columns, as in the question
    save_to_csv("test.csv", np.transpose(arr))


if __name__ == "__main__":
    main()
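The same shuffle idea can also be written against named pandas columns, which may be closer to what the question asks for. The following is only a sketch: the helper `fill_columns` and its `seed` parameter are hypothetical, and the three shares are the percentages listed in the question, written as fractions.

```python
import numpy as np
import pandas as pd


def fill_columns(shares, n_rows, seed=None):
    """Build a DataFrame whose columns hold an exact number of ones in random order."""
    rng = np.random.default_rng(seed)
    data = {}
    for name, share in shares.items():
        n_ones = int(round(share * n_rows))
        col = np.zeros(n_rows, dtype="int8")
        col[:n_ones] = 1   # place the ones at the start ...
        rng.shuffle(col)   # ... then shuffle them into random rows
        data[name] = col
    return pd.DataFrame(data)


# Percentages taken from the question, written as fractions
shares = {
    "SPEEDING": 0.1549,
    "Driver inattention or decreased alertness in neighborhoods": 0.1381,
    "Pedestrian carelessness when crossing the road": 0.0673,
}
df = fill_columns(shares, n_rows=3408, seed=0)
print(df.sum())  # number of ones per column
```

Extending `shares` to all 46 column names would produce the full table, and `df.to_csv('test.csv', index=False)` would then write it without an extra index column.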