Limiting the occurrence of a number in a CSV file with Python

I have a CSV file with 3408 rows and 46 columns, and I want to fill each column randomly with 0s and 1s while limiting how often 1 appears. For example, column A has 3408 records, but 1 should appear in only 15% of the rows. I need to be able to set the percentage of 1s for each column myself.

So far, I have created a CSV file with 3408 rows and 46 columns filled randomly with 0s and 1s, but without controlling the percentage.

Any help or suggestions would be great!

SPEEDING = 15.49%

Driver inattention or decreased alertness in neighborhoods = 13.81%

Pedestrian carelessness when crossing the road = 6.73%

import numpy as np
import pandas as pd

# Create a 2D NumPy array of 3408 rows and 46 columns,
# filled with random values 0 and 1
random_data = np.random.randint(0, 2, size=(3408, 46))
# Create a Dataframe with random values
# using 2D numpy Array
df = pd.DataFrame(random_data, columns=['SPEEDING', 'Driver inattention or decreased alertness in neighborhoods',
                                        'Pedestrian carelessness when crossing the road' , 'Unsafe overtaking', 'Loss of vehicle control' ,
                                        'Refusal of priority' , 'Failure to maintain a safe distance' , 'Non-use of crosswalks' ,
                                        'Playing on the road or walking on the side of the road' ,
                                        'Dangerous maneuvers' , 'Drowsiness' , 'Driving without a license' ,
                                        'Driver inattention when passing a motorcycle' , 'Non respect of the direction imposed to the traffic' ,
                                        'Lane change without signalling' , 'Driving under the influence of alcohol or drugs' , 'Non respect of the road signs' ,
                                        'Driver inattention when leaving the parking area' , 'Driver carelessness when reversing' , 'Traffic in the wrong direction' ,
                                        'Non-respect of the stop sign' , 'Unsafe parking or stopping' , 'Dazzle from lights' ,
                                        'Manual use of mobile phone/ Wearing a headset' , 'Pedestrian crossing the railroad track without precaution' ,
                                        'Other Human Factors' , 'Defective tires (burst)' , 'Defective brakes' , 'Mechanical failures' , 'Defective steering system' ,
                                        'Lack of lighting device' , 'Non-regulation lighting device' , 'Overload' , 'Other Vehicle condition Factors' , 'Weather' ,
                                        'Defective road' , 'Animal crossing' , 'Lack of public lighting' , 'Slippery road surface' , 'Bad road design' , 'Potholes' ,
                                        'Glare of the sun' , 'Obstacle on the road' , 'Deformed roadway' ,
                                        'Other  State of the road infrastructure and atmospheric conditions' , 'Fatality'])
# Display the Dataframe
print(df)

# Save the DataFrame to a CSV file
# (index=False omits the row index, so the file keeps exactly 46 columns)
df.to_csv('test.csv', index=False)

CodePudding user response：

You could set everything to 0 and use random.sample from the standard library to pick the row indexes that should be set to 1.

So to make 15% of the 3408 rows 1s (511.2, rounded to 511), you could get the list of indexes with:

from random import sample

# range(3408) covers every row index from 0 to 3407
sample(range(3408), 511)

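
As a minimal sketch of how this idea could extend to several columns with different percentages: the `percentages` mapping and the `build_column` helper below are hypothetical names introduced for illustration, and only the three shares quoted in the question are filled in.

```python
import random

N_ROWS = 3408

# Share of rows that should hold a 1, per column (values from the question;
# the dict itself is an illustrative assumption).
percentages = {
    "SPEEDING": 0.1549,
    "Driver inattention or decreased alertness in neighborhoods": 0.1381,
    "Pedestrian carelessness when crossing the road": 0.0673,
}


def build_column(n_rows, share):
    """Return n_rows zeros with round(share * n_rows) ones at random positions."""
    n_ones = round(share * n_rows)
    column = [0] * n_rows
    # sample() draws distinct indexes, so exactly n_ones rows are set to 1.
    for idx in random.sample(range(n_rows), n_ones):
        column[idx] = 1
    return column


columns = {name: build_column(N_ROWS, share) for name, share in percentages.items()}

for name, column in columns.items():
    print(name, sum(column))
```

Each column ends up with an exact number of 1s (the rounded share of 3408), rather than a number that is only correct on average.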

CodePudding user response：

If I understand correctly, try something like this:

1. How I've generated my df:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros(shape=(100, 1)), columns=['speeding'])

2. Then:

df['speeding'] = df['speeding'].apply(lambda x: np.random.choice([0,1], p=[0.85,0.15]))

3. Check:

df.value_counts()

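4. Result (from one run; each row is drawn independently with probability 0.15, so the counts vary between runs and only average 85/15):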

speeding
0           85
1           15
dtype: int64
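
Because every row is drawn independently, this approach matches the target share only on average. The following sketch contrasts it with a fixed-count column. It assumes NumPy's `default_rng` generator, and the seed and variable names are illustrative choices, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility
n_rows, p = 3408, 0.15

# Independent draws: each row is 1 with probability p, so the count varies.
approx = (rng.random(n_rows) < p).astype(np.int8)

# Exact count: place round(p * n_rows) ones, then shuffle their positions.
n_ones = round(p * n_rows)
exact = np.zeros(n_rows, dtype=np.int8)
exact[:n_ones] = 1
rng.shuffle(exact)

print("independent draws:", approx.sum())
print("exact count:      ", exact.sum())
```

If the percentage has to be met exactly in every column, the second variant is the safer choice; the first is fine when a value close to the target is acceptable.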

CodePudding user response：

import numpy as np
import sys


def get_percentages_of_ones(row_count):
    """ Create dummy percentage array. """
    percentage_of_ones = np.zeros(row_count)
    percentage_of_ones[0] = 0.15
    percentage_of_ones[1] = 0.50
    percentage_of_ones[2] = 1.0
    return percentage_of_ones


def create_array(percentage_of_ones, row_count, col_count):
    """ Build a (row_count, col_count) array; row i holds the given share of ones. """
    arr = np.empty([row_count, col_count], dtype="int8")
    for row_id, po1 in enumerate(percentage_of_ones):
        nb_ones = int(round(po1 * col_count))
        nb_zeros = col_count - nb_ones
        row = np.append(
            np.zeros(nb_zeros, dtype="int8"),
            np.ones(nb_ones, dtype="int8")
        )
        np.random.shuffle(row)
        arr[row_id] = row
    return arr


def display_array(arr, col_count):
    """ Print the full array, transposed so that each variable is a column. """
    np.set_printoptions(threshold=sys.maxsize,
                        edgeitems=col_count,
                        linewidth=95)
    print(np.transpose(arr))


def save_to_csv(fname, data):
    """ Write the array to a CSV file as integers. """
    np.savetxt(fname, data, delimiter=",", fmt="%d")


def main():
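    """ Generate the data, display it, and save it as 3408 rows x 46 columns. """
    # The array is built with one row per variable (46) and one column per
    # record (3408), then transposed to match the layout asked for.
    row_count = 46
    col_count = 3408
    percentage_of_ones = get_percentages_of_ones(row_count)
    arr = create_array(percentage_of_ones, row_count, col_count)
    display_array(arr, col_count)
    save_to_csv("test.csv", np.transpose(arr))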


if __name__ == "__main__":
    main()
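
A more compact variant of the same idea, offered as a sketch rather than a replacement: it assumes NumPy 1.20 or later for `Generator.permuted`, and the `shares` array is a hypothetical placeholder in which only the first three values come from the question.

```python
import numpy as np

rng = np.random.default_rng()
n_rows, n_cols = 3408, 46

# Hypothetical per-column shares; only the first three come from the question,
# the remaining columns are left at 0 here as placeholders.
shares = np.zeros(n_cols)
shares[:3] = [0.1549, 0.1381, 0.0673]

# Number of ones wanted in each column.
n_ones = np.rint(shares * n_rows).astype(int)

# Column j starts with its n_ones[j] ones stacked at the top...
data = (np.arange(n_rows)[:, None] < n_ones[None, :]).astype(np.int8)
# ...then every column is shuffled independently along the row axis.
data = rng.permuted(data, axis=0)

print(data.shape)
print(data.sum(axis=0)[:3])
```

The result is already laid out as 3408 rows by 46 columns, so it can be passed directly to `np.savetxt(fname, data, delimiter=",", fmt="%d")` as in `save_to_csv` above.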