My CSV file has the following columns: AFM_reversal_indicator,Alert_Message,axiom_key,_timediff,player,__mv_Splunk_Alert_Id,__mv_nbr_plastic,__mv_code_acct_stat_demo.
I want to remove the columns starting with "__mv". I have seen posts where pandas is used to filter out columns. Is it possible to do this with Python's csv module, and if so, how? Also, with pandas, what regex should I pass to:
df.filter(regex='')
df.to_csv(output_file_path)
P.S. I am using Python 3.8.
CodePudding user response:
You don't need to use .filter for that. You can simply find out which columns those are and then drop them from the DataFrame:
import pandas as pd
# Load the dataframe (In our case create a dummy one with your columns)
df = pd.DataFrame(columns=["AFM_reversal_indicator", "Alert_Message", "axiom_key", "_timediff", "player", "__mv_Splunk_Alert_Id", "__mv_nbr_plastic", "__mv_code_acct_stat_demo"])
# Get a list of all column names starting with "__mv"
mv_columns = [col for col in df.columns if col.startswith("__mv")]
# Drop the columns
df = df.drop(columns=mv_columns)
# Save the updated dataframe to a CSV file
df.to_csv("cleaned_data.csv", index=False)
The mv_columns list comprehension iterates through the columns of your DataFrame and picks those that start with "__mv". The .drop call then removes them from the DataFrame.
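If you do want the df.filter(regex=...) route from your question, a regex with a negative lookahead keeps every column that does not start with "__mv". A minimal sketch, reusing the df from above (the "cleaned_data.csv" name is just an example):
# .filter keeps the columns whose names match the regex,
# so a negative lookahead selects everything NOT starting with "__mv"
df_clean = df.filter(regex=r"^(?!__mv)")
df_clean.to_csv("cleaned_data.csv", index=False)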
If for some reason you want to use only the csv package, the solution is not as elegant as with pandas, but here is a suggestion:
import csv
with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
reader = csv.reader(input_file)
writer = csv.writer(output_file)
header_row = next(reader)
mv_columns = [col for col in header_row if col.startswith("__mv")]
mv_column_indices = [header_row.index(col) for col in mv_columns]
new_header_row = [col for col in header_row if col not in mv_columns]
writer.writerow(new_header_row)
for row in reader:
new_row = [row[i] for i in range(len(row)) if i not in mv_column_indices]
writer.writerow(new_row)
So, first, you read the first row, which is supposed to be your header. With the same logic as before, you find the columns that start with "__mv" and get their indices. You write the new header to your output file, keeping only the columns that are not "__mv" columns. Then you iterate through the rest of the CSV and drop those columns from each row as you go.
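If you prefer to work with column names instead of indices, csv.DictReader and csv.DictWriter also work. A minimal sketch under the same assumed file names as above:
import csv
with open("original_data.csv", "r") as input_file, open("cleaned_data.csv", "w", newline="") as output_file:
    reader = csv.DictReader(input_file)
    # Keep only the field names that do not start with "__mv"
    kept_fields = [name for name in reader.fieldnames if not name.startswith("__mv")]
    # extrasaction="ignore" makes DictWriter silently drop the "__mv" keys
    writer = csv.DictWriter(output_file, fieldnames=kept_fields, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)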
CodePudding user response:
You mean with standard Python? You can use a list comprehension, e.g.
import csv
with open( 'data.csv', 'r' ) as f:
    DataGenerator = csv.reader( f )
    Header = next( DataGenerator )
    Header = [ Col.strip() for Col in Header ]
    Data = list( DataGenerator )

# Drop a trailing empty row, if the file ends with a blank line
if Data and Data[-1] == []: del( Data[-1] )

# Keep only the cells whose header does not start with "__mv"
Data = [ [ Row[i] for i in range( len( Header ) ) if not Header[i].startswith( "__mv" ) ] for Row in Data ]
Header = [ Col for Col in Header if not Col.startswith( "__mv" ) ]
However, this is just a simple example. You'll probably have further things to consider, e.g. what types your csv columns have, whether you want to read all the data at once as I do here or row-by-row from the generator to save memory, etc.
You could also use the builtin filter function instead of the inner list comprehension; see the sketch below.
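A rough sketch of that variant, reusing the Header and Data names from the code above (the KeepIndex name is just for illustration):
# Same result as the comprehensions above, but with the builtin filter()
KeepIndex = list( filter( lambda i: not Header[i].startswith( "__mv" ), range( len( Header ) ) ) )
Data = [ [ Row[i] for i in KeepIndex ] for Row in Data ]
Header = list( filter( lambda Col: not Col.startswith( "__mv" ), Header ) )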
Also, if you have numpy installed and want something more 'numerical', you can always use "structured numpy arrays" (https://numpy.org/doc/stable/user/basics.rec.html). They're quite nice. (Personally I prefer them to pandas anyway.) Numpy also has its own csv-reading functions (see: https://www.geeksforgeeks.org/how-to-read-csv-files-with-numpy/).
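A rough sketch of that route, assuming np.genfromtxt is used for the reading and the same 'data.csv' file name as above (note that genfromtxt may sanitize some header characters):
import numpy as np

# names=True turns the first CSV row into the field names of a structured array
Arr = np.genfromtxt( 'data.csv', delimiter=',', names=True, dtype=None, encoding='utf-8' )

# Multi-field indexing: keep only the fields that do not start with "__mv"
Kept = [ Name for Name in Arr.dtype.names if not Name.startswith( "__mv" ) ]
ArrClean = Arr[ Kept ]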