Code to Extract Data Based on Criteria in CSV-CodePudding

I have a comma delimited text file containing 2M records with multiple columns. I would like a way to extract only the columns and rows I need based on values in other columns.

Criteria:
ODBIC = YES
NAM = 1 OR 2

Keep columns:
ACC | NUM | NAM | ODBIC (and remove all the rest)

Sample data below:

INDEX,ACC,NUM,SUBSCRIBED,PN,PDP,NAM,ODBIC
1,37412900,1221222121,0,-1,-1,1,YES
1,37412911,2323232323,0,-1,-1,2,YES
1,374123434,3434343434,0,-1,-1,343,1
1,374129232,-1,0,-1,-1,434,YES

End result:

ACC	NUM	NAM	ODBIC
37412900	1221222121	1	YES
37412911	2323232323	2	YES

As it's 2M records doing in Excel is tedious and time consuming. I came across recommendations to do in Python but not sure how to write the code.

CodePudding user response：

Install pandas module

pip install pandas

CODE

import pandas as pd

file_path = "path_to_csv_file"
data = pd.read_csv(file_path)

data = data[(data["ODBIC"] == "YES") & ((data["NAM"] == 1) | (data["NAM"] == 2))]
data = data[["ACC", "NUM", "NAM", "ODBIC"]]

data.to_csv("result.csv")
print(data)

OUTPUT

        ACC         NUM  NAM ODBIC
0  37412900  1221222121    1   YES
1  37412911  2323232323    2   YES

The results will be saved to result.csv file

CodePudding user response：

Just adding here for completeness - there is another option to use .query() which is sometimes more readable:

import pandas as pd

file_path = "path_to_csv_file"
data = pd.read_csv(file_path)

data = data.query('ODBIC == "YES" and (NAM == 1 or NAM == 2)')
data = data[["ACC", "NUM", "NAM", "ODBIC"]]

data.to_csv("result.csv")
print(data)