I have a comma delimited text file containing 2M records with multiple columns. I would like a way to extract only the columns and rows I need based on values in other columns.
Criteria:
ODBIC = YES
NAM = 1 OR 2
Keep columns:
ACC | NUM | NAM | ODBIC (and remove all the rest)
Sample data below:
INDEX,ACC,NUM,SUBSCRIBED,PN,PDP,NAM,ODBIC
1,37412900,1221222121,0,-1,-1,1,YES
1,37412911,2323232323,0,-1,-1,2,YES
1,374123434,3434343434,0,-1,-1,343,1
1,374129232,-1,0,-1,-1,434,YES
End result:
ACC | NUM | NAM | ODBIC |
---|---|---|---|
37412900 | 1221222121 | 1 | YES |
37412911 | 2323232323 | 2 | YES |
As it's 2M records doing in Excel is tedious and time consuming. I came across recommendations to do in Python but not sure how to write the code.
CodePudding user response:
Install pandas
module
pip install pandas
CODE
import pandas as pd
file_path = "path_to_csv_file"
data = pd.read_csv(file_path)
data = data[(data["ODBIC"] == "YES") & ((data["NAM"] == 1) | (data["NAM"] == 2))]
data = data[["ACC", "NUM", "NAM", "ODBIC"]]
data.to_csv("result.csv")
print(data)
OUTPUT
ACC NUM NAM ODBIC
0 37412900 1221222121 1 YES
1 37412911 2323232323 2 YES
The results will be saved to result.csv
file
CodePudding user response:
Just adding here for completeness - there is another option to use .query()
which is sometimes more readable:
import pandas as pd
file_path = "path_to_csv_file"
data = pd.read_csv(file_path)
data = data.query('ODBIC == "YES" and (NAM == 1 or NAM == 2)')
data = data[["ACC", "NUM", "NAM", "ODBIC"]]
data.to_csv("result.csv")
print(data)