Home > Net >  How to delete an entire line from a data set if the string from a specific column contains a certain
How to delete an entire line from a data set if the string from a specific column contains a certain

Time:01-01

I'm trying to read the text from a column called location in a CSV file, and then retrieve all the lines that do not contain the specific word(s) in that column, but the column that I want to read has the city name as well as the country name.

Here are the column names and example lines of data from the CSV file:

company // level // title              // location
Amazon  // L5    // Solution Architect // Sydney, NS, Australia
IBM     // Band  // Analyst            // Chicago, IL
Oracle  // M3    // Architect          // San Francisco, CA
Ubisoft // L4    // Manager            // Tokyo, TY, Japan

Here is the output I am looking for:

company // level // title              // location
IBM     // Band  // Analyst            // Chicago, IL
Oracle  // M3    // Architect          // San Francisco, CA

I have something like this so far:

dataset = pd.read_csv("raw_data.csv")  
other_countries = ['Australia', 'Brazil', 'Japan']

After this point, I feel very lost about how to go about reading the specific column, searching for the particular words in that column, and then returning the lines of data that I am interested in.

CodePudding user response:

I personally like using boolean masks for stuff like this.

sort_out = ['Australia', 'Japan', 'Brazil']
col = df['location']
bool_mask = [not any([y in sort_out for y in x.replace(' ','').split(',')]) for x in col]
df = df[bool_mask]

A boolean mask is simply a list of boolean values. In this case representing a row in a dataframe. When applied if the value is True the row is displayed. If it's False it's hidden. You can use this to perform a check on each row to see if it matches any number of conditions. I can explain the process further if you need.

CodePudding user response:

Try this code , it doesn't require any data frame manipulation module

other_countries = ['Australia', 'Brazil', 'Japan']

file = open("test.csv" , "r")

file = file.read().splitlines()

file.pop(0)

new_list = []

for line in file :
    location = line.split("//")[3]
    t = 0
    flag = 0
    while t < len(other_countries) :
        if other_countries[t] in location :
            flag = flag   1
        t= t   1

    if flag == 0 :
        new_list.append(line)

new_file = open("newcsvfile" , "w")
for item in new_list :
    new_file.write(item   "\n")

new_file.close()
  • Related