Regex in Python - Extracting message from Tweet Data in a CSV file

I have a CSV file named tweets.csv.

From this file I want to extract just the message text. The file has 9 column headers:

id,link,content,date,retweets,favorites,mentions,hashtags,geo

An example of one tweet:

1698308935,https://twitter.com/realDonaldTrump/status/1698308935,Be sure to tune in and watch Donald Trump on Late Night with David Letterman as he presents the Top Ten List tonight!,2009-05-04 20:54:25,500,868,,,

I want to extract the content field of each tweet in the file, but I am struggling to match it with a regex. I have tried the code below:

f1 = open("tweets.csv","r")
for line in f1:
    id = re.findall(r"status/d",line)
    print(id.group())

I also tried the following, without much success. I am new to this and would appreciate any assistance.

pattern = re.findall(r"status/d","tweets.csv")
#print(pattern())

CodePudding user response：

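Try the regex pattern ^\d*,[\w:/.]*,([^,]*). Its capture group matches every character after the second comma up to, but not including, the next comma, which is the content field. Note that a tweet whose text itself contains a comma will be cut off at that comma.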
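
As a minimal sketch of how that pattern could be applied line by line, the snippet below runs it against the example tweet from the question. The helper name extract_content is hypothetical, and the sketch assumes every data line starts with a numeric id followed by a link.

```python
import re

# Pattern from the answer: id, link, then the content field captured up to the next comma.
PATTERN = re.compile(r"^\d*,[\w:/.]*,([^,]*)")

def extract_content(line):
    """Return the third field of a tweet line, or None if the line does not match."""
    match = PATTERN.match(line)
    return match.group(1) if match else None

# The example tweet from the question.
sample = (
    "1698308935,https://twitter.com/realDonaldTrump/status/1698308935,"
    "Be sure to tune in and watch Donald Trump on Late Night with David Letterman "
    "as he presents the Top Ten List tonight!,2009-05-04 20:54:25,500,868,,,"
)
print(extract_content(sample))
```

The header line does not match, because the pattern expects a comma right after the leading digits, so extract_content returns None for it and the caller can skip it.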

Another solution, and the easiest to use, is the Pandas library.

import pandas as pd

df = pd.read_csv('tweets.csv')
print(df['content'])

CodePudding user response：

For a CSV file you do not need the Pandas package; the built-in csv module is enough:

import csv

with open("tweets.csv", "r", newline="") as f:
    reader = csv.reader(f, delimiter=",")
    with open("tweets_contents.txt", "w") as out_file:
        for row in reader:
            # row[2] is the third column (content)
            out_file.write(row[2] + "\n")

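Here, row[2] is the third column in your CSV file, and each value is written to tweets_contents.txt on its own line. Because the file starts with a header row, the first line written is the header name content; call next(reader) before the loop if you want to skip it.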

If your file is UTF-8 encoded, pass the encoding="utf8" argument to open.
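
A variation on the same idea, offered only as a sketch: csv.DictReader reads the header row itself, so the column can be addressed by name rather than by position. The helper read_contents is hypothetical, and the in-memory sample (with a made-up row whose text contains a quoted comma) stands in for tweets.csv so the snippet runs on its own.

```python
import csv
import io

def read_contents(f):
    """Yield the 'content' column from an open CSV file that has a header row."""
    for row in csv.DictReader(f):
        yield row["content"]

# In-memory stand-in for tweets.csv; the data row is invented for illustration.
sample = io.StringIO(
    "id,link,content,date,retweets,favorites,mentions,hashtags,geo\n"
    '1,https://twitter.com/example/status/1,"Hello, world",2009-05-04 20:54:25,0,0,,,\n'
)
contents = list(read_contents(sample))
print(contents)

# With the real file, the call would look like:
# with open("tweets.csv", "r", encoding="utf8", newline="") as f:
#     contents = list(read_contents(f))
```

Unlike the regex approach, the csv module handles quoted fields, so a comma inside the tweet text does not split the content.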