How to tokenize dataframe into word tokens?


I have imported the document using pandas, and the output can be seen below. It is basically sentences, one per row, read in as CSV.

import pandas as pd

df = pd.read_csv("Document.csv", header=None)
df.columns = ["Document"]
df

Output:

                 Document
0   Hello my name is Jhon
1   I live in USA
2   I work at apple
3   I like fruits

Now I want to tokenize the full document into word tokens such as

result=['Hello','my','name','is','Jhon','I','live','in'.......]

How can I do that?

CodePudding user response:

result = [token for line in df["Document"] for token in line.split(" ")]

But something looks fishy: why is the file a csv if it’s not a table? What happens if there is a comma in the file?
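A minimal, self-contained sketch of this approach. The DataFrame is built inline here (an assumption, mirroring the question's data) instead of being read from `Document.csv`:

```python
import pandas as pd

# Same single-column DataFrame as in the question, constructed inline.
df = pd.DataFrame({"Document": [
    "Hello my name is Jhon",
    "I live in USA",
    "I work at apple",
    "I like fruits",
]})

# Iterate the column itself (not df.values, which is a 2-D ndarray),
# splitting each sentence on whitespace.
result = [token for line in df["Document"] for token in line.split()]
print(result)
# → ['Hello', 'my', 'name', 'is', 'Jhon', 'I', 'live', 'in', 'USA',
#    'I', 'work', 'at', 'apple', 'I', 'like', 'fruits']
```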

CodePudding user response:

You can use str.split and itertools.chain:

from itertools import chain
out = list(chain.from_iterable(df['Document'].str.split()))

Or using str.split with expand=True combined with stack:

out = df['Document'].str.split(expand=True).stack().to_list()

Output:

['Hello', 'my', 'name', 'is', 'Jhon', 'I', 'live', 'in', 'USA', 'I', 'work', 'at', 'apple', 'I', 'like', 'fruits']
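For completeness, a runnable sketch showing that both variants produce the same token list. The DataFrame is constructed inline (an assumption standing in for the question's `Document.csv`):

```python
from itertools import chain

import pandas as pd

df = pd.DataFrame({"Document": [
    "Hello my name is Jhon",
    "I live in USA",
    "I work at apple",
    "I like fruits",
]})

# Variant 1: str.split turns each row into a list of tokens;
# chain.from_iterable flattens the lists into one sequence.
out1 = list(chain.from_iterable(df["Document"].str.split()))

# Variant 2: expand=True puts each token in its own column (padding
# short rows with NaN); stack() drops the NaN cells and flattens the
# result into a single Series.
out2 = df["Document"].str.split(expand=True).stack().to_list()

assert out1 == out2
print(out1)
```

The `expand=True` route builds an intermediate DataFrame as wide as the longest sentence, so the `chain.from_iterable` variant tends to be the lighter choice for long documents.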