I have imported the document using pandas, and the output is shown below. It is basically sentences separated by commas, since the file is in CSV format.
import pandas as pd
df = pd.read_csv("Document.csv", header=None)
df.columns = ["Document"]
df
Output:
Document
0 Hello my name is Jhon
1 I live in USA
2 I work at apple
3 I like fruits
Now I want to tokenize the full document into word tokens such as
result=['Hello','my','name','is','Jhon','I','live','in'.......]
How can I do that?
CodePudding user response:
result = [token for line in df["Document"] for token in line.split()]
But something looks fishy: why is the file a CSV if it’s not a table? What happens if there is a comma in the file?
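If the file is really just one sentence per line, one option is to skip the CSV parser altogether, so that a comma inside a sentence is never treated as a field separator. Below is a minimal sketch of that idea. The helper name tokenize_lines and the sample text are made up for illustration, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io


def tokenize_lines(lines):
    """Split every line on whitespace and flatten the result into one list."""
    return [token for line in lines for token in line.split()]


# Hypothetical sample data standing in for open("Document.csv").
# The first sentence contains a comma on purpose.
sample = io.StringIO("Hello, my name is Jhon\nI live in USA\n")

result = tokenize_lines(sample)
print(result)
```

With a real file you would pass the open file object instead, e.g. inside a with open("Document.csv") as f: block. Note that punctuation stays attached to the neighbouring word (the first token above is 'Hello,'), so you may want to strip it separately.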
CodePudding user response:
You can use str.split and itertools.chain:
from itertools import chain
out = list(chain.from_iterable(df['Document'].str.split()))
Or using str.split with expand=True combined with stack:
out = df['Document'].str.split(expand=True).stack().to_list()
output:
['Hello', 'my', 'name', 'is', 'Jhon', 'I', 'live', 'in', 'USA', 'I', 'work', 'at', 'apple', 'I', 'like', 'fruits']