I have a dataset like this, in a .txt file:
للہ عمران خان کو ہماری بھی عمر لگائے ہم جیسے تو اس ملک میں اور بھی پچیس کروڑ ہیں مگر خان آپ جیسا تو یہاں دوسرا نہیں ۔۔۔ اللہ آپکی حفاظت فرمائے آمین
[Real,politics,sarcasm ,rise moral]
How can I convert this into a data frame with two columns, the English text in column one and the Urdu text in column two?
Thanks!
CodePudding user response:
Assuming you have multiple text files, each holding data like this (a line of Urdu, then the English in brackets):
So start with a function that reads a single file of that type:
def read_single_file(filename: str) -> tuple[str, str]:
    urdu = ""
    english = ""
    # Be explicit about the encoding; the platform default may not handle Urdu text
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()  # remove newlines etc.
            if not line:  # ignore empty lines
                continue
            if line.startswith("["):
                english = line.strip("[]")
            else:
                urdu = line
    return (urdu, english)
Then, loop over your files; I'll assume they're just *.txt:
import glob
results = [read_single_file(filename) for filename in glob.glob("*.txt")]
Now that you have a list of 2-tuples, you can just create a dataframe out of it:
import pandas as pd
df = pd.DataFrame(results, columns=["urdu", "english"])
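If it turns out that a single .txt file holds many such records rather than one per file, the same idea can be extended. The following is only a sketch under that assumption: `read_pairs` is a hypothetical name, it treats any line that starts with "[" and ends with "]" as the English labels, and it joins all Urdu lines seen since the previous bracketed line into one string.

```python
def read_pairs(filename: str) -> list[tuple[str, str]]:
    """Return (english, urdu) pairs from a file holding many records.

    Assumed layout: one or more Urdu lines, then one bracketed line
    such as [Real,politics], repeated any number of times.
    """
    pairs = []
    urdu_lines = []  # Urdu lines collected since the last bracketed line
    with open(filename, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # ignore empty lines
                continue
            if line.startswith("[") and line.endswith("]"):
                english = line[1:-1].strip()  # drop the brackets
                pairs.append((english, " ".join(urdu_lines)))
                urdu_lines = []  # start collecting the next record
            else:
                urdu_lines.append(line)
    return pairs
```

Passing the result to pd.DataFrame(pairs, columns=["english", "urdu"]) should then give the English text in column one and the Urdu text in column two, which is the order the question asks for.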