I'm trying to create a dataframe containing the values from field_1
and field_2
in a single column. I haven't used pandas a whole lot before, so I'm sure this is naive.
# Create a dataset in CSV format
field_names = ["description", "comments"]
writer = csv.writer(open("dataset.csv", "w"), quoting=csv.QUOTE_ALL, delimiter=",")
writer.writerow(field_names)
for instance in Order.objects.all():
writer.writerow([str(getattr(instance, f)) for f in field_names])
# Read CSV
data_frame = pd.read_csv("dataset.csv", index_col=0)
# Combine columns
df2 = data_frame.apply(lambda x: ", ".join(x[x.notnull()]), axis=1)
If I'm working with a fairly large dataset, is there a way I can make this more efficient? I would like to eliminate the step that creates the CSV entirely, if possible. If anybody can point me in the right direction, that'd be fantastic.
CodePudding user response:
You really don't need:
- the order objects and
getattr
; use.values_list()
to get an iterable of 2-tuples (assumingfield_names
are actual fields on the model). - CSV – now that you have an iterable of 2-tuples, pass them to the
DataFrame
constructor along with the respective column names.
field_names = ["description", "comments"]
df = pd.DataFrame.from_records(
Order.objects.all().values_list(field_names),
columns=field_names,
)
- Necessarily even Pandas
from django.db.models import F, Value
from django.db.models.functions import Concat
# ...
my_data = list(
Order.objects.annotate(
x=Concat(
F("description"),
Value(", "),
F("comments"),
)
).values_list("x", flat=True)
)
and you have a list of description, comments
strings, just like the series you'd get with Pandas.