Thanks for looking into this.
I am trying to add a list of calculated values (mean) as a new column to an existing csv.
Here is my MWE:
import csv
import re
import pandas as pd
import oseti
import numpy as np
# handle csv data
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
dtype_before = type(df["text"])
text_list = df["text"].tolist()
# create df for sentiment analysis
list_sa = (np.mean(list(map(analyzer.analyze,text_list))).tolist())
df_sa = pd.DataFrame (list_sa, columns = ['sa_mean'])
print (df_sa)
This part works and prints the values correctly (since I am quite new, I wanted to make sure the output looks the way I expect), although I receive a warning:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
The printed results look something like this:
sa_mean
0 0.000000
1 0.000000
2 0.000000
3 -0.018519
4 0.037037
However, instead of printing, I would like to add the result as a new column to the originally loaded csv ('filepath/text.csv'), and I am not sure how to tackle it (is it necessary to make it a DataFrame or a Series?).
I tried this instead of the last print line:
df["new_column"] = df_sa
df.to_csv("text.csv", index=False)
However, I receive a warning (the csv is still created), and I would like to understand whether something is wrong:
Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
I am not really sure why this is happening or how to fix it.
Thank you in advance!
Edit:
print(list_sa) looks like this:
[0.0, 0.0, 0.0, -0.018518518518518517, 0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.012345679012345678, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.024691358024691357]
CodePudding user response:
Use a list comprehension with np.mean and assign the result to a new column; df_sa is not necessary here:
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
df['new_column'] = [np.mean(analyzer.analyze(x)) for x in df['text']]
Or use apply with a lambda function:
df['new_column'] = df['text'].apply(lambda x: np.mean(analyzer.analyze(x)))
df.to_csv("text.csv", index=False)
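For readers without oseti installed, here is a minimal sketch of the same pattern. The `fake_analyze` function is a hypothetical stand-in, not the oseti API; it only assumes that the analyzer returns a list with one score per sentence, which is what the ragged-sequence warning suggests. The CSV is written to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd


def fake_analyze(text):
    # Hypothetical stand-in for analyzer.analyze: one score per sentence,
    # so texts with different sentence counts give lists of different lengths.
    sentences = [s for s in text.split(".") if s.strip()]
    return [1.0 if "good" in s else -1.0 if "bad" in s else 0.0 for s in sentences]


df = pd.DataFrame({"text": ["good. bad. fine.", "good.", "bad. bad."]})

# Take the mean per row, so NumPy never has to build an array from ragged lists.
df["sa_mean"] = [float(np.mean(fake_analyze(t))) for t in df["text"]]

# Write to an in-memory buffer here; a file path works the same way.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Assigning a plain list works because its length matches the number of rows; neither a separate DataFrame nor a Series is needed.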
CodePudding user response:
Is it possible to tell which statement produces the warning? You may have to run the lines one by one, or with prints between them (if running the script).
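One way to locate the statement, sketched here with the standard warnings module, is to promote warnings to exceptions temporarily so that Python prints a full traceback. The `noisy_step` and `locate_warning` functions are hypothetical names used only for illustration; in the real script the wrapped call would be the suspected line.

```python
import traceback
import warnings


def noisy_step():
    # Hypothetical stand-in for a call that emits a warning.
    warnings.warn("ragged input", DeprecationWarning)
    return 1


def locate_warning(func):
    """Run func with warnings raised as exceptions; return traceback text or None."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # every warning becomes an exception
        try:
            func()
        except Warning:
            return traceback.format_exc()
    return None


trace = locate_warning(noisy_step)
print(trace)
```

The filter change is confined to the `with` block, so the rest of the script keeps its normal warning behaviour.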
I suspect it's this call:
np.mean(list(map(analyzer.analyze,text_list)))
The warning means that your code (or something it calls) is trying to make an array from lists that vary in length. For example:
In [245]: alist = [[1,2,3],[4,5],[6]]
In [246]: alist
Out[246]: [[1, 2, 3], [4, 5], [6]]
In [247]: np.array(alist)
<ipython-input-247-7512d762195a>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
np.array(alist)
Out[247]: array([list([1, 2, 3]), list([4, 5]), list([6])], dtype=object)
The result is a 1d array with object dtype; NumPy cannot make a 2d array from such a list.
Attempting to take the mean of that list produces the same warning, since NumPy first has to make an array:
In [248]: np.mean(alist)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:163: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
arr = asanyarray(a)
Out[248]:
array([0.33333333, 0.66666667, 1.        , 1.33333333, 1.66666667,
       2.        ])
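A warning doesn't give a traceback like an error would, but it does show the statement that triggered it. Also, the mean values are wrong: summing the object array concatenated the sublists into one flat list of 6 values, which was then divided by 3 (the number of sublists) instead of averaging each sublist.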
As jezrael suggested, we can get the means for the sublists with:
In [249]: [np.mean(x) for x in alist]
Out[249]: [2.0, 4.5, 6.0]
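To see where the odd values in Out[248] come from, here is a small sketch that reproduces them without building a ragged array. That should let it run on newer NumPy releases too, where, as far as I know, ragged input raises an error instead of a warning. The variable names are illustrative only.

```python
import numpy as np

alist = [[1, 2, 3], [4, 5], [6]]

# Adding Python lists concatenates them, which is what the object-dtype sum did.
concatenated = sum(alist, [])

# Dividing by the number of sublists (3) reproduces the misleading "mean".
reproduced = np.array(concatenated) / len(alist)

# The intended result: one mean per sublist.
per_row = [float(np.mean(x)) for x in alist]

print(reproduced)
print(per_row)
```

The first result has six entries, one per element across all sublists, while the per-row version has three, one per sublist, which is the shape a new DataFrame column needs.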