Add list to existing csv

Thanks for looking into this.

I am trying to add a list of calculated values (mean) as a new column to an existing csv.

Here is my MWE:

import csv
import re
import pandas as pd
import oseti
import numpy as np

# handle csv data
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
dtype_before = type(df["text"])
text_list = df["text"].tolist()

# create df for sentiment analysis
list_sa = (np.mean(list(map(analyzer.analyze,text_list))).tolist())
df_sa = pd.DataFrame (list_sa, columns = ['sa_mean'])
print (df_sa)

This part works, although I receive a warning:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

It prints out the values correctly (since I am quite new, I wanted to make sure it looks like what I want it to). The printed results look something like this:

    sa_mean
0   0.000000
1   0.000000
2   0.000000
3  -0.018519
4   0.037037

However, if I try to add it as a new column to the originally loaded csv ('filepath/text.csv') instead of printing it, I am not sure how to tackle it (is it necessary to make it a DataFrame or a Series?).

I tried this (instead of the last print line):

df["new_column"] = df_sa
df.to_csv("text.csv", index=False)

However, I receive what looks like an error. The csv is still created, but I would like to understand if there is something wrong:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

I am not really sure why this is happening or how to fix it.

Thank you in advance!

Edit:

print(list_sa) will look like this:

[0.0, 0.0, 0.0, -0.018518518518518517, 0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.012345679012345678, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.024691358024691357]

CodePudding user response:

Use a list comprehension with np.mean and assign the result to a new column; df_sa is not necessary here:

df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()

df['new_column'] = [np.mean(analyzer.analyze(x)) for x in df['text']]

Or use apply with a lambda function:

df['new_column'] = df['text'].apply(lambda x: np.mean(analyzer.analyze(x)))

df.to_csv("text.csv", index=False)
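For anyone who wants to try the pattern above without oseti installed, here is a minimal sketch. It assumes a hypothetical fake_analyze function standing in for analyzer.analyze (returning one score per sentence), and it uses in-memory CSV text instead of 'filepath/text.csv'; neither is part of the original question.

```python
import io

import numpy as np
import pandas as pd


def fake_analyze(text):
    """Hypothetical stand-in for analyzer.analyze: one score per sentence."""
    scores = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if not sentence:
            continue
        if "good" in sentence:
            scores.append(1.0)
        elif "bad" in sentence:
            scores.append(-1.0)
        else:
            scores.append(0.0)
    return scores


# In-memory CSV with a single 'text' column (assumed layout)
csv_text = "text\ngood. bad. fine.\ngood.\nbad. bad.\n"
df = pd.read_csv(io.StringIO(csv_text))

# One mean per row, so the ragged per-row score lists never go into one array
df["sa_mean"] = [np.mean(fake_analyze(t)) for t in df["text"]]

# Write the result back out as CSV (to a buffer here, a path in real use)
out = io.StringIO()
df.to_csv(out, index=False)
csv_out = out.getvalue()
print(csv_out)
```

A plain list of the right length can be assigned to a column directly, so no intermediate DataFrame or Series is needed. One thing to watch for: if the analyzer ever returns an empty list, np.mean gives nan for that row along with a RuntimeWarning.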

CodePudding user response:

Is it possible to tell which statement produces the warning? You may have to run the lines one by one, or with prints between them (if running the script).
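One way to pin down the source, sketched here with the standard warnings module: temporarily turn warnings into exceptions, so the offending call raises and you get a normal traceback. The noisy_step function below is a hypothetical stand-in for whichever call emits the warning in your script.

```python
import warnings


def noisy_step(values):
    """Hypothetical stand-in for a call that emits a warning."""
    warnings.warn("ragged input", DeprecationWarning)
    return len(values)


def find_warning_source():
    """Run the step with warnings escalated to exceptions and report what was raised."""
    with warnings.catch_warnings():
        warnings.simplefilter("error")  # inside this block every warning raises
        try:
            noisy_step([[1, 2, 3], [4, 5]])
        except Warning as exc:
            return type(exc).__name__, str(exc)
    return None


result = find_warning_source()
print(result)
```

Without the try/except, the exception would propagate and the traceback would point at the exact line. Running the whole script as python -W error script.py has a similar effect.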

I suspect it's this expression:

np.mean(list(map(analyzer.analyze, text_list)))

The warning means that your code (or something it calls) is trying to make an array from lists that vary in length. For example:

In [245]: alist = [[1,2,3],[4,5],[6]]
In [246]: alist
Out[246]: [[1, 2, 3], [4, 5], [6]]
In [247]: np.array(alist)
<ipython-input-247-7512d762195a>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.array(alist)
Out[247]: array([list([1, 2, 3]), list([4, 5]), list([6])], dtype=object)

The result is a 1d array with object dtype. NumPy cannot make a 2d array from such a list.
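If a 1d array of lists really is what you want, the warning text itself says how to ask for it. A small sketch, reusing the alist from the session above (the raggedness check is my own addition, not something NumPy provides):

```python
import numpy as np

alist = [[1, 2, 3], [4, 5], [6]]

# Passing dtype=object explicitly requests a 1-D array whose elements are the lists
arr = np.array(alist, dtype=object)
print(arr.shape, arr.dtype)

# A simple way to detect raggedness before building an array
row_lengths = [len(row) for row in alist]
is_ragged = len(set(row_lengths)) > 1
print(row_lengths, is_ragged)
```

As far as I know, from NumPy 1.24 onward the implicit conversion (without dtype=object) raises a ValueError instead of only warning, so being explicit also keeps the code working on newer versions.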

Attempting to take the mean of that list produces the same warning, since np.mean first has to make an array:

In [248]: np.mean(alist)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:163: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arr = asanyarray(a)
Out[248]: 
array([0.33333333, 0.66666667, 1.        , 1.33333333, 1.66666667,
       2.        ])

A warning doesn't give a traceback like an error would, but it does show the action that raised the warning. Also, the mean values are off: the lists have been 'flattened' into one, but the divisor is still 3 (the number of sublists)!
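To see where those six numbers come from, here is a pure-Python sketch that reproduces them without building a ragged array. It rests on my reading of what happens with an object array: the elements are summed with +, which concatenates Python lists, and the result is then divided by the element count.

```python
alist = [[1, 2, 3], [4, 5], [6]]

# Adding Python lists with + concatenates them rather than adding element-wise
joined = []
for sub in alist:
    joined = joined + sub

# Dividing by the number of sublists (3), not by each sublist's own length
wrong = [value / len(alist) for value in joined]
print(joined)
print(wrong)
```

So the output is not a mean of anything meaningful: it is every individual value divided by 3.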

As jezrael suggested, we can get the means for the sublists with:

In [249]: [np.mean(x) for x in alist]
Out[249]: [2.0, 4.5, 6.0]