Add list to existing csv

Time:03-30

Thanks for looking into this.

I am trying to add a list of calculated values (mean) as a new column to an existing csv.

Here is my MWE:

import csv
import re
import pandas as pd
import oseti
import numpy as np

# handle csv data
df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()
dtype_before = type(df["text"])
text_list = df["text"].tolist()

# create df for sentiment analysis
list_sa = np.mean(list(map(analyzer.analyze, text_list))).tolist()
df_sa = pd.DataFrame(list_sa, columns=['sa_mean'])
print(df_sa)

This part works, although I receive a warning:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

It prints the values correctly (since I am quite new, I wanted to make sure the output looks the way I want). The printed result looks something like this:

    sa_mean
0   0.000000
1   0.000000
2   0.000000
3  -0.018519
4   0.037037

However, if, instead of printing, I try to add it as a new column to the originally loaded csv ('filepath/text.csv'), I am not sure how to tackle it. Is it necessary to make it a DataFrame or a Series?

I tried this instead of the last print line:

df["new_column"] = df_sa
df.to_csv("text.csv", index=False)

However, I receive a warning. The csv is still created, but I would like to understand if there is something wrong:

Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.

I am not really sure why this is happening or how to fix it.

Thank you in advance!


Edit:

print(list_sa) will look like this:

[0.0, 0.0, 0.0, -0.018518518518518517, 0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.012345679012345678, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, 0.0, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, -0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.037037037037037035, 0.0, 0.037037037037037035, -0.037037037037037035, 0.0, 0.0, -0.037037037037037035, 0.0, 0.037037037037037035, 0.0, 0.0, -0.037037037037037035, -0.024691358024691357]

CodePudding user response:

Use a list comprehension with np.mean and assign the result to a new column; df_sa is not necessary here:

df = pd.read_csv('filepath/text.csv')
analyzer = oseti.Analyzer()

df['new_column'] = [np.mean(analyzer.analyze(x)) for x in df['text']]

Or use apply with a lambda function:

df['new_column'] = df['text'].apply(lambda x: np.mean(analyzer.analyze(x)))

df.to_csv("text.csv", index=False)
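
To try this pattern without installing oseti, here is a minimal sketch. Note the assumptions: `fake_analyze` is a hypothetical stand-in for `analyzer.analyze` (assumed to return one score per sentence, so the lists vary in length), and an in-memory CSV replaces 'filepath/text.csv'.

```python
import io

import numpy as np
import pandas as pd


def fake_analyze(text):
    # Hypothetical stand-in for analyzer.analyze: one score per sentence,
    # so the returned lists differ in length from row to row.
    sentences = [s for s in text.split(".") if s.strip()]
    return [1.0 if "good" in s else -1.0 if "bad" in s else 0.0 for s in sentences]


# In-memory CSV used in place of 'filepath/text.csv'.
csv_in = io.StringIO("text\ngood day. bad night. fine.\ngood.\nplain. bad.\n")
df = pd.read_csv(csv_in)

# One mean per row; a plain list with one entry per row can be assigned directly.
df["new_column"] = [np.mean(fake_analyze(x)) for x in df["text"]]

# Write the original column plus the new one.
csv_out = io.StringIO()
df.to_csv(csv_out, index=False)
print(csv_out.getvalue())
```

So, to the question of whether a DataFrame or Series is required: a plain list works as long as it has exactly one value per row of df.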

CodePudding user response:

Is it possible to tell which statement produces the warning? You may have to run the lines one by one, or with prints between them (if running the script).
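
One way to locate the statement, sketched here with the standard `warnings` module, is to turn warnings into exceptions so that you get a normal traceback. `noisy_step` is a hypothetical stand-in for whichever line of your script emits the warning; it uses a plain DeprecationWarning rather than NumPy's own warning class.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for the statement that emits the warning.
    warnings.warn("ragged nested sequences", DeprecationWarning)
    return "done"


with warnings.catch_warnings():
    # Inside this block every warning is raised as an exception,
    # so the offending statement shows up in a regular traceback.
    warnings.simplefilter("error")
    try:
        noisy_step()
        caught = None
    except DeprecationWarning as exc:
        caught = exc

print(type(caught).__name__, caught)
```

For a whole script, running it as `python -W error script.py` should have a similar effect without editing the code.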

I suspect it's this expression:

np.mean(list(map(analyzer.analyze,text_list)))

The warning means that your code (or something it calls) is trying to make an array from lists that vary in length. For example:

In [245]: alist = [[1,2,3],[4,5],[6]]
In [246]: alist
Out[246]: [[1, 2, 3], [4, 5], [6]]
In [247]: np.array(alist)
<ipython-input-247-7512d762195a>:1: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.array(alist)
Out[247]: array([list([1, 2, 3]), list([4, 5]), list([6])], dtype=object)

The result is a 1d array, with object dtype. It cannot make a 2d array from such a list.

Calling np.mean on that list produces the same warning, since it first has to make an array:

In [248]: np.mean(alist)
/usr/local/lib/python3.8/dist-packages/numpy/core/_methods.py:163: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  arr = asanyarray(a)
Out[248]: 
array([0.33333333, 0.66666667, 1.        , 1.33333333, 1.66666667,
       2.        ])

A warning doesn't give a traceback like an error would, but it does show the action that raised the warning. Also, the mean values are off: the lists have been 'flattened' into six values, but the divisor is 3 (the number of sublists)!
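
The numbers in Out[248] can be reproduced in plain Python. This is only a sketch of where the values come from, assuming the sublists are added together (which concatenates them) and then divided by the number of sublists; it is not a description of NumPy's internals.

```python
alist = [[1, 2, 3], [4, 5], [6]]

# Adding lists concatenates them, so the "sum" of the sublists is one flat list.
flat = []
for sub in alist:
    flat = flat + sub

# The divisor is the number of sublists (3), not the number of values (6).
wrong_means = [value / len(alist) for value in flat]
print(wrong_means)
```

That gives six values where three per-sublist means were wanted. As far as I know, from NumPy 1.24 on, building an array from ragged lists without dtype=object raises a ValueError instead of this warning.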

As jezrael suggested in the other answer, we can get the means for the sublists with:

In [249]: [np.mean(x) for x in alist]
Out[249]: [2.0, 4.5, 6.0]
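
One thing to watch for, assuming the analyzer can return an empty list for some rows (I have not checked whether oseti does): np.mean of an empty list returns nan and emits a RuntimeWarning. A small guard avoids that; `mean_or_default` is a hypothetical helper name, and the default of 0.0 is an arbitrary choice.

```python
import numpy as np


def mean_or_default(scores, default=0.0):
    # Hypothetical helper: skip np.mean for empty input, which would
    # otherwise return nan together with a RuntimeWarning.
    return float(np.mean(scores)) if len(scores) > 0 else default


alist = [[1, 2, 3], [4, 5], [6], []]
print([mean_or_default(x) for x in alist])
```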