Convert a Series whose elements are lists of different lengths to a count DataFrame

I would like to convert a Series (or a list) in which every element is a list of varying length into a DataFrame of counts.

Input:

series_x = pd.Series([
    ['a', 'b', 'c'],
    ['a', 'a', 'd'],
    ['b', 'c']
])

print(series_x)
0    [a, b, c]
1    [a, a, d]
2       [b, c]
dtype: object

Desired output:

print(df_x)
    a   b   c   d
0   1   1   1   0
1   2   0   0   1
2   0   1   1   0

CodePudding user response：

Use:

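import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

series_x = pd.Series([
    ['aa', 'bb', 'cc'],
    ['aa', 'dd'],
    ['bb', 'cc']
])

# Join each list into one space-separated string so it can be tokenized.
# min_df is left at its default (1); recent scikit-learn versions reject min_df=0.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([' '.join(x) for x in series_x])
print(X.toarray())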

Output:

[[1 1 1 0]
 [1 0 0 1]
 [0 1 1 0]]

Note that your data appears to be for demonstration only, so I changed it slightly. If your data really consists of one-character strings, set the following parameter accordingly:

token_pattern: str, default=r"(?u)\b\w\w+\b"

The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator).

CodePudding user response：

There is probably a more elegant approach than this one, but if you want a pure-Python solution, you can use the following:

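import pandas as pd

# Collect the distinct items in order of first appearance.
columns = []
for items in series_x:
    for item in items:
        if item not in columns:
            columns.append(item)

# Count each item's occurrences in every list, one row per list.
values = []
for items in series_x:
    values.append([items.count(item) for item in columns])

df = pd.DataFrame(values, columns=columns)
df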

Output:

	a	b	c	d
0	1	1	1	0
1	2	0	0	1
2	0	1	1	0
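
One possibly more compact variant of the same idea is sketched below. It swaps the manual loops for collections.Counter and lets the DataFrame constructor align the keys into columns; this is an alternative technique, not the code from the answer above. Items missing from a row come out as NaN, so they are filled with 0 and cast back to integers.

```python
from collections import Counter

import pandas as pd

series_x = pd.Series([
    ['a', 'b', 'c'],
    ['a', 'a', 'd'],
    ['b', 'c'],
])

# One Counter per row; pandas treats each Counter as a dict of column -> count.
df_x = (
    pd.DataFrame([Counter(items) for items in series_x], index=series_x.index)
    .fillna(0)           # items absent from a row appear as NaN
    .astype(int)         # restore integer counts after the fill
    .sort_index(axis=1)  # order the columns alphabetically
)
print(df_x)
```

Unlike the CountVectorizer approach, this does no tokenizing or lowercasing, so the list items are counted exactly as they are.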