I would like to convert a series (or a list) in which every element is a list of varying length into a data frame of counts, with one column per distinct item.
Input:
series_x = pd.Series([
['a', 'b', 'c'],
['a', 'a', 'd'],
['b', 'c']
])
print(series_x)
0 [a, b, c]
1 [a, a, d]
2 [b, c]
dtype: object
Desired output:
print(df_x)
a b c d
0 1 1 1 0
1 2 0 0 1
2 0 1 1 0
CodePudding user response:
Use:
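import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

series_x = pd.Series([
    ['aa', 'bb', 'cc'],
    ['aa', 'dd'],
    ['bb', 'cc']
])
# The default min_df=1 keeps every token; recent scikit-learn versions reject min_df=0.
vectorizer = CountVectorizer()
# CountVectorizer expects strings, so join each list into one space-separated document.
X = vectorizer.fit_transform([' '.join(x) for x in series_x])
print(X.toarray())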
Output:
[[1 1 1 0]
 [1 0 0 1]
 [0 1 1 0]]
Note that your data appears to be for demonstration only, so I changed it slightly. If your items really are one character long, set the following parameter accordingly:
token_pattern : str, default=r"(?u)\b\w\w+\b"
The default regexp selects tokens of 2 or more alphanumeric characters (punctuation is completely ignored and always treated as a token separator), so single-character items such as 'a' are silently dropped unless you relax the pattern, for example to r"(?u)\b\w+\b".
CodePudding user response:
I am sure there is a more elegant approach than what I have come up with, but if you are interested in a pure-Python way, you can use the following:
# Collect the distinct items in order of first appearance.
columns = []
for items in series_x:
    for item in items:
        if item not in columns:
            columns.append(item)

# Count how often each item occurs in every row.
values = []
for items in series_x:
    row = []
    for item in columns:
        row.append(items.count(item))
    values.append(row)

df = pd.DataFrame(values, columns=columns)
df
Output:
| | a | b | c | d |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 |
| 1 | 2 | 0 | 0 | 1 |
| 2 | 0 | 1 | 1 | 0 |
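One possibly shorter variant of the same counting idea, offered as a sketch rather than the definitive way: let collections.Counter do the per-row counting and let pandas align the columns. The name df_x is taken from the question, and the use of Counter is my own substitution for the explicit loops.

```python
from collections import Counter

import pandas as pd

series_x = pd.Series([
    ['a', 'b', 'c'],
    ['a', 'a', 'd'],
    ['b', 'c'],
])

# Each Counter maps item -> number of occurrences in that row. Building a
# DataFrame from these dict-like objects aligns them on the union of keys
# and leaves NaN where an item is absent from a row.
df_x = (
    pd.DataFrame([Counter(items) for items in series_x], index=series_x.index)
    .fillna(0)           # absent item means a count of zero
    .astype(int)         # NaN forced floats; convert back to integers
    .sort_index(axis=1)  # alphabetical column order
)
print(df_x)
```

Unlike the CountVectorizer route, this does not tokenize strings, so items of any length (or containing spaces) are counted as they are.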