I have a dataframe which contains the following data (only 3 samples are provided here):
data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
# Create DataFrame
df = pd.DataFrame(data)
Basically, each row contains a tuple of top 10 words along with their frequencies in each of Department.
I wanted to create a dataframe where (let department name get repeated and) each row contains the word from tuple in one column and frequency count in the other columns so that it should look like something this:
Department Word Counts
D1 cat 6
D1 project 6
D1 dog 6
D1 develop 4
D1 smooth 4
D1 efficient 4
D1 administrative 4
D1 procedure 4
D1 establishment 3
D1 matter 3
D2 management 21
D2 satisfaction 12
D2 within 9
D2 budget 9
D2 township 9
Is there any work around this type of conversion?
CodePudding user response:
I'd suggest you do the wrangling before loading into a dataframe, with the data
dictionary:
length = [len(entry) for entry in data['TopWords']]
department = {'Department' : np.repeat(data['Department'], length)}
(pd
.DataFrame([ent for entry in data['TopWords'] for ent in entry],
columns = ['Word', 'Counts'])
.assign(**department)
)
Word Counts Department
0 cat 6 D1
1 project 6 D1
2 dog 6 D1
3 develop 4 D1
4 smooth 4 D1
5 efficient 4 D1
6 administrative 4 D1
7 procedure 4 D1
8 establishment 3 D1
9 matter 3 D1
10 management 21 D2
11 satisfaction 12 D2
12 within 9 D2
13 budget 9 D2
14 township 9 D2
15 site 9 D2
16 periodic 9 D2
17 admin 9 D2
18 maintain 9 D2
19 guest 6 D2
20 manage 2 D3
21 ir 2 D3
22 mines 2 D3
23 implimentation 2 D3
24 clrg 2 D3
25 act 2 D3
26 implementations 2 D3
27 office 2 D3
28 maintenance 2 D3
29 administration 2 D3
CodePudding user response:
First, use DataFrame.explode
to separate the list elements into different rows. Then split the tuples into different columns, e.g. using DataFrame.assign
Series.str
res = (
df.explode('TopWords', ignore_index=True)
.assign(Word=lambda df: df['TopWords'].str[0],
Counts=lambda df: df['TopWords'].str[1])
.drop(columns='TopWords')
)
Output:
>>> res
Department Word Counts
0 D1 cat 6
1 D1 project 6
2 D1 dog 6
3 D1 develop 4
4 D1 smooth 4
5 D1 efficient 4
6 D1 administrative 4
7 D1 procedure 4
8 D1 establishment 3
9 D1 matter 3
10 D2 management 21
11 D2 satisfaction 12
12 D2 within 9
13 D2 budget 9
14 D2 township 9
15 D2 site 9
16 D2 periodic 9
17 D2 admin 9
18 D2 maintain 9
19 D2 guest 6
20 D3 manage 2
21 D3 ir 2
22 D3 mines 2
23 D3 implimentation 2
24 D3 clrg 2
25 D3 act 2
26 D3 implementations 2
27 D3 office 2
28 D3 maintenance 2
29 D3 administration 2
As @sammywemmy suggested, if you are dealing with a considerable amount of data, it will be faster if you wrangle it before loading it into a DataFrame.
Another way of doing it using a nested loop
data = {'Department' : ['D1', 'D2', 'D3'],
'TopWords' : [[('cat', 6), ('project', 6), ('dog', 6), ('develop', 4), ('smooth', 4), ('efficient', 4), ('administrative', 4), ('procedure', 4), ('establishment', 3), ('matter', 3)],
[('management', 21), ('satisfaction', 12), ('within', 9), ('budget', 9), ('township', 9), ('site', 9), ('periodic', 9), ('admin', 9), ('maintain', 9), ('guest', 6)],
[('manage', 2), ('ir', 2), ('mines', 2), ('implimentation', 2), ('clrg', 2), ('act', 2), ('implementations', 2), ('office', 2), ('maintenance', 2), ('administration', 2)]]}
records = []
for idx, top_words_list in enumerate(data['TopWords']):
for word, count in top_words_list:
rec = {
'Department': data['Department'][idx],
'Word': word,
'Count': count
}
records.append(rec)
res = pd.DataFrame(records)
CodePudding user response:
In addition to @sammywemmy answer, following approach would not need numpy
package, however due to double loops, it might not be as performant in large number of datasets.
df = {"Department": [], "Words": [], "Count": []}
for i in range(len(data["Department"])):
for j in range(len(data["TopWords"][i])):
df["Department"].append(data["Department"][i])
df["Words"].append(data["TopWords"][i][j][0])
df["Count"].append(data["TopWords"][i][j][1])
print(pd.DataFrame(df))
# (SUGGESTION) A more pythonic version of the code above
d = {"Department": [], "Words": [], "Count": []}
for idx, department in enumerate(data["Department"]):
for word, count in data["TopWords"][idx]:
d["Department"].append(department)
d["Words"].append(word)
d["Count"].append(count)
print(pd.DataFrame(d))
CodePudding user response:
One option using a dictionary comprehension:
(df
.drop(columns='TopWords')
.join(pd.concat({k: pd.DataFrame(x, columns=['Word', 'Counts'])
for k,x in enumerate(df['TopWords'])}).droplevel(1))
)
output:
Department Word Counts
0 D1 cat 6
0 D1 project 6
0 D1 dog 6
0 D1 develop 4
0 D1 smooth 4
0 D1 efficient 4
0 D1 administrative 4
0 D1 procedure 4
0 D1 establishment 3
0 D1 matter 3
1 D2 management 21
1 D2 satisfaction 12
1 D2 within 9
1 D2 budget 9
1 D2 township 9
1 D2 site 9
1 D2 periodic 9
1 D2 admin 9
1 D2 maintain 9
1 D2 guest 6
2 D3 manage 2
2 D3 ir 2
2 D3 mines 2
2 D3 implimentation 2
2 D3 clrg 2
2 D3 act 2
2 D3 implementations 2
2 D3 office 2
2 D3 maintenance 2
2 D3 administration 2