I guess professional data analysts know an answer to this, but I'm no analyst. And I just barely know Pandas. So I am at a loss.
There are two lists. Their contents are unpredictable (parsed from web counters, web analytics, web statistics, etc).
list1 = ['WordA', 'WordB', ..., 'WordXYZ']
...and...
list2 = [['WordA1', 'WordA2'], ['WordB1'], ['WordC1', 'WordC2', ,'WordC96'], ..., ['WordXYZ1', 'WordXYZ2']]
Length of two lists are always equal (they`re the results of work of parser I already wrote)
What I need is to create a dataframe which will have two rows for each item in list1
, each containing the word in first column, and then put corresponding words from list2
into first row of those two (starting from second column, first column to bealready filled from list1
).
So I imagine the following steps:
- Create a dataframe filled with empty strings ('') with number of columns equal to
len(max(list2, key=len))
and number of rows equal to twice length oflist1
(aaaand I don't know how, this is actually my very second time using Pandas at all!); - Somehow fill first column of resulting dataframe with contents of
list1
, filling two rows for each item inlist1
; - Somehow put contents of
list2
into every even row of the dataframe, starting with second column; - Save into .xls file (yes, that's the final goal), enjoy job done.
Now first thing, I already spend half a day trying to find an answer "how to create pandas dataframe filled with empty strings with given number of rows and columns", and found a lot of different articles, which contradict each other.
And second, there's got to be a way to do all this more pythonic, more efficient and more stylish way!
Aaaand, maybe there`s a way to create an excel file without using pandas at all, which I just don't know about (hopefully, yet)
Can anyone help, please?
UPD: (to answer a question) the results should look like:
WordA WordA1 WordA2
WordA
WordB WordB1
WordB
WordC WordC1 WordC2 (...) WordC96
WordC
(...)x2
WordXYZ WordXYZ1 WordXYZ2
WordXYZ
CodePudding user response:
The following should give you (almost) what you want:
import pandas as pd
from itertools import chain
list1 = ['WordA', 'WordB']
list2 = [['WordA1', 'WordA2'], ['WordB1']]
# Flatten list 2
list2 = list(chain(*list2))
# Create DataFrames
list1 = pd.DataFrame(data=list1, columns=["word1"])
list2 = pd.DataFrame(data=list2, columns=["word2"])
# Prefix for list2
list2["prefix"] = list2["word2"].str.extract("([^0-9] )")
list1 = list1.merge(list2, left_on="word1", right_on="prefix", how="inner")
# Concatenated words
list1 = list1.groupby("word1")["word2"].agg(lambda x: " ".join(x)).reset_index()
list1["word2"] = list1["word1"].str.cat(list1["word2"], sep=" ")
list1 = pd.melt(list1).sort_values(by="value")
CodePudding user response:
If you just want to write the lists to an Excel file, you don't need pandas. You can use for instance openpyxl
:
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
for *word, words in zip(list1, list2):
ws.append(word words)
ws.append(word)
wb.save('output.xlsx')
If you really want to use pandas:
import pandas as pd
df = pd.DataFrame([[None] x if isinstance(x, list) else [x] for pair in zip(list2, list1) for x in pair])
df[0] = df[0].bfill()
df.to_excel('output.xlsx', index=False, header=False)