I need to preprocess some data so that I can start analyzing it. I currently have a data frame which contains data on Eurovision winners. I need to create a new data frame which contains the words from each of the songs, with the points of each song paired with each word in a tuple. For example, if the song name is 'Hello World' and the score is 31, I need to create two tuples, (Hello, 31) and (World, 31), and add them to a list from which I can create a new data frame.
Sample input
Here is the first row of my dataframe.
Sample Output
The output I want from the first row is
[('Net', 31),('als', 31),('toen', 31)]
Attempt
def TupleGenerator(row):
    list = []
    for item in ev['Song']:
        tuple = (item, ev["Points"])
        list.append(tuple)
    return list
TupleGenerator(ev.iloc[0])
This is what I have tried so far, but I am not sure how to get the score from the same row to be assigned to the word in the tuple.
Any advice is appreciated, thank you.
CodePudding user response:
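You have the right idea, but your loop runs over ev['Song'], the entire column, instead of the row passed to the function, so the points from that row never get paired with its words. Switching to row["Song"] alone is not enough either, because iterating over a string yields one character at a time. You need to split the string into a sequence of substrings, one per word, and then iterate over that sequence. This code shows one way to do that: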
def TupleGenerator(row):
    result = []
    for word in row["Song"].strip('"').split():
        result.append((word, row["Points"]))
    return result
The strip method of strings accepts one optional argument, a string specifying the set of characters to remove from both ends of the string. In our case, we need to remove the " character. The split method without any arguments returns a list of the words in the string, treating each run of consecutive whitespace as a single delimiter.
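To see what each call does on its own, here is a small sketch using the sample title from the question. The variable names are only for illustration.

```python
# Illustrative value only; `title` mimics the quoted song title from the question.
title = '"Net als toen"'

# Iterating over a string directly yields single characters, not words.
chars = list(title)[:4]

# strip('"') removes double quotes from both ends of the string only.
unquoted = title.strip('"')

# split() with no arguments splits on runs of whitespace.
words = unquoted.split()

print(chars)     # ['"', 'N', 'e', 't']
print(unquoted)  # Net als toen
print(words)     # ['Net', 'als', 'toen']
```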
For example, if your df is
import pandas as pd

df = pd.DataFrame(
    {
        "Year": 1957,
        "Date": "3-Mar",
        "Host City": ["Frankfurt", "Linux"],
        "Winner": ["Netherlands", "Unix"],
        "Song": ['"Net als toen"', '"git hub"'],
        "Performer": ["Corry Brokken", "Stack Overflow"],
        "Points": [31, 32],
        "Margin": [14, 15],
        "Runner-up": ["France", "cyberspace"],
    }
)
Running
for index, row in df.iterrows():
    print(TupleGenerator(row))
gives output
[('Net', 31), ('als', 31), ('toen', 31)]
[('git', 32), ('hub', 32)]
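Since the end goal in the question is a new data frame, the per-row lists can be flattened into one list and passed to pd.DataFrame. This is a minimal sketch, assuming a cut-down df with only the two columns that matter; the column names "Word" and "Points" and the variable names pairs and words_df are my own choices, not something from the question.

```python
import pandas as pd

# Small stand-in for the question's data frame; only the two columns used here.
df = pd.DataFrame({"Song": ['"Net als toen"', '"git hub"'], "Points": [31, 32]})

def TupleGenerator(row):
    # Same logic as the function above: one (word, points) tuple per word.
    return [(word, row["Points"]) for word in row["Song"].strip('"').split()]

# Collect the tuples from every row into one flat list.
pairs = []
for _, row in df.iterrows():
    pairs.extend(TupleGenerator(row))

# "Word" and "Points" are assumed column names for the new data frame.
words_df = pd.DataFrame(pairs, columns=["Word", "Points"])
print(words_df)
```

Using extend rather than append keeps pairs a flat list of tuples, which is the shape pd.DataFrame expects when each tuple is one row.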
I hope this helps. Let me know if there are any questions!