What I am trying to do: I have the following dataframe. I need to capture the rows where space is 4 or more and place each consecutive run of such rows into its own dictionary.
For example, the rows between 2. harry potter and A. test book would be captured, and so would the rows between A. test book and F. book3 (desired output shown below).
I cannot predefine the number of dictionaries, because how many I need depends on the data.
Dataframe example:
name | A | space
---------------|---|------
2. harry potter| 1 | 2
jk rowling | 1 | 4
testing | 3 | 4
A. test book | 4 | 2
author1 | 4 | 4
author2 | 4 | 4
author3 | 4 | 4
F. book3 | 5 | 2
Desired output:
dict_A = {'name': ['jk rowling', 'testing'], 'A': [1, 3], 'space': [4, 4]}
dict_B = {'name': ['author1', 'author2', 'author3'], 'A': [4, 4, 4], 'space': [4, 4, 4]}
What I have tried:
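lst_A = []
for i in df["name"]:
    # skip the title rows, which start with a known prefix
    if not i.startswith(('2.', 'A.', 'F.')):
        # keep names indented by more than 3 leading spaces
        if len(i) - len(i.lstrip()) > 3:
            lst_A.append(i)
    else:
        # title row: nothing to collect
        pass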
The code above gives me only the values from the name column that have 4 or more leading spaces. However, I am not able to segregate them correctly into individual lists or a dictionary and capture all the information. Is anyone able to point me in the right direction?
For ref: df as a dictionary:
{'name': {0: '2. harry potter',
1: ' jk rowling ',
2: ' testing ',
3: 'A. test book ',
4: ' author1 ',
5: ' author2 ',
6: ' author3 ',
7: 'F. book3 '},
'A': {0: 1, 1: 1, 2: 3, 3: 4, 4: 4, 5: 4, 6: 4, 7: 5},
'space': {0: 2, 1: 4, 2: 4, 3: 2, 4: 4, 5: 4, 6: 4, 7: 2}}
CodePudding user response:
For this, I would use masks to select the data you want, and then just loop through the selection (using DataFrame.iterrows() or DataFrame.itertuples()).
First create the mask:
mask = df['A'] > df['space']
selected_rows = df.loc[mask, :]
# Loop through these rows
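As a rough sketch of what that loop could look like: the example below rebuilds the question's dataframe and iterates with DataFrame.itertuples(). The condition space >= 4 is an assumption taken from the question's data, not part of the snippet above, and the records list is a hypothetical name used only for illustration. Swap in whichever mask fits your case.

```python
import pandas as pd

# Data from the question (names already stripped of padding)
df = pd.DataFrame({
    "name": ["2. harry potter", "jk rowling", "testing", "A. test book",
             "author1", "author2", "author3", "F. book3"],
    "A": [1, 1, 3, 4, 4, 4, 4, 5],
    "space": [2, 4, 4, 2, 4, 4, 4, 2],
})

# Assumed condition: keep the indented rows (space of 4 or more)
mask = df["space"] >= 4
selected_rows = df.loc[mask, :]

# itertuples() yields one namedtuple per row; index=False omits the index field
records = []
for row in selected_rows.itertuples(index=False):
    records.append({"name": row.name, "A": row.A, "space": row.space})
```

Note that this collects every matching row into one flat list. It does not yet split the rows into separate groups per title.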
CodePudding user response:
You can use the following:
# remove spaces from 'name' (if needed to keep original df, make a copy first)
df['name'] = df['name'].str.strip()
# create group
m = df['space'].eq(4)
group = (m & (~m.shift(fill_value=False))).cumsum().where(m)
# create dictionary of dict
out = {'dict_%d' % k: d.to_dict('list') for k,d in df.groupby(group)}
output:
{'dict_1': {'name': ['jk rowling', 'testing'], 'A': [1, 3], 'space': [4, 4]},
'dict_2': {'name': ['author1', 'author2', 'author3'], 'A': [4, 4, 4], 'space': [4, 4, 4]},
}
Accessing the dictionaries:
>>> out['dict_1']
{'name': ['jk rowling', 'testing'], 'A': [1, 3], 'space': [4, 4]}
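To make the grouping step easier to follow, here is the same approach split into intermediate steps on the question's data. This is only a sketch: the starts variable is a name introduced here for illustration, and the int(k) cast is added because the group labels come out as floats once NaN is introduced.

```python
import pandas as pd

# Data from the question, including the padded names
df = pd.DataFrame({
    "name": ["2. harry potter", " jk rowling ", " testing ", "A. test book ",
             " author1 ", " author2 ", " author3 ", "F. book3 "],
    "A": [1, 1, 3, 4, 4, 4, 4, 5],
    "space": [2, 4, 4, 2, 4, 4, 4, 2],
})
df["name"] = df["name"].str.strip()

# True on the indented rows
m = df["space"].eq(4)
# True only on the first row of each consecutive run of indented rows
starts = m & ~m.shift(fill_value=False)
# Running count of runs, kept on indented rows and NaN on title rows
group = starts.cumsum().where(m)

# groupby skips NaN labels, so title rows are left out of every group
out = {f"dict_{int(k)}": d.to_dict("list") for k, d in df.groupby(group)}
```

Because the number of groups comes from the data, nothing has to be declared in advance: a third run of indented rows would simply appear as out['dict_3'].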