Python - Extracting specific data from rows based on a condition in a dataframe

Time:10-08

What I am trying to do: I have the following dataframe. I need to capture the rows where space is 4 or more and place each run of such rows into its own dictionary. For example, the rows between 2. harry potter and A. test book would be captured in one dictionary, and the rows between A. test book and F. book3 in another (desired output shown below). I cannot predefine the number of dictionaries, because how many I need depends on the data.

Dataframe example:


name           | A | space
---------------|---|------
2. harry potter| 1 | 2    
   jk rowling  | 1 | 4   
   testing     | 3 | 4
A. test book   | 4 | 2
   author1     | 4 | 4
   author2     | 4 | 4
   author3     | 4 | 4
F. book3       | 5 | 2

Desired output:

dict_A = {'name': ['jk rowling', 'testing'], 'A': ['1', '3'], 'space': ['4', '4']}
dict_B = {'name': ['author1', 'author2', 'author3'], 'A': ['4', '4', '4'], 'space': ['4', '4', '4']}

What I have tried:

lst_A =[]

for i in df["name"]:
    if not i.startswith(('2.','A.','F.')):
        if len(i) - len(i.lstrip()) > 3:
            lst_A.append(i)
        else:
            if i.startswith(('2.','A.','F.')):
                pass
            

The above code gives me only the values from the name column that are indented by 4 or more spaces. However, I am not able to segregate them correctly into individual lists or dictionaries, or to capture the information from the other columns. Is anyone able to point me in the right direction?

For reference, df as a dictionary:

{'name': {0: '2. harry potter',
  1: '   jk rowling  ',
  2: '   testing     ',
  3: 'A. test book   ',
  4: '   author1     ',
  5: '   author2     ',
  6: '   author3     ',
  7: 'F. book3       '},
 'A': {0: 1, 1: 1, 2: 3, 3: 4, 4: 4, 5: 4, 6: 4, 7: 5},
 'space': {0: 2, 1: 4, 2: 4, 3: 2, 4: 4, 5: 4, 6: 4, 7: 2}}

CodePudding user response:

For this, I would use a boolean mask to select the rows you want, and then loop through the selection (using DataFrame.iterrows() or DataFrame.itertuples()).

First create the mask:

# True for the indented rows, i.e. rows whose space value is 4 or more
mask = df['space'] >= 4
selected_rows = df.loc[mask, :]
# Loop through these rows
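
A mask alone returns all matching rows in one block, so the loop still has to notice where one run of rows ends and the next begins. The following is a minimal sketch of that loop, not the answerer's own code. It assumes the intended condition is space >= 4 and rebuilds df from the dictionary in the question; the names groups and current are made up for illustration.

```python
import pandas as pd

# Sample data copied from the dictionary in the question
df = pd.DataFrame({
    'name': ['2. harry potter', '   jk rowling  ', '   testing     ',
             'A. test book   ', '   author1     ', '   author2     ',
             '   author3     ', 'F. book3       '],
    'A': [1, 1, 3, 4, 4, 4, 4, 5],
    'space': [2, 4, 4, 2, 4, 4, 4, 2],
})

# Assumed condition: the rows to capture have a space value of 4 or more
mask = df['space'] >= 4

groups = []     # one dictionary per run of consecutive selected rows
current = None  # dictionary being filled, or None when outside a run

for row, selected in zip(df.itertuples(index=False), mask):
    if not selected:
        current = None  # a non-selected row closes the current run
        continue
    if current is None:
        # first selected row after a gap: start a new dictionary
        current = {col: [] for col in df.columns}
        groups.append(current)
    for col, value in zip(df.columns, row):
        # strip the padding from string cells, keep numbers as they are
        current[col].append(value.strip() if isinstance(value, str) else value)
```

With the sample data, groups ends up holding two dictionaries, one per book. Because the list grows as runs are found, the number of dictionaries does not have to be known in advance.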

CodePudding user response:

You can use the following:

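# strip surrounding whitespace from 'name'
# (to keep the original df intact, work on a copy instead)
df['name'] = df['name'].str.strip()

# m is True for the rows to capture (space == 4);
# a run starts where m is True and the previous row's m is False,
# so the cumulative sum of those starts numbers the runs 1, 2, ...
# and where(m) sets the rows outside any run to NaN
m = df['space'].eq(4)
group = (m & (~m.shift(fill_value=False))).cumsum().where(m)

# build one dictionary per run (groupby skips the NaN rows)
out = {'dict_%d' % k: d.to_dict('list') for k,d in df.groupby(group)}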

Output:

{'dict_1': {'name': ['jk rowling', 'testing'], 'A': [1, 3], 'space': [4, 4]},
 'dict_2': {'name': ['author1', 'author2', 'author3'], 'A': [4, 4, 4], 'space': [4, 4, 4]}}

Accessing the dictionaries:

>>> out['dict_1']
{'name': ['jk rowling', 'testing'], 'A': [1, 3], 'space': [4, 4]}
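
To see why the grouping expression works, it can help to split it into its intermediate steps. The sketch below rebuilds df from the dictionary in the question and names the intermediate Series starts, which is a label introduced here for illustration and not part of the answer above.

```python
import pandas as pd

# Sample data copied from the dictionary in the question
df = pd.DataFrame({
    'name': ['2. harry potter', '   jk rowling  ', '   testing     ',
             'A. test book   ', '   author1     ', '   author2     ',
             '   author3     ', 'F. book3       '],
    'A': [1, 1, 3, 4, 4, 4, 4, 5],
    'space': [2, 4, 4, 2, 4, 4, 4, 2],
})
df['name'] = df['name'].str.strip()

m = df['space'].eq(4)                    # True on the rows to capture
starts = m & ~m.shift(fill_value=False)  # True only on the first row of each run
group = starts.cumsum().where(m)         # run number on captured rows, NaN elsewhere

# Side-by-side view of the three steps
steps = pd.DataFrame({'name': df['name'], 'm': m, 'starts': starts, 'group': group})
print(steps)

# Same dictionary comprehension as above; groupby drops the NaN group by default
out = {'dict_%d' % k: d.to_dict(orient='list') for k, d in df.groupby(group)}
```

Rows 1 and 4 are the only ones where starts is True, so the cumulative sum labels rows 1 to 2 as run 1 and rows 4 to 6 as run 2, while the header rows (0, 3 and 7) become NaN and are left out of the groupby.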