Home > Blockchain >  How to filter out and extract data from a dataframe which contains multiple lines per row?
How to filter out and extract data from a dataframe which contains multiple lines per row?

Time:06-06

I have a dataframe that looks something like this

   Detail
  --------------
0  Animal: Lion,
    Weight: 600 kg,
    Class: Mammal

1  Animal: Monkey,
    Weight: 10 kg,

2  Animal: Snake,
    Class: Reptile

3  Animal: Frog,
    Class: Amphibian,
    Weight: 1 kg

4  Animal: Hawk,
    Class: Bird

I need to read each row to see if a particular string is present in each line, then strip the line and add it to a list. If the value is not present, I need it to be appended with a null value. Then, I want to export this list as a csv file.

My desired output here is

   Animal
  --------------
0  Animal: Lion,
1  Animal: Monkey,
2  Animal: Snake,
3  Animal: Frog,
4  Animal: Hawk,


   Weight
  --------------
0  Weight: 600 kg,
1  Weight: 10 kg,
2  Weight: NaN
3  Weight: 1 kg,
4  Weight: NaN

   Class
  --------------
0  Class: Mammal,
1  Class: NaN
2  Class: Reptile
3  Class: Amphibian
4  Class: NaN

I need these 3 dataframes to be exported as a csv file.

I have written something like this, but don't think it is the right approach

cols= ['Animal','Weight','Class']


for entry in cols:
    values=[]
    for content in df['Detail']:
        if(df['Detail'].str.contains(entry)):
            var=content.str.split('\n')
        else:
            var= np.nan
        values.append(var)
    values_df = pd.DataFrame(values)
    values_df.to_csv('%s.csv' %entry,header=entry,index=False)

CodePudding user response:

Please check this. Splitting a pandas dataframe column by delimiter I haven't run it, but it should be something like:

df[['Animal','Weight', 'Class']] = df['Detail'].str.split(',',expand=True)

If you want then to convert the column "Animal" to a list. you can do it like so:

arr = df["Animal"].to_numpy()

CodePudding user response:

In a way, your trial is correct, the only part you have missed is filtering the correct column values(as Animal, Weight, Class will not be in order)
#splitted at comma and removed extra space if any in all values

df['Detail'] = [[y.strip() for y in x.split(',')] for x in df['Detail'].values]
vals = []
for idx, row in df.iterrows():
    temp = {'Animal':np.nan,'Weight':np.nan,'Class':np.nan}
    for x in row['details']:
        for key in temp.keys():
            if x.startswith(key):
                temp[key]= x
    vals.append(temp)
final_df = pd.DataFrame(vals)
#print('final_df==', final_df)

Now you final_df will have 3 columns, and you can easily create separate dfs out of it by

final_df[['columnName']].to_csv('filaname.csv', index=False)

Let me know if it helps or in case you need more clarity on code.

  • Related