Assigning numeric values to column in Python-CodePudding

I'm a beginner so sorry in advance if I'm unclear :)

I have a .csv with 2 columns, doc_number and text. However, sometimes the rows start with the doc number (as they should) and sometimes it just starts with text from the previous row. All input is the type 'object'. There are also many empty rows between the inputs.

How can I make sure doc_number consists of all the numeric values (doc_numbers are random numbers of 8 digits) and text is the text, and remove the empty rows?

Example of what it looks like:

69353029, Hello. How are you doing?


What are you going to do tomorrow?


59302058, Tomorrow I have to go to work.


58330394, It's going to rain tomorrow


45801923, Yesterday it was sunny.


Next week it will also be sunny.


68403942, Thank you.

What it should look like:

_doc, _text
69353029, Hello. How are you doing? What are you going to do tomorrow?
59302058, Tomorrow I have to go to work.
58330394, It's going to rain tomorrow.
45801923, Yesterday it was sunny. Next week it will also be sunny.
68403942, Thank you.```

CodePudding user response：

Here's what I can think of when looking at the dataset. It looks like a CSV file which translates to an unnamed column (representing _doc) with mixed integers and strings and another unnamed column with strings (representing _text)

Since, it has to be csv, I created the following

What we can do is separate the mixed column into two different columns m_numeric and m_text. This can be done by creating the m_numeric column using to_numeric(,errors='coerce') and non-numeric strings masked to m_text column as follows:

df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']

We can now fill the NaN values in m-numeric using ffill from

Now we have the final columns we need as text and m_numeric on which we can apply the pandas groupby() function. We groupby the m_numeric column and use .apply to join the strings in two rows separated by a space. Finally we can rename the column names to _doc and _text as follows:

df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df  = df.rename(columns= {'m_numeric' : '_doc',
                   'text' : '_text'})

Result:

Complete Code:

import pandas as pd

df = pd.read_csv('dummy.csv', header= None)
df.columns = ['mixed', 'text']

#separate column
df['m_numeric'] = pd.to_numeric(df['mixed'], errors='coerce')
mask = df['m_numeric'].isna()
df.loc[mask, 'm_text'] = df.loc[mask, 'mixed']

#replace nan values  
df['m_numeric'] = df['m_numeric'].fillna(method='ffill')
df['m_numeric'] = df['m_numeric'].astype(int)
df['text'] = df['text'].fillna(df['m_text'])

#group by column
df = df.groupby('m_numeric', sort= False)['text'].apply(' '.join).reset_index()
df  = df.rename(columns= {'m_numeric' : '_doc',
                   'text' : '_text'})

df

CodePudding user response：

also just learnning to panda, so maybe not the best solution:

import pandas as pd
import numpy as np

#reading .csv and naming the Headers, - you can choose your own namings here
df = pd.read_csv("text.csv", header=None, names = ["_doc","_text"])

# updating _text column emty cells with values from _doc
df["_text"] = np.where(df['_text'].isnull(),df["_doc"],df["_text"])

# change dtype to int (it will generate <NA> in strings) and back to str to aggregate later
df["_doc"] = df["_doc"].apply(pd.to_numeric, errors="coerce").astype("Int64").astype(str)

# aggregating rows if below value is <NA> and joining strings in col _text
df = df.groupby((df["_doc"].ne("<NA>")).cumsum()).agg({"_doc":"first","_text":" ".join}).reset_index(drop=True)

# converting back to int (if needed)
df["_doc"] = df["_doc"].astype(int)
print(df)

out:

       _doc                                              _text
0  69353029   Hello. How are you doing? What are you going ...
1  59302058                     Tomorrow I have to go to work.
2  58330394                        It's going to rain tomorrow
3  45801923   Yesterday it was sunny. Next week it will als...
4  68403942                                         Thank you.