using groupby for datetime values in pandas

Time:11-10

I'm using this code to group my data by year:

df = pd.read_csv('../input/companies-info-wikipedia-2021/sparql_2021-11-03_22-25-45Z.csv')
df_duplicate_name = df[df.duplicated(['name'])]
df = df.drop_duplicates(subset='name').reset_index()
df = df.drop(['a','type','index'],axis=1).reset_index()
df = df[~df['foundation'].str.contains('[A-Za-z]', na=False)]
df = df.drop([140,214,220])
df['foundation'] = df['foundation'].fillna(0)
df['foundation'] = pd.to_datetime(df['foundation'])
df['foundation'] = df['foundation'].dt.year
df = df.groupby('foundation')

But the result is not grouped by the foundation values:

0   0   Deutsche EuroShop AG    1999    http://dbpedia.org/resource/Germany Investment in shopping centers  http://dbpedia.org/resource/Real_property   4   2.964E9 1.25E9  2.241E8 8.04E7
1   1   Industry of Machinery and Tractors  1996    http://dbpedia.org/resource/Belgrade    http://dbpedia.org/resource/Tractors    http://dbpedia.org/resource/Agribusiness    4   4.648E7 0.0 30000.0 -€0.47 million
2   2   TelexFree Inc.  2012    http://dbpedia.org/resource/Massachusetts   99  http://dbpedia.org/resource/Multi-level_marketing   7   did not disclose    did not disclose    did not disclose    did not disclose
3   3   (prev. Common Cents Communications Inc.)    2012    http://dbpedia.org/resource/United_States   99  http://dbpedia.org/resource/Multi-level_marketing   7   did not disclose    did not disclose    did not disclose    did not disclose
4   4   Bionor Holding AS   1993    http://dbpedia.org/resource/Oslo    http://dbpedia.org/resource/Health_care http://dbpedia.org/resource/Biotechnology   18  NOK 253 395 million NOK 203 320 million 1.09499E8   NOK 49 020 million
... ... ... ... ... ... ... ... ... ... ... ...
255 255 Ageas SA/NV 1990    http://dbpedia.org/resource/Belgium http://dbpedia.org/resource/Insurance   http://dbpedia.org/resource/Financial_services  45000   1.0872E11   1.348E10    1.112E10    9.792E8
256 256 Sharp Corporation   1912    http://dbpedia.org/resource/Japan   Televisions, audiovisual, home appliances, inf...   http://dbpedia.org/resource/Consumer_electronics    52876   NaN NaN NaN NaN
257 257 Erste Group Bank AG 2008    Vienna, Austria Retail and commercial banking, investment and ...   http://dbpedia.org/resource/Financial_services  47230   2.71983E11  1.96E10 6.772E9 1187000.0
258 258 Manulife Financial Corporation  1887    200 Asset management, Commercial banking, Commerci...   http://dbpedia.org/resource/Financial_services  34000   750300000000    47200000000 39000000000 4800000000
259 259 BP plc  1909    London, England, UK http://dbpedia.org/resource/Natural_gas http://dbpedia.org/resource/Petroleum_industry

I also tried converting the column with pd.to_datetime again and sorting by dt.year, but that was still unsuccessful.

Column names:

Index(['index', 'name', 'foundation', 'location', 'products', 'sector',
   'employee', 'assets', 'equity', 'revenue', 'profit'],
  dtype='object')

CodePudding user response:

I think you're misunderstanding how groupby() works.

Assigning df = df.groupby('foundation') does not give you a regrouped DataFrame. groupby() does not return a new DataFrame. Instead, it returns a GroupBy object, which is essentially a mapping from each grouped-by value to a DataFrame containing the rows that share that value in the specified column.

You can, for example, print how many rows are in each group with the following code:

groups = df.groupby('foundation')
for val, sub_df in groups:
    print(f'{val}: {sub_df.shape[0]} rows')
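To make the difference concrete, here is a minimal sketch on a small made-up DataFrame (the toy data and the variable names are assumptions for illustration, not the asker's dataset). It checks that the result of groupby() is not a DataFrame, and shows two common ways to get regular pandas objects back out of it:

```python
import pandas as pd

# Hypothetical toy data standing in for the question's DataFrame
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "foundation": [1999, 2012, 2012, 1993],
})

groups = df.groupby("foundation")

# The grouped object is a GroupBy, not a DataFrame
is_dataframe = isinstance(groups, pd.DataFrame)

# Aggregating turns the groups back into a regular pandas object:
# a Series of row counts indexed by foundation year
counts = groups.size()

# Pulling out a single group gives a DataFrame with only that year's rows
rows_2012 = groups.get_group(2012)

print(is_dataframe)
print(counts)
print(rows_2012)
```

So printing or displaying the grouped object directly will not show a table ordered by year; you need to aggregate or iterate over the groups first.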

CodePudding user response:

@Ruslan you simply need to sort, not group. You can generally achieve this in two ways:

myDF.sort_values(by='column_name', ascending=True, inplace=True)

or, if you need that column as the index, set it as the index and then sort by it:

myDF = myDF.set_index('column_name')
myDF = myDF.sort_index(ascending=True)

GroupBy is a totally different operation. It is used to perform calculations after grouping values by some criterion, such as finding the sum, average, min, or max within each group.
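As a rough illustration of the distinction, the sketch below runs both operations on a small made-up DataFrame (the data and the choice of the 'employee' column for aggregation are assumptions for the example). Sorting keeps every row and only reorders them, while grouping followed by an aggregation yields one row per distinct year:

```python
import pandas as pd

# Hypothetical toy data; column names mirror the question's DataFrame
df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "foundation": [1999, 2012, 2012, 1993],
    "employee": [4, 7, 7, 18],
})

# Sorting: same rows, reordered by foundation year
sorted_df = df.sort_values(by="foundation", ascending=True)

# Grouping + aggregating: one row per foundation year,
# with summary statistics of the 'employee' column
per_year = df.groupby("foundation")["employee"].agg(["sum", "mean", "min", "max"])

print(sorted_df)
print(per_year)
```

If the goal is just to see the companies ordered by founding year, the sort is enough; groupby only becomes useful once you want a per-year summary.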

pandas.DataFrame.sort_values

pandas.DataFrame.groupby
