Home > Back-end >  How to automate re-modelling of a large dataset
How to automate re-modelling of a large dataset

Time:08-24

I have a dataset that's roughly 5200 rows long with an unpleasant structure to use for my use-case.

A small sample of my data is:

| Name | Lvl | Stats | Spawn Weighting |
| -------- | -------------- | --- | --- |
| AdditionalPierceEldritchImplicit1 | 75 | Projectiles Pierce an additional Target | no_tier_6_eldritch_implicit 0 |
||||gloves 500|
||||default 0|
| AdditionalPierceEldritchImplicit2 | 75 | Projectiles Pierce an additional Target | no_tier_5_eldritch_implicit 0 |
||||gloves 500|
||||default 0|
| AdditionalPierceEldritchImplicit3 | 75 | Projectiles Pierce an additional Target | no_tier_4_eldritch_implicit 0 |
||||gloves 500|
||||default 0|

A large amount of the dataset follows this structure, but some amount of rows look different, for instance like:

| Name | Lvl | Stats | Spawn Weighting |
| -------- | -------------- | --- | --- |
| AttackDamageEldritchImplicitPinnaclePresence3| 75 | (44-46)% increased Attack Damage | no_tier_6_eldritch_implicit 0 |
||||helmet 200|
||||amulet 200|
||||default 0|
| AttackDamageEldritchImplicitPinnaclePresence4| 75 | (44-46)% increased Attack Damage | no_tier_5_eldritch_implicit 0 |
||||helmet 200|
||||amulet 200|
||||default 0|
| AttackDamageEldritchImplicitPinnaclePresence5| 75 | (44-46)% increased Attack Damage | no_tier_4_eldritch_implicit 0 |
||||helmet 200|
||||amulet 200|
||||default 0|

All of the attributes in the 'spawn weighting' column where the first three columns are empty apply to next non-empty row above. As such I'd like to display them in a wide format such as:

| Name | Lvl | Stats | Tier | Weight1 | Weight2 | Default |
| -------- | -------------- | --- | --- | --- | ---| ---|
| AdditionalPierceEldritchImplicit1 | 75 | Projectiles Pierce an additional Target | no_tier_6_eldritch_implicit 0 | gloves 500 ||default 0 |
| AdditionalPierceEldritchImplicit2 | 75 | Projectiles Pierce an additional Target | no_tier_5_eldritch_implicit 0 |gloves 500||default 0 |
| AdditionalPierceEldritchImplicit3 | 75 | Projectiles Pierce an additional Target | no_tier_4_eldritch_implicit 0 |gloves 500||default 0 |
| AttackDamageEldritchImplicitPinnaclePresence3| 75 | (44-46)% increased Attack Damage | no_tier_6_eldritch_implicit 0 |helmet 200|amulet 200|default 0|
| AttackDamageEldritchImplicitPinnaclePresence4| 75 | (44-46)% increased Attack Damage | no_tier_5_eldritch_implicit 0 |helmet 200|amulet 200|default 0|
| AttackDamageEldritchImplicitPinnaclePresence5| 75 | (44-46)% increased Attack Damage | no_tier_4_eldritch_implicit 0 |helmet 200|amulet 200|default 0|

My problem is that the attributes are variable between 2 and 3 empty rows beneath each column (none are less than 2 or more than 3).

I've looked into Excels' (where I've currently got the dataset) Text to Columns feature which quickly turned out to be the wrong place to look, and I'm currently trying to use SQLs query function to import the data from a .csv and try and transform it before loading it in. I am however, very inexperienced with this and I'm hoping someone with more experience with this can show me in the right direction. If the solution would be better achieved with a different tool, I wouldn't mind.

I've made an export of the dataset as csv in plaintext to pastebin, in case that helps: https://pastebin.com/ufQvDicW

EDIT: Apologies, because stackoverflow thinks my tables' contents look like code I had to format them like this instead of as tables.

CodePudding user response:

This seems to work in powerquery. Its mostly long to move over the default into a separate column if needed. You can paste it into home...advanced editor ... changing the path to your file

let  Source = Csv.Document(File.Contents("C:\Temp\test.csv"),[Delimiter=";", Columns=4, Encoding=1252, QuoteStyle=QuoteStyle.None]),
#"Promoted Headers" = Table.PromoteHeaders(Source, [PromoteAllScalars=true]),
#"Changed Type" = Table.TransformColumnTypes(#"Promoted Headers",{{"Name", type text}, {"Lvl", Int64.Type}, {"Stats", type text}, {"Spawn Weighting", type text}}),
#"Added Index" = Table.AddIndexColumn(#"Changed Type", "Index", 0, 1, Int64.Type),
#"Added Custom" = Table.AddColumn(#"Added Index", "Custom", each if [Name]="" or [Name]=null then null else [Index]),
#"Filled Down" = Table.FillDown(#"Added Custom",{"Custom"}),
#"Removed Columns" = Table.RemoveColumns(#"Filled Down",{"Index"}),
#"Unpivoted Other Columns" = Table.UnpivotOtherColumns(#"Removed Columns", {"Custom"}, "Attribute", "Value"),
#"Filtered Rows" = Table.SelectRows(#"Unpivoted Other Columns", each ([Value] <> "")),
#"Grouped Rows" = Table.Group(#"Filtered Rows", {"Custom"}, {{"data", each _[Value] }}),
#"Added Custom1" = Table.AddColumn(#"Grouped Rows", "Custom.1", each
        Table.AddIndexColumn(
            Table.FromList(
                List.Transform(     
                    if List.Count([data])=6 then List.Combine({List.FirstN([data],5),{null},List.LastN([data],1)}) else [data],
                each Text.From(_)))
        , "Index2", 0, 1, Int64.Type)
 ),
#"Removed Columns1" = Table.RemoveColumns(#"Added Custom1",{"data"}),
#"Expanded Custom.1" = Table.ExpandTableColumn(#"Removed Columns1", "Custom.1", {"Column1", "Index2"}, {"Column1", "Index2"}),
#"Pivoted Column" = Table.Pivot(Table.TransformColumnTypes(#"Expanded Custom.1", {{"Index2", type text}}, "en-US"), List.Distinct(Table.TransformColumnTypes(#"Expanded Custom.1", {{"Index2", type text}}, "en-US")[Index2]), "Index2", "Column1"),
#"Removed Columns2" = Table.RemoveColumns(#"Pivoted Column",{"Custom"})
in #"Removed Columns2"

enter image description here

  • Related