Forest
Summary
A simple collection class to aggregate tree objects. It takes Adjacency List as input, shows some stats and the top x biggest trees. This is a simple wrapper on top of rubytree for now.
Why ?
Most database tables have some hierarchy related data (eg: who's your boss, who invited you to join, etc) without even realising it in adjacency list format. Aggregating info from these trees are a bit difficult if the depth of each trees are not even. Also, most trees only have 0 or 1 node attached, which need to be filtered out before rendering. That's why I created a simple wrapper to extract only trees (top x biggest trees) which are interesting enough to render.
Usage
forest [filename]
Input file format
"parent key", "child key", "other attributes"(optional)
- If "child key" is either nil, NULL, or "", then the row becomes a root tree
- If "other attributes" are specified, they will be showed at tree diagram as optional information.
- Example sql to generate the input csv "select parent_id, id, name from foo"
Input file example
examples/input.csv
1,,foo,foo2,foo3
2,1
13,2
3,1,bar
4,3
5,3
6,3
10,9
9,8
8,7
7,
11,
14,12
12,
16,15
15,
17,15
18,15
19,15
20,15
This will form the following tree hierarchy.
1 - 2 - 13
- 3 - 4
- 5
- 6
7 - 8 - 9 - 10
11
12 - 14
15 - 16
- 17
- 18
- 19
- 20
And here is the output.
forest examples/input.csv
Total 20: Average :4.0 Max size :7 Max height :4 Max width :5, Sandard Deviation 2.28035085019828
Top 3 trees
#1, sum 7, height 2
* 1 foo,foo2,foo3
|---+ 2
| +---> 13
+---+ 3 bar
|---> 4
|---> 5
+---> 6
Node Name: 2 Content: Parent: 1 Children: 1 Total Nodes: 2
Node Name: 3 Content: bar Parent: 1 Children: 3 Total Nodes: 4
#2, sum 6, height 1
* 15
|---> 16
|---> 17
|---> 18
|---> 19
+---> 20
Node Name: 16 Content: Parent: 15 Children: 0 Total Nodes: 1
Node Name: 17 Content: Parent: 15 Children: 0 Total Nodes: 1
Node Name: 18 Content: Parent: 15 Children: 0 Total Nodes: 1
Node Name: 19 Content: Parent: 15 Children: 0 Total Nodes: 1
Node Name: 20 Content: Parent: 15 Children: 0 Total Nodes: 1
#3, sum 4, height 3
* 7
+---+ 8
+---+ 9
+---> 10
Node Name: 8 Content: Parent: 7 Children: 1 Total Nodes: 3
#4, sum 2, height 1
* 12
+---> 14
Node Name: 14 Content: Parent: 12 Children: 0 Total Nodes: 1
#5, sum 1, height 0
* 11
Performance.
I tried parsing about 80000 rows. It used to take about 20 min, but now it takes 40 sec with 200mb memory space. The result may very depending on how deep your each tree is.
TODO (or my wish list)
- Get rid of "Node Name" output (Annoying)
- Add more aggregate functions
- Add more filtering functions (eg: show trees which has depth of more than 3)
- Create a conversion program to draw on Graphviz diagram
- Create a conversion program to draw histogram on R
- Create a conversion program to nested set model or materialised path
- Create an adapter to switch between rdbms, nosql, or file system to avoid storing everything in memory.