how to draw a tree diagram in r
Abraham Lincoln once said, "Give me six hours to chop down a tree and I will spend the first four sharpening the axe."
Aunt Margaret used to say, "If you dream of a forest, you'd better learn how to plant a tree."
data.tree says, "No matter if you are a lumberjack or a tree hugger. I will be your sanding block, and I will be your seed."
- Introduction
- Trees
- Trees in R
- Trees in
data.tree
-
data.treebasics- Definitions
- Tree creation
- Node methods
- Climbing a tree (tree navigation)
- Custom attributes
- Printing
- Plotting
- Tree Conversion
- Tree Traversal
-
Get -
Do -
Set -
Traverseand explicit traversal
-
- Advanced Features
-
Aggregate -
Cumulate -
Clone -
Sort -
Prune
-
- Performance Considerations
- CPU
- Memory
Introduction
Trees
Trees are ubiquitous in mathematics, computer science, data sciences, finance, and in many other attributes. Trees are especially useful when we are facing hierarchical data. For example, trees are used:
- in decision theory (cf. decision trees)
- in machine learning (e.g. classification trees)
- in finance, e.g. to classify financial instruments into asset classes
- in routing algorithms
- in computer science and programming (e.g. binary search trees, XML)
- e.g. for family trees
For more details, see the applications vignette by typing vignette("applications", package = "data.tree")
Trees in R
Tree-like structures are already used in R. For example, environments can be seen as nodes in a tree. And CRAN provides numerous packages that deal with tree-like structures, especially in the area of decision theory. Yet, there is no general purpose hierarchical data structure that could be used as conveniently and generically as, say, data.frame.
As a result, people often try to resolve hierarchical problems in a tabular fashion, for instance with data.frames. But often, hierarchies don't marry with tables, and various workarounds are usually required.
Trees in data.tree
This package offers an alternative. The data.tree package lets you create hierarchies, called data.tree structures. The building block of theses structures are Node objects. The package provides basic traversal, search, and sort operations, and an infrastructure for recursive tree programming. You can decorate Nodes with your own attributes and methods, so as to extend the package to your needs.
The package also provides convenience methods for neatly printing and plotting trees. It supports conversion from and to data.frames, lists, and other tree structures such as dendrogram, phylo objects from the ape package, igraph, and other packages.
Technically, data.tree structures are bi-directional, ordered trees. Bi-directional means that you can navigate from parent to children and vice versa. Ordered means that the sort order of the children of a parent node is well-defined.
data.tree basics
Definitions
-
data.treestructure: a tree, consisting of multipleNodeobjects. Often, the entry point to adata.treestructure is the root Node -
Node: both a class and the basic building block ofdata.treestructures - attribute: an active, a field, or a method. **Not to be confused with standard R attributes, c.f.
?attr, which have a different meaning. Many methods and functions have anattributearg, which can refer to a an active, a field or a method. For example, see?Get - active (sometimes called property): a field on a
Nodethat can be called like an attribute, but behaves like a function without arguments. For example:node$position - field: a named value on a
Node, e.g.node$cost <- 2500 - method: a function acting on an object (on a
Nodein this context). Many methods are available in OO style (e.g.node$Revert()) or in traditional style (Revert(node)) - inheritance: in this context, inheritance refers to a situation in which a child
Nodeinherits e.g. an attribute from one of its ancestors. For example, see?Get,?SetNodeStyle
Tree creation
There are different ways to create a data.tree structure. For example, you can create a tree programmatically, by conversion from other R objects, or from a file.
Create a tree programmatically
Let's start by creating a tree programmatically. We do this by creating Node objects, and linking them together so as to define the parent-child relationships.
In this example, we are looking at a company, Acme Inc., and the tree reflects its organisational structure. The root (level 1) is the company. On level 2, the nodes represent departments, and the leaves of the tree represent projects that the company is considering for next year:
library(data.tree) acme <- Node$new("Acme Inc.") accounting <- acme$AddChild("Accounting") software <- accounting$AddChild("New Software") standards <- accounting$AddChild("New Accounting Standards") research <- acme$AddChild("Research") newProductLine <- research$AddChild("New Product Line") newLabs <- research$AddChild("New Labs") it <- acme$AddChild("IT") outsource <- it$AddChild("Outsource") agile <- it$AddChild("Go agile") goToR <- it$AddChild("Switch to R") print(acme) ## levelName ## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards ## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R As you can see from the previous example, each Node is identified by its name, i.e. the argument you pass into the Node$new(name) constructor. The name needs to be unique among siblings, such that paths to Nodes are unambiguous.
Node inherits from R6 reference class. This has the following implications:
- You can call methods on a
Nodein OO style, e.g.acme$Get("name") -
Nodeexhibits reference semantics. Thus, multiple variables in R can point to the sameNode, and modifying aNodewill modify it for all referencing variables. In the above code example, bothacme$ITanditreference the same object. This is different from the value semantics, which is much more widely used in R.
Create a tree from a data.frame
Creating a tree programmatically is useful especially in the context of algorithms. However, most times you will create a tree by conversion. This could be by conversion from a nested list-of-lists, by conversion from another R tree-structure (e.g. an ape phylo), or by conversion from a data.frame. For more details on all the options, type ?as.Node and refer to the See Also section.
One of the most common conversions is the one from a data.frame in table format. The following code illustrates this. We load the GNI2014 data from the treemap package. This data.frame is in table format, meaning that each row will represent a leaf in the data.tree structure:
library(treemap) data(GNI2014) head(GNI2014) ## iso3 country continent population GNI ## 3 BMU Bermuda North America 67837 106140 ## 4 NOR Norway Europe 4676305 103630 ## 5 QAT Qatar Asia 833285 92200 ## 6 CHE Switzerland Europe 7604467 88120 ## 7 MAC Macao SAR, China Asia 559846 76270 ## 8 LUX Luxembourg Europe 491775 75990 Let's convert that into a data.tree structure! We start by defining a pathString. The pathString describes the hierarchy by defining a path from the root to each leaf. In this example, the hierarchy comes very naturally:
GNI2014$pathString <- paste("world", GNI2014$continent, GNI2014$country, sep = "/") Once our pathString is defined, conversion to Node is very easy:
population <- as.Node(GNI2014) print(population, "iso3", "population", "GNI", limit = 20) ## levelName iso3 population GNI ## 1 world NA NA ## 2 ¦--North America NA NA ## 3 ¦ ¦--Bermuda BMU 67837 106140 ## 4 ¦ ¦--United States USA 313973000 55200 ## 5 ¦ ¦--Canada CAN 33487208 51630 ## 6 ¦ ¦--Bahamas, The BHS 309156 20980 ## 7 ¦ ¦--Trinidad and Tobago TTO 1310000 20070 ## 8 ¦ ¦--Puerto Rico PRI 3971020 19310 ## 9 ¦ ¦--Barbados BRB 284589 15310 ## 10 ¦ ¦--St. Kitts and Nevis KNA 40131 14920 ## 11 ¦ ¦--Antigua and Barbuda ATG 85632 13300 ## 12 ¦ ¦--Panama PAN 3360474 11130 ## 13 ¦ ¦--Costa Rica CRI 4253877 10120 ## 14 ¦ ¦--Mexico MEX 111211789 9870 ## 15 ¦ ¦--Grenada GRD 90739 7910 ## 16 ¦ ¦--St. Lucia LCA 160267 7260 ## 17 ¦ ¦--Dominica DMA 72660 6930 ## 18 ¦ ¦--St. Vincent and the Grenadines VCT 104574 6610 ## 19 ¦ ¦--Dominican Republic DOM 9650054 6040 ## 20 ¦ °--... 7 nodes w/ 0 sub NA NA ## 21 °--... 6 nodes w/ 171 sub NA NA This is a simple example, and more options are available. Type ?FromDataFrameTable for all the details.
Create a tree from a file
Often, trees are created from one of many file formats. When developing this package, We opted for a multi-step approach, meaning that you first import the file into one of the well-known R data structures. Then you convert these into a data.tree structure. For example, typical import patterns could be:
- csv -> data.frame in table format (
?read.csv) -> data.tree (?as.Node.data.frame) - Newick -> ape phylo (
?ape::read.tree) -> data.tree (?as.Node.phylo) - csv -> data.frame in network format (
?read.csv) -> data.tree (c.f.?FromDataFrameNetwork) - yaml -> list of lists (
?yaml::yaml.load) -> data.tree (?as.Node.list) - json -> list of lists (e.g.
?jsonlite::fromJSON) -> data.tree (?as.Node.list)
If you have a choice, we recommend you consider yaml format to store and share your hierarchies. It is concise, human-readable, and very easy to convert to a data.tree. An example is provided here for illustration. The data represents what platforms and OS versions a group of students use:
library(yaml) yaml <- " name: OS Students 2014/15 OS X: Yosemite: users: 16 Leopard: users: 43 Linux: Debian: users: 27 Ubuntu: users: 36 Windows: W7: users: 31 W8: users: 32 W10: users: 4 " osList <- yaml.load(yaml) osNode <- as.Node(osList) print(osNode, "users") ## levelName users ## 1 OS Students 2014/15 NA ## 2 ¦--OS X NA ## 3 ¦ ¦--Yosemite 16 ## 4 ¦ °--Leopard 43 ## 5 ¦--Linux NA ## 6 ¦ ¦--Debian 27 ## 7 ¦ °--Ubuntu 36 ## 8 °--Windows NA ## 9 ¦--W7 31 ## 10 ¦--W8 32 ## 11 °--W10 4 Node methods
As seen above, a data.tree structure is composed of Node objects, and the entry point to a data.tree structure is always a Node, often the root Node of a tree.
There are different types of methods:
- OO-style actives (sometimes called properties) on
Nodes, such as e.g.Node$isRoot - OO-style methods on
Nodes, such as e.g.Node$AddChild(name) - Classical R methods, such as e.g.
Clone(node).
Actives Examples (aka Properties)
Actives look and feel like attributes, but they are dynamically evaluated. They are documented in the Node documentation, which is accessed by typing ?Node.
Remember our population example:
print(population, limit = 15) ## levelName ## 1 world ## 2 ¦--North America ## 3 ¦ ¦--Bermuda ## 4 ¦ ¦--United States ## 5 ¦ ¦--Canada ## 6 ¦ ¦--Bahamas, The ## 7 ¦ ¦--Trinidad and Tobago ## 8 ¦ ¦--Puerto Rico ## 9 ¦ ¦--Barbados ## 10 ¦ ¦--St. Kitts and Nevis ## 11 ¦ ¦--Antigua and Barbuda ## 12 ¦ ¦--Panama ## 13 ¦ ¦--Costa Rica ## 14 ¦ ¦--Mexico ## 15 ¦ °--... 12 nodes w/ 0 sub ## 16 °--... 6 nodes w/ 176 sub population$isRoot ## [1] TRUE population$height ## [1] 3 population$count ## [1] 7 population$totalCount ## [1] 196 population$attributes ## character(0) population$attributesAll ## [1] "GNI" "continent" "country" "iso3" "population" population$averageBranchingFactor ## [1] 24.375 The naming convention of the package is that attributes and actives are lower case, whereas methods are upper / CamelCase. RStudio and other IDEs work well with data.tree. If you have a Node, simply type myNode$ + SPACE to get a list of available attributes, actives and methods.
OO-Style Methods Examples
Examples of OO-Style methods
You will find more information on these examples below.
Get will traverse the tree and collect specific values for the Nodes it traverses:
sum(population$Get("population", filterFun = isLeaf)) ## [1] 6683146875 Prune traverses the tree and keeps only the subtrees for which the pruneFun returns TRUE.
Prune(population, pruneFun = function(x) !x$isLeaf || x$population > 1000000) ## [1] 39 Note that the Prune function has side-effects, as it acts on the original population object. The population sum is now smaller:
sum(population$Get("population", filterFun = isLeaf), na.rm = TRUE) ## [1] 6669737814 Traditional R Methods
popClone <- Clone(acme) Traditional S3 generics are available especially for conversion:
as.data.frame(acme) ## levelName ## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards ## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R Though there is also a more specialised non-generic version:
ToDataFrameNetwork(acme) ## from to ## 1 Acme Inc. Accounting ## 2 Acme Inc. Research ## 3 Acme Inc. IT ## 4 Accounting New Software ## 5 Accounting New Accounting Standards ## 6 Research New Product Line ## 7 Research New Labs ## 8 IT Outsource ## 9 IT Go agile ## 10 IT Switch to R Custom attributes
Just as with, say, a list, we can add any custom field to any Node in a data.tree structure. Let's go back to our acme company:
acme ## levelName ## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software ## 4 ¦ °--New Accounting Standards ## 5 ¦--Research ## 6 ¦ ¦--New Product Line ## 7 ¦ °--New Labs ## 8 °--IT ## 9 ¦--Outsource ## 10 ¦--Go agile ## 11 °--Switch to R We now add costs and probabilities to the projects in each department:
acme$Accounting$`New Software`$cost <- 1000000 acme$Accounting$`New Accounting Standards`$cost <- 500000 acme$Research$`New Product Line`$cost <- 2000000 acme$Research$`New Labs`$cost <- 750000 acme$IT$Outsource$cost <- 400000 acme$IT$`Go agile`$cost <- 250000 acme$IT$`Switch to R`$cost <- 50000 acme$Accounting$`New Software`$p <- 0.5 acme$Accounting$`New Accounting Standards`$p <- 0.75 acme$Research$`New Product Line`$p <- 0.25 acme$Research$`New Labs`$p <- 0.9 acme$IT$Outsource$p <- 0.2 acme$IT$`Go agile`$p <- 0.05 acme$IT$`Switch to R`$p <- 1 print(acme, "cost", "p") ## levelName cost p ## 1 Acme Inc. NA NA ## 2 ¦--Accounting NA NA ## 3 ¦ ¦--New Software 1000000 0.50 ## 4 ¦ °--New Accounting Standards 500000 0.75 ## 5 ¦--Research NA NA ## 6 ¦ ¦--New Product Line 2000000 0.25 ## 7 ¦ °--New Labs 750000 0.90 ## 8 °--IT NA NA ## 9 ¦--Outsource 400000 0.20 ## 10 ¦--Go agile 250000 0.05 ## 11 °--Switch to R 50000 1.00 Note that there is a list of reserved names you cannot use as Node attributes:
NODE_RESERVED_NAMES_CONST ## [1] "AddChild" "AddChildNode" "AddSibling" ## [4] "AddSiblingNode" "attributes" "attributesAll" ## [7] "averageBranchingFactor" "children" "Climb" ## [10] "Navigate" "FindNode" "clone" ## [13] "count" "Do" "fields" ## [16] "fieldsAll" "Get" "GetAttribute" ## [19] "height" "initialize" "isBinary" ## [22] "isLeaf" "isRoot" "leafCount" ## [25] "leaves" "level" "levelName" ## [28] "name" "parent" "path" ## [31] "pathString" "position" "Prune" ## [34] "Revert" "RemoveAttribute" "RemoveChild" ## [37] "root" "Set" "siblings" ## [40] "Sort" "totalCount" ".*" Custom attributes in constructor
An alternative, often convenient way to assign custom attributes is in the constructor, or in the Node$AddChild method:
birds <- Node$new("Aves", vulgo = "Bird") birds$AddChild("Neognathae", vulgo = "New Jaws", species = 10000) birds$AddChild("Palaeognathae", vulgo = "Old Jaws", species = 60) print(birds, "vulgo", "species") ## levelName vulgo species ## 1 Aves Bird NA ## 2 ¦--Neognathae New Jaws 10000 ## 3 °--Palaeognathae Old Jaws 60 Custom attributes as function
Nothing stops you from setting a function as a field. This calculates a value dynamically, i.e. whenever a field is accessed in tree traversal. For example, you can add a new Node to your structure, and the function will reflect this. Think of this as a hierarchical spreadsheet, in which you can set formulas into cells.
Consider the following example:
birds$species <- function(self) sum(sapply(self$children, function(x) x$species)) print(birds, "species") ## levelName species ## 1 Aves 10060 ## 2 ¦--Neognathae 10000 ## 3 °--Palaeognathae 60 data.tree maps the self argument to the Node at hand. Thus, you must name the argument self.
Now, let's assume we discover a new species. Then, the species on the root adjusts dynamically:
birds$Palaeognathae$species <- 61 print(birds, "species") ## levelName species ## 1 Aves 10061 ## 2 ¦--Neognathae 10000 ## 3 °--Palaeognathae 61 This, together with the Set method and recursion, becomes a very powerful tool, as we'll see later.
Printing
Basic Printing
Basic printing is easy, as you surely have noted in the previous sections. print displays a tree in a tree-grid view. On the left, you have the hierarchy. Then you have a column per variable you want to print:
print(acme, "cost", "p") ## levelName cost p ## 1 Acme Inc. NA NA ## 2 ¦--Accounting NA NA ## 3 ¦ ¦--New Software 1000000 0.50 ## 4 ¦ °--New Accounting Standards 500000 0.75 ## 5 ¦--Research NA NA ## 6 ¦ ¦--New Product Line 2000000 0.25 ## 7 ¦ °--New Labs 750000 0.90 ## 8 °--IT NA NA ## 9 ¦--Outsource 400000 0.20 ## 10 ¦--Go agile 250000 0.05 ## 11 °--Switch to R 50000 1.00 For more advanced printing, you have a few options.
Formatters
You can use formatters to output a variable in a certain way. You can use formatters in two ways:
- You can set them on a
Nodeusing theSetFormatmethod. If you do this, then the formatter will be picked up as a default formatter whenever youprint,Get, convert todata.frame, etc. Formatters can be set on anyNodein adata.treestructure act on any descendant. So you can overwrite a formatter for a sub-tree. - You can add an explicit ad-hoc formatter to the
Getmethod (see below). This will overwrite default formatters previously set via theSetFormatmethod. You can also set the formatter toidentityto void a default formatter.
Setting a formatter using the SetFormat method:
SetFormat(acme, "p", formatFun = FormatPercent) SetFormat(acme, "cost", formatFun = function(x) FormatFixedDecimal(x, digits = 2)) print(acme, "cost", "p") ## levelName cost p ## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Software 1000000.00 50.00 % ## 4 ¦ °--New Accounting Standards 500000.00 75.00 % ## 5 ¦--Research ## 6 ¦ ¦--New Product Line 2000000.00 25.00 % ## 7 ¦ °--New Labs 750000.00 90.00 % ## 8 °--IT ## 9 ¦--Outsource 400000.00 20.00 % ## 10 ¦--Go agile 250000.00 5.00 % ## 11 °--Switch to R 50000.00 100.00 % Printing using Get
Formatting with the Get method overwrites any formatters found along the path:
data.frame(cost = acme$Get("cost", format = function(x) FormatFixedDecimal(x, 2)), p = acme$Get("p", format = FormatPercent)) ## cost p ## Acme Inc. ## Accounting ## New Software 1000000.00 50.00 % ## New Accounting Standards 500000.00 75.00 % ## Research ## New Product Line 2000000.00 25.00 % ## New Labs 750000.00 90.00 % ## IT ## Outsource 400000.00 20.00 % ## Go agile 250000.00 5.00 % ## Switch to R 50000.00 100.00 % Plotting
plot
data.tree is mainly a data structure. As it is easy to convert data.tree structures to other formats, you have access to a large number of tools to plot a data.tree structure. For example, you can plot a data.tree structure as a dendrogram, as an ape tree, as a treeview, etc. Additionally, data.tree also provides its own plotting facility. It is built on GraphViz/DiagrammeR, and you can access these features via the plot and ToGraphViz functions. Note that DiagrammeR is not required to use data.tree, so plot only works if DiagrammeR is installed on your system. For example:
plot(acme)
acme
Styling
Similar to formatters for printing, you can style your tree and store the styling directly in the tree, for later use:
SetGraphStyle(acme, rankdir = "TB") SetEdgeStyle(acme, arrowhead = "vee", color = "grey35", penwidth = 2) SetNodeStyle(acme, style = "filled,rounded", shape = "box", fillcolor = "GreenYellow", fontname = "helvetica", tooltip = GetDefaultTooltip) SetNodeStyle(acme$IT, fillcolor = "LightBlue", penwidth = "5px") plot(acme)
acme
For details on the styling attributes, see http://graphviz.org/Documentation.php .
Note that, by default, most Node style attributes will be inherited. Though, for example, label will not be inherited. However, inheritance can be avoided for all style attributes, as for the Accounting node in the following example:
SetNodeStyle(acme$Accounting, inherit = FALSE, fillcolor = "Thistle", fontcolor = "Firebrick", tooltip = "This is the accounting department") plot(acme)
acme
Use Do to set style on specific nodes:
Do(acme$leaves, function(node) SetNodeStyle(node, shape = "egg")) plot(acme)
acme
Other Visualisations
However, there are also endless other possibilities to visualise data.tree structures. There are more examples in the applications vignette. Type vignette('applications', package = "data.tree").
Dendrogram
For example, using dendrogram:
plot(as.dendrogram(CreateRandomTree(nodes = 20)), center = TRUE)
igraph
Or, using igraph:
library(igraph) plot(as.igraph(acme, directed = TRUE, direction = "climb"))
networkD3
Or, using networkD3: (you can actually touch these thingies and drag them around, don't be shy!)
library(networkD3) acmeNetwork <- ToDataFrameNetwork(acme, "name") simpleNetwork(acmeNetwork[-3], fontSize = 12) Another example, which at the same time shows conversion from csv:
fileName <- system.file("extdata", "useR15.csv", package="data.tree") useRdf <- read.csv(fileName, stringsAsFactors = FALSE) #define the hierarchy (Session/Room/Speaker) useRdf$pathString <- paste("useR", useRdf$session, useRdf$room, useRdf$speaker, sep="|") #convert to Node useRtree <- as.Node(useRdf, pathDelimiter = "|") #plot with networkD3 useRtreeList <- ToListExplicit(useRtree, unname = TRUE) radialNetwork( useRtreeList) Tree Conversion
In order to take advantage of the R eco-system, you can convert your data.tree structure to other oft-used data types. The general rule is that, for each target type, there is a one-does-it-all generics, and a few more specialised conversion functions. For example, in order to convert a data.tree to a data.frame, you can either use as.data.frame.Node, or ToDataFrameTree, ToDataFrameTable, or ToDataFrameNetwork. The documentation for all of these variations is accessible via ?as.data.frame.Node.
Converting to data.frame
As you saw just above, creating a data.frame is easy.
Again, note that we always call such methods on the root Node of a data.tree structure, or on the root Node of a subtree:
acmedf <- as.data.frame(acme) as.data.frame(acme$IT) ## levelName ## 1 IT ## 2 ¦--Outsource ## 3 ¦--Go agile ## 4 °--Switch to R The same can be achieved by using the more specialised method:
ToDataFrameTree(acme) We can also add field values of the Nodes as columns to the data.frame:
ToDataFrameTree(acme, "level", "cost") ## levelName level cost ## 1 Acme Inc. 1 NA ## 2 ¦--Accounting 2 NA ## 3 ¦ ¦--New Software 3 1000000 ## 4 ¦ °--New Accounting Standards 3 500000 ## 5 ¦--Research 2 NA ## 6 ¦ ¦--New Product Line 3 2000000 ## 7 ¦ °--New Labs 3 750000 ## 8 °--IT 2 NA ## 9 ¦--Outsource 3 400000 ## 10 ¦--Go agile 3 250000 ## 11 °--Switch to R 3 50000 Note that it is not required that the field is set on each and every Node.
Other data frame conversions are:
ToDataFrameTable(acme, "pathString", "cost") ## pathString cost ## 1 Acme Inc./Accounting/New Software 1000000 ## 2 Acme Inc./Accounting/New Accounting Standards 500000 ## 3 Acme Inc./Research/New Product Line 2000000 ## 4 Acme Inc./Research/New Labs 750000 ## 5 Acme Inc./IT/Outsource 400000 ## 6 Acme Inc./IT/Go agile 250000 ## 7 Acme Inc./IT/Switch to R 50000 ToDataFrameNetwork(acme, "cost") ## from to cost ## 1 Acme Inc. Accounting NA ## 2 Acme Inc. Research NA ## 3 Acme Inc. IT NA ## 4 Accounting New Software 1000000 ## 5 Accounting New Accounting Standards 500000 ## 6 Research New Product Line 2000000 ## 7 Research New Labs 750000 ## 8 IT Outsource 400000 ## 9 IT Go agile 250000 ## 10 IT Switch to R 50000 And, finally, we can also put attributes of our nodes in a column, based on a type discriminator. This sounds more complicated then what it is. Consider the default discriminator, level:
ToDataFrameTypeCol(acme, 'cost') ## level_1 level_2 level_3 cost ## 1 Acme Inc. Accounting New Software 1000000 ## 2 Acme Inc. Accounting New Accounting Standards 500000 ## 3 Acme Inc. Research New Product Line 2000000 ## 4 Acme Inc. Research New Labs 750000 ## 5 Acme Inc. IT Outsource 400000 ## 6 Acme Inc. IT Go agile 250000 ## 7 Acme Inc. IT Switch to R 50000 Let's look at a somewhat more advanced example. First, let's assume that for the outsourcing project, we have two separate possibilities: Outsourcing to India or outsourcing to Poland:
acme$IT$Outsource$AddChild("India") acme$IT$Outsource$AddChild("Poland") Now, with this slightly more complex tree structure, the level is not a usefully discriminator anymore, because some projects are in level 3, while the new projects are in level 4. For this reason, we introduce a type field on our node objects: A node type can be a company (root only), a department (Accounting, Research, and IT), a program (Oursource), and a project (the rest, i.e. all the leaves):
acme$Set(type = c('company', 'department', 'project', 'project', 'department', 'project', 'project', 'department', 'program', 'project', 'project', 'project', 'project')) Our tree now looks like this:
print(acme, 'type') ## levelName type ## 1 Acme Inc. company ## 2 ¦--Accounting department ## 3 ¦ ¦--New Software project ## 4 ¦ °--New Accounting Standards project ## 5 ¦--Research department ## 6 ¦ ¦--New Product Line project ## 7 ¦ °--New Labs project ## 8 °--IT department ## 9 ¦--Outsource program ## 10 ¦ ¦--India project ## 11 ¦ °--Poland project ## 12 ¦--Go agile project ## 13 °--Switch to R project We can now create a data.frame in which we have one column per distinct type value. Namely, a company column, a department column, a program column, and a project column. Note that the columns are not hardcoded, but derived dynamically from your data in the tree structure:
ToDataFrameTypeCol(acme, type = 'type', prefix = NULL) ## company department program project ## 1 Acme Inc. Accounting <NA> New Software ## 2 Acme Inc. Accounting <NA> New Accounting Standards ## 3 Acme Inc. Research <NA> New Product Line ## 4 Acme Inc. Research <NA> New Labs ## 5 Acme Inc. IT Outsource India ## 6 Acme Inc. IT Outsource Poland ## 7 Acme Inc. IT <NA> Go agile ## 8 Acme Inc. IT <NA> Switch to R Converting to List of Lists
List of lists are useful for various use cases:
- as an intermediate step in converting to JSON, XML, YAML
- for functions that take a lol as an input. This is especially the case for visualisations and charts, e.g with many html widgets
- to save a
data.treestructure as an R object (see performance considerations below)
data(acme) str(as.list(acme$IT)) ## List of 3 ## $ Outsource :List of 2 ## ..$ cost: num 4e+05 ## ..$ p : num 0.2 ## $ Go agile :List of 2 ## ..$ cost: num 250000 ## ..$ p : num 0.05 ## $ Switch to R:List of 2 ## ..$ cost: num 50000 ## ..$ p : num 1 str(ToListExplicit(acme$IT, unname = FALSE, nameName = "id", childrenName = "dependencies")) ## List of 2 ## $ id : chr "IT" ## $ dependencies:List of 3 ## ..$ Outsource :List of 3 ## .. ..$ id : chr "Outsource" ## .. ..$ cost: num 4e+05 ## .. ..$ p : num 0.2 ## ..$ Go agile :List of 3 ## .. ..$ id : chr "Go agile" ## .. ..$ cost: num 250000 ## .. ..$ p : num 0.05 ## ..$ Switch to R:List of 3 ## .. ..$ id : chr "Switch to R" ## .. ..$ cost: num 50000 ## .. ..$ p : num 1 Converting to other objects
There are also conversions to igraph objects, to phylo / ape, to dendrogram, and others. For details, see ?as.phylo.Node, ?as.dendrogram.Node, ?as.igraph.Node.
Tree Traversal
Tree traversal is one of the core concepts of trees. See, for example, here: Tree Traversal on Wikipedia.
Get
The Get method traverses the tree and collects values from each node. It then returns a vector or a list, containing the collected values.
Additional features of the Get method are:
- execute a function on each node, and append the function's result to the returned vector
- execute a
Nodemethod on each node, and append the method's return value to the returned vector
Traversal order
The Get method can traverse the tree in various ways. This is called traversal order.
Pre-Order
The default traversal mode is pre-order.
pre-order
This is what is used e.g. in print:
print(acme, "level") ## levelName level ## 1 Acme Inc. 1 ## 2 ¦--Accounting 2 ## 3 ¦ ¦--New Software 3 ## 4 ¦ °--New Accounting Standards 3 ## 5 ¦--Research 2 ## 6 ¦ ¦--New Product Line 3 ## 7 ¦ °--New Labs 3 ## 8 °--IT 2 ## 9 ¦--Outsource 3 ## 10 ¦--Go agile 3 ## 11 °--Switch to R 3 Post-Order
The post-order traversal mode returns children first, returning parents only after all its children have been traversed and returned:
post-order
We can use it like this on the Get method:
acme$Get('level', traversal = "post-order") ## New Software New Accounting Standards Accounting ## 3 3 2 ## New Product Line New Labs Research ## 3 3 2 ## Outsource Go agile Switch to R ## 3 3 3 ## IT Acme Inc. ## 2 1 This is useful if your parent's value depends on the children, as we'll see below.
Ancestor
This is a non-standard traversal mode that does not traverse the entire tree. Instead, the ancestor mode starts from a Node, then walks the tree along the path from parent to parent, up to the root.
data.frame(level = agile$Get('level', traversal = "ancestor")) ## level ## Go agile 3 ## IT 2 ## Acme Inc. 1 Filter and Prune
You can add a filter and/or a prune function to the Get method. These functions have to take a Node as an input, and return TRUE if the Node should be considered, and FALSE otherwise. The difference between the pruneFun and the filterFun is that filters act only on specific nodes, whereas if the pruneFun returns FALSE, then the entire sub-tree spanned by the Node is ignored.
For example:
acme$Get('name', pruneFun = function(x) x$position <= 2) ## Acme Inc. Accounting ## "Acme Inc." "Accounting" ## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## Research New Product Line ## "Research" "New Product Line" ## New Labs ## "New Labs" There are also some convenient filter functions available in the package, such as isLeaf, isRoot, isNotLeaf, etc.
acme$Get('name', filterFun = isLeaf) ## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## New Product Line New Labs ## "New Product Line" "New Labs" ## Outsource Go agile ## "Outsource" "Go agile" ## Switch to R ## "Switch to R" Attributes
The attribute parameter determines what is collected. This is called attribute, but it should not be confused with R's concept of object attributes (e.g.?attributes). In this context, an attribute can be either:
- the name of a
Nodefield - the name of a
Nodemethod or active - a function, whose first argument must be a Node
Throughout this document, we refer to attribute in this sense.
Field
acme$Get('name') ## Acme Inc. Accounting ## "Acme Inc." "Accounting" ## New Software New Accounting Standards ## "New Software" "New Accounting Standards" ## Research New Product Line ## "Research" "New Product Line" ## New Labs IT ## "New Labs" "IT" ## Outsource Go agile ## "Outsource" "Go agile" ## Switch to R ## "Switch to R" Method
You can pass a standard R function to the Get method (and thus to print, as.data.frame, etc.). The only requirement this function must satisfy is that its first argument be of class Node. Subsequent arguments can be added through the ellipsis (…). For example:
ExpectedCost <- function(node, adjustmentFactor = 1) { return ( node$cost * node$p * adjustmentFactor) } acme$Get(ExpectedCost, adjustmentFactor = 0.9, filterFun = isLeaf) ## New Software New Accounting Standards New Product Line ## 450000 337500 450000 ## New Labs Outsource Go agile ## 607500 72000 11250 ## Switch to R ## 45000 Using recursion
Recursion comes naturally with data.tree, and it is one of its core strengths:
Cost <- function(node) { result <- node$cost if(length(result) == 0) result <- sum(sapply(node$children, Cost)) return (result) } print(acme, "p", cost = Cost) ## levelName p cost ## 1 Acme Inc. NA 4950000 ## 2 ¦--Accounting NA 1500000 ## 3 ¦ ¦--New Software 0.50 1000000 ## 4 ¦ °--New Accounting Standards 0.75 500000 ## 5 ¦--Research NA 2750000 ## 6 ¦ ¦--New Product Line 0.25 2000000 ## 7 ¦ °--New Labs 0.90 750000 ## 8 °--IT NA 700000 ## 9 ¦--Outsource 0.20 400000 ## 10 ¦--Go agile 0.05 250000 ## 11 °--Switch to R 1.00 50000 There is a built-in function that would make this example even simpler: Aggregate. It is explained below.
Do
Do is similar to Get in that it also traverses a tree in a specific traversal order. However, instead of fetching an attribute, it will (surprise!) do something, namely run a function. For example, we can tell the Do method to assign a value to each Node it traverses. This is especially useful if the attribute parameter is a function, as in the previous examples. For instance, we can store the aggregated cost for later use and printing:
acme$Do(function(node) node$cost <- Cost(node), filterFun = isNotLeaf) print(acme, "p", "cost") ## levelName p cost ## 1 Acme Inc. NA 4950000 ## 2 ¦--Accounting NA 1500000 ## 3 ¦ ¦--New Software 0.50 1000000 ## 4 ¦ °--New Accounting Standards 0.75 500000 ## 5 ¦--Research NA 2750000 ## 6 ¦ ¦--New Product Line 0.25 2000000 ## 7 ¦ °--New Labs 0.90 750000 ## 8 °--IT NA 700000 ## 9 ¦--Outsource 0.20 400000 ## 10 ¦--Go agile 0.05 250000 ## 11 °--Switch to R 1.00 50000 Set
The Set method is the counterpart to the Get method. The Set method takes a vector or a single value as an input, and traverses the tree in a certain order. Each Node is assigned a value from the vector, one after the other, recycling.
Assigning values
acme$Set(id = 1:acme$totalCount) print(acme, "id") ## levelName id ## 1 Acme Inc. 1 ## 2 ¦--Accounting 2 ## 3 ¦ ¦--New Software 3 ## 4 ¦ °--New Accounting Standards 4 ## 5 ¦--Research 5 ## 6 ¦ ¦--New Product Line 6 ## 7 ¦ °--New Labs 7 ## 8 °--IT 8 ## 9 ¦--Outsource 9 ## 10 ¦--Go agile 10 ## 11 °--Switch to R 11 The Set method can take multiple vectors as an input, and, optionally, you can define the name of the attribute. Finally, just as for the Get method, the traversal order is important for the Set.
secretaries <- c(3, 2, 8) employees <- c(52, 43, 51) acme$Set(secretaries, emps = employees, filterFun = function(x) x$level == 2) print(acme, "emps", "secretaries", "id") ## levelName emps secretaries id ## 1 Acme Inc. NA NA 1 ## 2 ¦--Accounting 52 3 2 ## 3 ¦ ¦--New Software NA NA 3 ## 4 ¦ °--New Accounting Standards NA NA 4 ## 5 ¦--Research 43 2 5 ## 6 ¦ ¦--New Product Line NA NA 6 ## 7 ¦ °--New Labs NA NA 7 ## 8 °--IT 51 8 8 ## 9 ¦--Outsource NA NA 9 ## 10 ¦--Go agile NA NA 10 ## 11 °--Switch to R NA NA 11 Deleting attributes
The Set method can also be used to assign a single value directly to all Nodes traversed. For example, to remove the avgExpectedCost, we assign NULL on each node, using the fact that the Set recycles:
acme$Set(avgExpectedCost = NULL) However, note that setting a field to NULL will not make it gone for good. You will still see it:
acme$attributesAll ## [1] "avgExpectedCost" "cost" "id" "emps" ## [5] "secretaries" "p" In order remove it completely, you can use the RemoveAttribute method:
acme$Do(function(node) node$RemoveAttribute("avgExpectedCost")) Using Set and function assignment
Earlier, we saw that we can add a function dynamically to a Node. We can, of course, also do this via the Set method
acme$Set(cost = c(function(self) sum(sapply(self$children, function(child) GetAttribute(child, "cost")))), filterFun = isNotLeaf) print(acme, "cost") ## levelName cost ## 1 Acme Inc. 4950000 ## 2 ¦--Accounting 1500000 ## 3 ¦ ¦--New Software 1000000 ## 4 ¦ °--New Accounting Standards 500000 ## 5 ¦--Research 2750000 ## 6 ¦ ¦--New Product Line 2000000 ## 7 ¦ °--New Labs 750000 ## 8 °--IT 700000 ## 9 ¦--Outsource 400000 ## 10 ¦--Go agile 250000 ## 11 °--Switch to R 50000 acme$IT$AddChild("Paperless", cost = 240000) print(acme, "cost") ## levelName cost ## 1 Acme Inc. 5190000 ## 2 ¦--Accounting 1500000 ## 3 ¦ ¦--New Software 1000000 ## 4 ¦ °--New Accounting Standards 500000 ## 5 ¦--Research 2750000 ## 6 ¦ ¦--New Product Line 2000000 ## 7 ¦ °--New Labs 750000 ## 8 °--IT 940000 ## 9 ¦--Outsource 400000 ## 10 ¦--Go agile 250000 ## 11 ¦--Switch to R 50000 ## 12 °--Paperless 240000 Traverse and explicit traversal
Previously, we have used the Get, Set and Do methods in their OO-style version. This is often very convenient for quick access to variables. However, sometimes you want to re-use the same traversal for multiple sequential operations. For this, you can use what is called explicit traversal. It works like so:
traversal <- Traverse(acme, traversal = "post-order", filterFun = function(x) x$level == 2) Set(traversal, floor = c(1, 2, 3)) Do(traversal, function(x) { if (x$floor <= 2) { x$extension <- "044" } else { x$extension <- "043" } }) Get(traversal, "extension") ## Accounting Research IT ## "044" "044" "043" Advanced Features
Aggregate
The Aggregate method provides a shorthand for the oft-used case when a parent is the aggregate of its child values, as seen in the previous example. Aggregate calls a function recursively on children. If a child holds the attribute, that value is returned. Otherwise, the attribute is collected from all children, and aggregated using the aggFun. For example:
Aggregate(node = acme, attribute = "cost", aggFun = sum) ## [1] 5190000 We can also use this in the Get method, of course:
acme$Get(Aggregate, "cost", sum) Note, however, that this is not very efficient: Aggregate will be called twice on, say, IT: Once when the traversal passes IT itself, the second time recursively when Aggregate is called on the root. For this reason, we have the option to store/cache the calculated value along the way. For one thing, this is a convenient way to save an additional Set call in case we want to store the aggregated value. Additionally, it speeds up calculation because Aggregate on an ancestor will use a cached value on a descendant:
acme$Do(function(node) node$cost <- Aggregate(node, attribute = "cost", aggFun = sum), traversal = "post-order") print(acme, "cost") ## levelName cost ## 1 Acme Inc. 5190000 ## 2 ¦--Accounting 1500000 ## 3 ¦ ¦--New Software 1000000 ## 4 ¦ °--New Accounting Standards 500000 ## 5 ¦--Research 2750000 ## 6 ¦ ¦--New Product Line 2000000 ## 7 ¦ °--New Labs 750000 ## 8 °--IT 940000 ## 9 ¦--Outsource 400000 ## 10 ¦--Go agile 250000 ## 11 ¦--Switch to R 50000 ## 12 °--Paperless 240000 Cumulate
In its simplest form, the Cumulate function just sums up an attribute value along siblings, taking into consideration all siblings before the Node on which Cumulate is called:
Cumulate(acme$IT$`Go agile`, "cost", sum) ## [1] 650000 Or, to find the minimum cost among siblings:
Cumulate(acme$IT$`Go agile`, "cost", min) ## [1] 250000 This can be useful in combination with traversal, e.g. to calculate a running sum among siblings. Specifically, the cacheAttribute lets you store the running sum in a field. This not only speeds up calculation, but lets you re-use the calculated values later:
acme$Do(function(node) node$cumCost <- Cumulate(node, attribute = "cost", aggFun = sum)) print(acme, "cost", "cumCost") ## levelName cost cumCost ## 1 Acme Inc. 5190000 5190000 ## 2 ¦--Accounting 1500000 1500000 ## 3 ¦ ¦--New Software 1000000 1000000 ## 4 ¦ °--New Accounting Standards 500000 1500000 ## 5 ¦--Research 2750000 4250000 ## 6 ¦ ¦--New Product Line 2000000 2000000 ## 7 ¦ °--New Labs 750000 2750000 ## 8 °--IT 940000 5190000 ## 9 ¦--Outsource 400000 400000 ## 10 ¦--Go agile 250000 650000 ## 11 ¦--Switch to R 50000 700000 ## 12 °--Paperless 240000 940000 Clone
As stated above, Nodes exhibit reference semantics. If you call, say, Set, then this changes the Nodes in the tree. The changes will be visible for all variables having a reference on the data.tree structure. As a consequence, you might want to "save away" the current state of a structure. To do this, you can Clone an entire tree:
acmeClone <- Clone(acme) acmeClone$name <- "New Acme" # acmeClone does not point to the same reference object anymore: acme$name == acmeClone$name ## [1] FALSE Sort
With the Sort method, you can sort an entire tree, a sub-tree, or children of a specific Node. The method will sort recursively and sort children with respect to a child attribute. As explained earlier, the child attribute can be a function or a method.
Sort(acme, "name") acme ## levelName ## 1 Acme Inc. ## 2 ¦--Accounting ## 3 ¦ ¦--New Accounting Standards ## 4 ¦ °--New Software ## 5 ¦--IT ## 6 ¦ ¦--Go agile ## 7 ¦ ¦--Outsource ## 8 ¦ ¦--Paperless ## 9 ¦ °--Switch to R ## 10 °--Research ## 11 ¦--New Labs ## 12 °--New Product Line Sort(acme, Aggregate, "cost", sum, decreasing = TRUE, recursive = TRUE) print(acme, "cost", aggCost = acme$Get(Aggregate, "cost", sum)) ## levelName cost aggCost ## 1 Acme Inc. 5190000 5190000 ## 2 ¦--Research 2750000 2750000 ## 3 ¦ ¦--New Product Line 2000000 2000000 ## 4 ¦ °--New Labs 750000 750000 ## 5 ¦--Accounting 1500000 1500000 ## 6 ¦ ¦--New Software 1000000 1000000 ## 7 ¦ °--New Accounting Standards 500000 500000 ## 8 °--IT 940000 940000 ## 9 ¦--Outsource 400000 400000 ## 10 ¦--Go agile 250000 250000 ## 11 ¦--Paperless 240000 240000 ## 12 °--Switch to R 50000 50000 Prune
You can prune sub-trees out of a tree, by that removing an entire sub-tree from a tree. There are two variations of this: * temporary pruning, e.g. just for printing: This is the pruneFun parameter, e.g. in Get * side effect or permanent pruning, meaning that you modify your data.tree structure for good. This is achieved with the Prune method.
Consider the following example of permanent pruning:
acme$Do(function(x) x$cost <- Aggregate(x, "cost", sum)) Prune(acme, function(x) x$cost > 700000) ## [1] 5 print(acme, "cost") ## levelName cost ## 1 Acme Inc. 5190000 ## 2 ¦--Research 2750000 ## 3 ¦ ¦--New Product Line 2000000 ## 4 ¦ °--New Labs 750000 ## 5 ¦--Accounting 1500000 ## 6 ¦ °--New Software 1000000 ## 7 °--IT 940000 Performance Considerations
CPU
The data.tree package has been built to work with hierarchical data, to support visualization, to foster rapid prototyping, and for other applications where development time saved is more important than computing time lost. Having said this, it becomes clear that big data and data.tree do not marry particularly well. Don't expect R to build your data.tree structure with a few million Nodes during your cigarette break. Do not try to convert a gigabyte JSON document to a data.tree structure in a testthat test case.
However, if you are respecting the following guidelines, I promise that you and your Nodes will have a lot of fun together. So here it goes:
- Creating a
Nodeis relatively expensive.CreateRegularTree(6, 6)creates adata.treestructure with 9331Nodes. On an AWS c4.large instance, this takes about 2.5 seconds. -
Cloneis similar toNodecreation, with an extra penalty of about 50%. - Traversing (
Traverse,Get,SetandDo) is relatively cheap.
This is really what you would expect. data.tree builds on R6, i.e. reference objects. There is an overhead in creating them, as your computer needs to manage the references they hold. However, performing operations that change your tree (e.g.Prune or Set) are often faster than value semantics, as your computer does not need to copy the entire object in memory.
Just to give you an order of magnitude: The following times are achieved on an AWS c4.large instance:
system.time(tree <- CreateRegularTree(6, 6)) ## user system elapsed ## 2.499 0.009 2.506 system.time(tree <- Clone(tree)) ## user system elapsed ## 3.704 0.023 3.726 system.time(traversal <- Traverse(tree)) ## user system elapsed ## 0.096 0.000 0.097 system.time(Set(traversal, id = 1:tree$totalCount)) ## user system elapsed ## 0.205 0.000 0.204 system.time(ids <- Get(traversal, "id")) ## user system elapsed ## 0.569 0.000 0.569 leaves <- Traverse(tree, filterFun = isLeaf) Set(leaves, leafId = 1:length(leaves)) system.time(Get(traversal, function(node) Aggregate(node, "leafId", max))) ## user system elapsed ## 1.418 0.000 1.417 With caching, you can save some time:
system.time(tree$Get(function(node) Aggregate(tree, "leafId", max, "maxLeafId"), traversal = "post-order")) ## user system elapsed ## 0.69 0.00 0.69 Memory
data.tree structures have a relatively large memory footprint. However, for every-day applications using modern computers, this will not normally have an impact on your work except when saving a data.tree structure to disk.
For an explanation why that is the case, you might want to read this answer on Stack Overflow.
Depending on your development environment, you might want to turn off the option to save the workspace to .RData on exit.
Source: https://cran.r-project.org/package=data.tree/vignettes/data.tree.html
0 Response to "how to draw a tree diagram in r"
Post a Comment