Code Property Graph Specification 1.1
Contributors: Fabian Yamaguchi, Markus Lottmann, Niko Schmidt, Michael Pollmeier, Suchakra Sharma, Claudiu-Vlad Ursache.
This is the specification of the Code Property Graph, a language-agnostic intermediate graph representation of code designed for code querying.
The code property graph is a directed, edge-labeled, attributed multigraph. This specification provides the graph schema, that is, the types of nodes and edges and their properties, as well as constraints that specify which source and destination nodes are permitted for each edge type.
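To make the notion of a schema with endpoint constraints concrete, here is a minimal sketch in Python. It is not the Joern implementation. The `PropertyGraph` class, the `METHOD` node label, and the `name` property are assumptions made for illustration; `TYPE_DECL`, `BINDING`, `BINDS`, and `REF` are names used elsewhere in this specification.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    id: int
    label: str                      # node type, e.g. "TYPE_DECL"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    src: int
    dst: int
    label: str                      # edge type, e.g. "BINDS"
    properties: dict = field(default_factory=dict)


class PropertyGraph:
    """Directed, edge-labeled, attributed multigraph with endpoint constraints."""

    def __init__(self, allowed):
        # allowed: set of (edge label, source node label, destination node label)
        self.allowed = allowed
        self.nodes = {}
        self.edges = []             # a list, so parallel edges are permitted

    def add_node(self, label, **properties):
        node = Node(len(self.nodes), label, properties)
        self.nodes[node.id] = node
        return node

    def add_edge(self, src, dst, label, **properties):
        key = (label, src.label, dst.label)
        if key not in self.allowed:
            raise ValueError(f"edge not permitted by schema: {key}")
        edge = Edge(src.id, dst.id, label, properties)
        self.edges.append(edge)
        return edge


# Hypothetical mini-schema; "METHOD" is an assumed node label.
SCHEMA = {
    ("BINDS", "TYPE_DECL", "BINDING"),
    ("REF", "BINDING", "METHOD"),
}

graph = PropertyGraph(SCHEMA)
type_decl = graph.add_node("TYPE_DECL", name="Foo")
binding = graph.add_node("BINDING", name="bar")
graph.add_edge(type_decl, binding, "BINDS")
graph.add_edge(type_decl, binding, "BINDS")   # parallel edge: fine in a multigraph
```

Adding a `BINDS` edge in the opposite direction, from the `BINDING` node to the `TYPE_DECL` node, raises a `ValueError` in this sketch because that combination is not listed in the schema.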
The graph schema is structured into multiple layers, each of which provides node, property, and edge type definitions. A layer may depend on multiple other layers and make use of the types they provide.
In the following, we describe each layer in detail. Note that this specification faithfully represents the code property graph as implemented by the Joern static analysis framework, as it is generated from its code.
The Meta Data Layer contains information about CPG creation. In particular, it indicates which language frontend generated the CPG and which overlays have been applied. The layer consists of a single node - the Meta Data node - and language frontends MUST create this node. Overlay creators MUST edit this node to indicate that a layer has been successfully applied in all cases where applying the layer more than once is prohibitive.
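The requirement on overlay creators can be sketched as a guard around an overlay pass. This is an illustration only: the dictionary representation, the `language` and `overlays` keys, and the `apply_overlay` helper are assumptions, not part of this specification.

```python
def apply_overlay(meta_data, overlay_name, run_pass):
    """Run an overlay pass at most once and record it on the meta data node."""
    if overlay_name in meta_data["overlays"]:
        return False                      # already applied, skip
    run_pass()                            # if this raises, nothing is recorded
    meta_data["overlays"].append(overlay_name)
    return True


# Node as a language frontend might create it (property names are illustrative).
meta_data = {"language": "C", "overlays": []}

runs = []
first = apply_overlay(meta_data, "dominators", lambda: runs.append("dominators"))
second = apply_overlay(meta_data, "dominators", lambda: runs.append("dominators"))
```

The overlay name is appended only after the pass returns, so the node records successful applications only.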
CPGs are created from sets of files and the File System Layer describes the layout of these files, that is, it provides information about source files and shared objects for source-based and machine-code-based frontends respectively. The purpose of including this information in the CPG is to allow nodes of the graph to be mapped back to file system locations.
Many programming languages allow code to be structured into namespaces. The Namespace Layer makes these namespaces explicit and associates program constructs with the namespaces they are defined in.
The Method Layer contains declarations of methods, functions, and procedures. Input parameters and output parameters (including return parameters) are represented; however, method contents are not present in this layer.
The Type Layer contains information about type declarations, relations between types, and type instantiation and usage. In its current form, it allows modeling of parametrized types, type hierarchies, and aliases.
The Abstract Syntax Tree (AST) Layer provides syntax trees for all compilation units. All nodes of the tree inherit from the same base class (`AST_NODE`) and are connected to their child nodes via outgoing `AST` edges. Syntax trees are typed, that is, when possible, types for all expressions are stored in the tree. Moreover, common control structure types are defined in the specification, making it possible to translate trees into corresponding control flow graphs if only these common control structure types are used, possibly by desugaring on the side of the language frontend. For cases where this is not an option, the AST specification provides means of storing language-dependent information in the AST that can be interpreted by language-dependent control flow construction passes. This layer MUST be created by the frontend.
The Call Graph Layer represents call relations between methods.
The Control Flow Graph Layer provides control flow graphs for each method. Control flow graphs are constructed by marking a subset of the abstract syntax tree nodes as control flow nodes (`CFG_NODE`) and connecting these nodes via `CFG` edges. The control flow graph models control flow both within the calculation of an expression and from expression to expression. The layer can be automatically generated from the syntax tree layer if only control structure types supported by this specification are employed.
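For straight-line code without control structures, one way to picture how control flow within an expression and between expressions can be derived from the syntax tree is a post-order traversal. The sketch below is a simplification under that assumption, not the construction used by any particular frontend. The dictionary-based tree and the function names are hypothetical.

```python
def post_order(node):
    """Yield AST nodes in evaluation order: children first, then the node."""
    for child in node.get("children", []):
        yield from post_order(child)
    yield node["code"]


def straight_line_cfg(statements):
    """Connect the evaluation order of consecutive statements with CFG edges."""
    order = [code for stmt in statements for code in post_order(stmt)]
    return list(zip(order, order[1:]))


# AST for the two statements `x = a + b` and `foo(c)`.
statements = [
    {"code": "x = a + b", "children": [
        {"code": "x"},
        {"code": "a + b", "children": [{"code": "a"}, {"code": "b"}]},
    ]},
    {"code": "foo(c)", "children": [{"code": "c"}]},
]

cfg_edges = straight_line_cfg(statements)
```

Each pair in `cfg_edges` stands for one `CFG` edge. The edges inside `x = a + b` model evaluation order within the expression, and the edge from `x = a + b` to `c` models the transition to the next expression.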
The Dominators Layer provides dominator- and post-dominator trees for all methods. It is constructed automatically from the control flow graph layer and is in turn used to automatically construct control dependence relations of the PDG layer.
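As an illustration of how dominator information can be derived from a control flow graph, the following sketch uses the classic iterative set-based algorithm. This specification does not prescribe an algorithm, so the choice is an assumption, as are the function names and the successor-list representation. The sketch also assumes that every node is reachable from the entry node.

```python
def dominators(cfg, entry):
    """Return the dominator set of each node; cfg maps node -> successor list."""
    nodes = set(cfg)
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def immediate_dominators(dom, entry):
    """Derive dominator tree edges as a map node -> immediate dominator."""
    idom = {}
    for n, ds in dom.items():
        if n == entry:
            continue
        # Strict dominators form a chain; the closest one has the largest set.
        idom[n] = max(ds - {n}, key=lambda d: len(dom[d]))
    return idom


# Diamond-shaped CFG of an if/else construct.
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}

dom = dominators(cfg, "entry")
idom = immediate_dominators(dom, "entry")
```

Running the same functions on the reversed graph, with the exit node as the entry, yields the post-dominator tree.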
The Program Dependence Graph Layer contains a program dependence graph for each method of the source program. A program dependence graph consists of a data dependence graph (DDG) and a control dependence graph (CDG), created by connecting nodes of the control flow graph via `REACHING_DEF` and `CDG` edges respectively.
We allow findings (e.g., potential vulnerabilities, notes on dangerous practices) to be stored in the Findings Layer.
The Shortcuts Layer provides shortcut edges calculated to speed up subsequent queries. Language frontends MUST NOT create shortcut edges.
The Code Property Graph specification allows for tags to be attached to arbitrary nodes. Conceptually, this is similar to the creation of Finding nodes; however, tags are to be used for intermediate results rather than end results that are to be reported to the user.
The code property graph specification currently does not contain schema elements for the representation of configuration files in a structured format; however, it does allow configuration files to be included verbatim in the graph to enable language- or framework-specific passes to access them. This layer provides the necessary schema elements for this basic support of configuration files.
We use the concept of "bindings" to support resolving of (method-name, signature) pairs at type declarations (`TYPE_DECL`). For each pair that we can resolve, we create a `BINDING` node that is connected to the type declaration via an incoming `BINDS` edge. The `BINDING` node is connected to the method it resolves to via an outgoing `REF` edge.
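The lookup described above can be sketched as a two-step traversal: follow `BINDS` edges from the type declaration to a matching `BINDING` node, then follow its `REF` edge. The graph encoding, the `METHOD` label, the property names `name`, `signature`, and `full_name`, and the `resolve_method` helper are assumptions made for illustration.

```python
# Nodes keyed by id; edges as (source id, edge label, destination id).
nodes = {
    1: {"label": "TYPE_DECL", "name": "Shape"},
    2: {"label": "BINDING", "name": "area", "signature": "double()"},
    3: {"label": "METHOD", "full_name": "Shape.area:double()"},
}
edges = [
    (1, "BINDS", 2),   # TYPE_DECL -> BINDING
    (2, "REF", 3),     # BINDING -> METHOD
]


def resolve_method(type_decl_id, name, signature):
    """Return the id of the method bound to (name, signature), or None."""
    for src, label, dst in edges:
        if src != type_decl_id or label != "BINDS":
            continue
        binding = nodes[dst]
        if binding["name"] == name and binding["signature"] == signature:
            # Follow the outgoing REF edge of the matching BINDING node.
            for ref_src, ref_label, ref_dst in edges:
                if ref_src == dst and ref_label == "REF":
                    return ref_dst
    return None


resolved = resolve_method(1, "area", "double()")
unresolved = resolve_method(1, "area", "int()")
```

A pair without a matching `BINDING` node resolves to `None` in this sketch, which corresponds to a pair that could not be resolved at that type declaration.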