Code Property Graph Specification 1.1
Contributors: Fabian Yamaguchi, Markus Lottmann, Niko Schmidt, Michael Pollmeier, Suchakra Sharma, Claudiu-Vlad Ursache.
This is the specification of the Code Property Graph, a language-agnostic intermediate graph representation of code designed for code querying.
The code property graph is a directed, edge-labeled, attributed multigraph. This specification provides the graph schema, that is, the types of nodes and edges and their properties, as well as constraints that specify which source and destination nodes are permitted for each edge type.
The graph schema is structured into multiple layers, each of which provides node, property, and edge type definitions. A layer may depend on multiple other layers and make use of the types they provide.
In the following, we describe each layer in detail. Note that this specification faithfully represents the code property graph as implemented by the Joern static analysis framework, because it is generated from Joern's code.
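To make the notion of a directed, edge-labeled, attributed multigraph with source and destination constraints more concrete, here is a minimal sketch in Python. The `Graph`, `Node`, and `Edge` classes and the `ALLOWED` table are hypothetical illustrations, and the permitted label pairs are assumptions chosen for the example rather than the normative constraints of the schema.

```python
from dataclasses import dataclass, field

# Illustrative schema fragment (an assumption, not the normative constraints):
# edge label -> permitted (source label, destination label) pairs.
ALLOWED = {
    "AST": {("METHOD", "BLOCK"), ("BLOCK", "CALL")},
    "CFG": {("METHOD", "CALL"), ("CALL", "METHOD_RETURN")},
}


@dataclass
class Node:
    id: int
    label: str
    properties: dict = field(default_factory=dict)  # node attributes


@dataclass
class Edge:
    src: int
    dst: int
    label: str


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []  # a list, so parallel edges between two nodes are representable

    def add_node(self, label, **properties):
        node = Node(len(self.nodes), label, properties)
        self.nodes[node.id] = node
        return node

    def add_edge(self, src, dst, label):
        # Enforce the schema constraint for this edge label.
        if (src.label, dst.label) not in ALLOWED.get(label, set()):
            raise ValueError(f"{label} edge not permitted from {src.label} to {dst.label}")
        self.edges.append(Edge(src.id, dst.id, label))


g = Graph()
method = g.add_node("METHOD", SIGNATURE="int()")
block = g.add_node("BLOCK", ORDER=1)
call = g.add_node("CALL", ORDER=1)
g.add_edge(method, block, "AST")
g.add_edge(block, call, "AST")
g.add_edge(method, call, "CFG")
```

Attempting `g.add_edge(call, method, "AST")` raises `ValueError` in this sketch, because that label pair is not listed for `AST` edges.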
MetaData
The Meta Data Layer contains information about CPG creation. In particular, it indicates which language frontend generated the CPG and which overlays have been applied. The layer consists of a single node, the Meta Data node, and language frontends MUST create this node. Overlay creators MUST edit this node to indicate that a layer has been successfully applied in all cases where the layer must not be applied more than once.
META_DATA
LANGUAGE
OVERLAYS
ROOT
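The following sketch illustrates how an overlay creator might consult and update the Meta Data node so that a layer is not applied twice. The dictionary representation, the `apply_overlay` helper, the overlay name `"dominators"`, and the property values are all hypothetical; only the property names `LANGUAGE`, `OVERLAYS`, and `ROOT` come from the list above.

```python
def apply_overlay(meta_data, name, run_pass):
    """Run an overlay pass at most once and record it in the META_DATA node."""
    if name in meta_data["OVERLAYS"]:
        return False  # already applied, so do nothing
    run_pass()
    meta_data["OVERLAYS"].append(name)  # record success only after the pass ran
    return True


# Hypothetical META_DATA node as created by a language frontend.
meta_data = {"LANGUAGE": "C", "OVERLAYS": [], "ROOT": "/path/to/project"}

runs = []
first = apply_overlay(meta_data, "dominators", lambda: runs.append("dominators"))
second = apply_overlay(meta_data, "dominators", lambda: runs.append("dominators"))
```

In this sketch the second call is a no-op: `first` is `True`, `second` is `False`, and the pass body has run exactly once.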
FileSystem
CPGs are created from sets of files and the File System Layer describes the layout of these files, that is, it provides information about source files and shared objects for source-based and machine-code-based frontends respectively. The purpose of including this information in the CPG is to allow nodes of the graph to be mapped back to file system locations.
FILE
SOURCE_FILE
COLUMN_NUMBER
COLUMN_NUMBER_END
FILENAME
LINE_NUMBER
LINE_NUMBER_END
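As a small illustration of mapping a node back to a file system location, the sketch below formats a location string from the properties listed above. It assumes, hypothetically, that a node is available as a dictionary carrying `FILENAME` and optionally `LINE_NUMBER` and `COLUMN_NUMBER`; the `location` helper is not part of the specification.

```python
def location(node):
    """Format 'file:line:column', omitting trailing parts that are unknown."""
    parts = [node["FILENAME"]]
    for key in ("LINE_NUMBER", "COLUMN_NUMBER"):
        if node.get(key) is None:
            break  # a column without a line would be meaningless
        parts.append(str(node[key]))
    return ":".join(parts)


full = location({"FILENAME": "src/main.c", "LINE_NUMBER": 12, "COLUMN_NUMBER": 5})
line_only = location({"FILENAME": "src/main.c", "LINE_NUMBER": 12})
file_only = location({"FILENAME": "lib/libfoo.so"})
```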
Namespace
Many programming languages allow code to be structured into namespaces. The Namespace Layer makes these namespaces explicit and associates program constructs with the namespaces they are defined in.
NAMESPACE
NAMESPACE_BLOCK
Method
The Method Layer contains declarations of methods, functions, and procedures. Input parameters and output parameters (including return parameters) are represented; however, method contents are not present in this layer.
METHOD
METHOD_PARAMETER_IN
METHOD_PARAMETER_OUT
METHOD_RETURN
IS_VARIADIC
SIGNATURE
Type
The Type Layer contains information about type declarations, relations between types, and type instantiation and usage. In its current form, it allows modelling of parametrized types, type hierarchies and aliases.
MEMBER
TYPE
TYPE_ARGUMENT
TYPE_DECL
TYPE_PARAMETER
ALIAS_OF
BINDS_TO
INHERITS_FROM
ALIAS_TYPE_FULL_NAME
INHERITS_FROM_TYPE_FULL_NAME
TYPE_DECL_FULL_NAME
TYPE_FULL_NAME
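To show how a type hierarchy could be recovered from this layer, the sketch below computes all direct and indirect base types of a type declaration. It assumes that each type declaration carries a list of base type names in `INHERITS_FROM_TYPE_FULL_NAME`; the dictionary layout, the type names, and the `all_base_types` helper are hypothetical.

```python
# Hypothetical TYPE_DECL nodes: full name -> INHERITS_FROM_TYPE_FULL_NAME list.
type_decls = {
    "Object": [],
    "Comparable": [],
    "Animal": ["Object"],
    "Dog": ["Animal", "Comparable"],
}


def all_base_types(full_name):
    """Collect every type reachable by following inheritance transitively."""
    seen = []
    stack = list(type_decls.get(full_name, []))
    while stack:
        base = stack.pop()
        if base not in seen:  # also guards against cyclic input
            seen.append(base)
            stack.extend(type_decls.get(base, []))
    return seen


dog_bases = set(all_base_types("Dog"))
```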
Ast
The Abstract Syntax Tree (AST) Layer provides syntax trees for all compilation units. All nodes of the tree inherit from the same base class (`AST_NODE`) and are connected to their child nodes via outgoing `AST` edges. Syntax trees are typed, that is, when possible, types for all expressions are stored in the tree. Moreover, common control structure types are defined in the specification, making it possible to translate trees into corresponding control flow graphs if only these common control structure types are used, possibly by desugaring on the side of the language frontend. For cases where this is not an option, the AST specification provides means of storing language-dependent information in the AST that can be interpreted by language-dependent control flow construction passes. This layer MUST be created by the frontend.
AST_NODE
BLOCK
CALL
CALL_REPR
CONTROL_STRUCTURE
EXPRESSION
FIELD_IDENTIFIER
IDENTIFIER
JUMP_LABEL
JUMP_TARGET
LITERAL
LOCAL
METHOD_REF
MODIFIER
RETURN
UNKNOWN
AST
CONDITION
CANONICAL_NAME
CONTROL_STRUCTURE_TYPE
MODIFIER_TYPE
ORDER
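The sketch below illustrates how a syntax tree stored as nodes and `AST` edges can be walked in source order. It assumes that `ORDER` gives the position of a node among its siblings; the node identifiers, the dictionary layout, and the `preorder` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical AST nodes: id -> label and ORDER among siblings.
nodes = {
    1: {"label": "BLOCK", "ORDER": 1},
    2: {"label": "CALL", "ORDER": 1},
    3: {"label": "RETURN", "ORDER": 2},
    4: {"label": "IDENTIFIER", "ORDER": 1},
    5: {"label": "LITERAL", "ORDER": 2},
}
# Outgoing AST edges (parent, child), deliberately listed out of order.
ast_edges = [(1, 3), (1, 2), (2, 5), (2, 4)]

children = defaultdict(list)
for parent, child in ast_edges:
    children[parent].append(child)


def preorder(node_id):
    """Visit a node, then its children sorted by their ORDER property."""
    yield node_id
    for child in sorted(children[node_id], key=lambda c: nodes[c]["ORDER"]):
        yield from preorder(child)


labels = [nodes[n]["label"] for n in preorder(1)]
# labels == ["BLOCK", "CALL", "IDENTIFIER", "LITERAL", "RETURN"]
```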
CallGraph
The Call Graph Layer represents call relations between methods.
ARGUMENT
CALL
RECEIVER
ARGUMENT_INDEX
ARGUMENT_NAME
DISPATCH_TYPE
EVALUATION_STRATEGY
METHOD_FULL_NAME
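As a rough illustration of how call relations might be derived, the sketch below links call sites to methods by comparing the `METHOD_FULL_NAME` of each call with the full names of known methods. The dispatch type values `STATIC_DISPATCH` and `DYNAMIC_DISPATCH`, the full-name format, and the `link_static_calls` helper are assumptions made for this example; dynamically dispatched calls are simply left unresolved here.

```python
# Hypothetical full names of METHOD nodes present in the graph.
methods = {"main:int()", "helper:void(int)"}

# Hypothetical call sites with the properties listed above.
calls = [
    {"id": 1, "METHOD_FULL_NAME": "helper:void(int)", "DISPATCH_TYPE": "STATIC_DISPATCH"},
    {"id": 2, "METHOD_FULL_NAME": "Shape.area:double()", "DISPATCH_TYPE": "DYNAMIC_DISPATCH"},
    {"id": 3, "METHOD_FULL_NAME": "printf", "DISPATCH_TYPE": "STATIC_DISPATCH"},
]


def link_static_calls(calls, methods):
    """Return (call edges, ids of call sites that could not be linked)."""
    edges, unresolved = [], []
    for call in calls:
        target = call["METHOD_FULL_NAME"]
        if call["DISPATCH_TYPE"] == "STATIC_DISPATCH" and target in methods:
            edges.append((call["id"], target))
        else:
            unresolved.append(call["id"])  # external or dynamically dispatched
    return edges, unresolved


call_edges, unresolved = link_static_calls(calls, methods)
```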
Cfg
The Control Flow Graph Layer provides control flow graphs for each method. Control flow graphs are constructed by marking a subset of the abstract syntax tree nodes as control flow nodes (`CFG_NODE`) and connecting these nodes via `CFG` edges. The control flow graph models the control flow both within the calculation of an expression and from expression to expression. The layer can be generated automatically from the syntax tree layer if only control structure types supported by this specification are employed.
CFG_NODE
CFG
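To illustrate control flow within the calculation of a single expression, the sketch below derives a linear chain of `CFG` edges for `x = y + 1` from its syntax tree. It assumes that operands are evaluated before the operation that consumes them, so a post-order walk yields the evaluation order; this is one plausible construction, not necessarily the one a given implementation uses, and the node names are hypothetical.

```python
# Hypothetical AST for `x = y + 1`: node -> children in ORDER.
ast = {
    "assign": ["x", "plus"],
    "plus": ["y", "1"],
    "x": [],
    "y": [],
    "1": [],
}


def post_order(node):
    """Yield operands before the operation that uses them."""
    for child in ast[node]:
        yield from post_order(child)
    yield node


sequence = list(post_order("assign"))
# Connect consecutive nodes of the evaluation order with CFG edges.
cfg_edges = list(zip(sequence, sequence[1:]))
# sequence == ["x", "y", "1", "plus", "assign"]
```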
Dominators
The Dominators Layer provides dominator and post-dominator trees for all methods. It is constructed automatically from the control flow graph layer and is in turn used to automatically construct the control dependence relations of the PDG layer.
DOMINATE
POST_DOMINATE
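The sketch below shows one standard way to derive dominator and post-dominator trees from a control flow graph, using the classic iterative data-flow formulation. It is a simple illustration under the assumption that every node is reachable from the entry; the graph, the node names, and the helper functions are hypothetical, and production implementations typically use faster algorithms.

```python
def dominator_sets(cfg, entry):
    """Iteratively compute, for each node, the set of nodes that dominate it."""
    preds = {n: set() for n in cfg}
    for src, dsts in cfg.items():
        for dst in dsts:
            preds[dst].add(src)
    dom = {n: set(cfg) for n in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in cfg:
            if n == entry:
                continue
            incoming = [dom[p] for p in preds[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n], changed = new, True
    return dom


def tree_edges(dom, entry):
    """Immediate dominator of n: its strict dominator with the largest dominator set."""
    return {n: max(ds - {n}, key=lambda d: len(dom[d]))
            for n, ds in dom.items() if n != entry}


def reverse(cfg):
    rev = {n: [] for n in cfg}
    for src, dsts in cfg.items():
        for dst in dsts:
            rev[dst].append(src)
    return rev


# Hypothetical CFG of an if/else: node -> successors.
cfg = {"entry": ["cond"], "cond": ["then", "else"],
       "then": ["exit"], "else": ["exit"], "exit": []}

idom = tree_edges(dominator_sets(cfg, "entry"), "entry")
# Post-dominators are dominators of the reversed graph, rooted at the exit.
ipdom = tree_edges(dominator_sets(reverse(cfg), "exit"), "exit")
```

Here `idom` maps `then`, `else`, and `exit` to `cond`, while `ipdom` maps `cond`, `then`, and `else` to `exit`.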
Pdg
The Program Dependence Graph Layer contains a program dependence graph for each method of the source program. A program dependence graph consists of a data dependence graph (DDG) and a control dependence graph (CDG), created by connecting nodes of the control flow graph via `REACHING_DEF` and `CDG` edges respectively.
CDG
REACHING_DEF
VARIABLE
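As an illustration of how control dependence can follow from post-dominance, the sketch below applies the textbook construction: for each control flow edge from `a` to `b`, every node on the post-dominator tree path from `b` up to, but excluding, the immediate post-dominator of `a` is control dependent on `a`. The graph and the precomputed `ipdom` map are hypothetical inputs, and the direction of the resulting edges (from the controlling node to the dependent node) is an assumption of this example.

```python
# Hypothetical CFG of an if/else: node -> successors.
cfg = {"entry": ["cond"], "cond": ["then", "else"],
       "then": ["exit"], "else": ["exit"], "exit": []}

# Immediate post-dominators for that CFG (assumed to be computed beforehand).
ipdom = {"entry": "cond", "cond": "exit", "then": "exit", "else": "exit"}


def control_dependence(cfg, ipdom):
    """Return the set of (controlling node, dependent node) pairs."""
    cdg = set()
    for a, successors in cfg.items():
        for b in successors:
            runner = b
            # Walk up the post-dominator tree until reaching ipdom(a).
            while runner != ipdom[a]:
                cdg.add((a, runner))
                runner = ipdom[runner]
    return cdg


cdg_edges = control_dependence(cfg, ipdom)
# cdg_edges == {("cond", "then"), ("cond", "else")}
```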
Finding
We allow findings (e.g., potential vulnerabilities, notes on dangerous practices) to be stored in the Findings Layer.
FINDING
KEY
Shortcuts
The Shortcuts Layer provides shortcut edges calculated to speed up subsequent queries. Language frontends MUST NOT create shortcut edges.
CONTAINS
EVAL_TYPE
PARAMETER_LINK
TagsAndLocation
The Code Property Graph specification allows tags to be attached to arbitrary nodes. Conceptually, this is similar to the creation of Finding nodes; however, tags are meant for intermediate results rather than end results that are reported to the user.
LOCATION
TAGGED_BY
CLASS_NAME
CLASS_SHORT_NAME
METHOD_SHORT_NAME
NODE_LABEL
PACKAGE_NAME
SYMBOL
Configuration
The code property graph specification currently does not contain schema elements for representing configuration files in a structured format. It does, however, allow configuration files to be included verbatim in the graph so that language- or framework-specific passes can access them. This layer provides the schema elements necessary for this basic support of configuration files.
Binding
We use the concept of "bindings" to support the resolution of (method-name, signature) pairs at type declarations (`TYPE_DECL`). For each pair that we can resolve, we create a `BINDING` node that is connected to the type declaration via an incoming `BINDS` edge. The `BINDING` node is connected to the method it resolves to via an outgoing `REF` edge.
BINDING
BINDS
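The lookup described above can be sketched as follows. The dictionaries stand in for `BINDS` and `REF` edges, and the `add_binding` and `resolve` helpers, the `NAME` and `SIGNATURE` keys on binding nodes, and the type and method names are hypothetical. The example also assumes that a type declaration may bind an inherited method it does not override.

```python
from itertools import count

_ids = count(1)
binding_nodes = {}  # BINDING id -> method name and signature it resolves
binds = {}          # TYPE_DECL full name -> BINDING ids (BINDS edges)
ref = {}            # BINDING id -> METHOD full name (REF edge)


def add_binding(type_decl, name, signature, method_full_name):
    bid = next(_ids)
    binding_nodes[bid] = {"NAME": name, "SIGNATURE": signature}
    binds.setdefault(type_decl, []).append(bid)
    ref[bid] = method_full_name
    return bid


def resolve(type_decl, name, signature):
    """Follow BINDS edges from the type declaration, then REF to the method."""
    for bid in binds.get(type_decl, []):
        node = binding_nodes[bid]
        if (node["NAME"], node["SIGNATURE"]) == (name, signature):
            return ref[bid]
    return None  # the pair cannot be resolved at this type declaration


add_binding("Base", "run", "void()", "Base.run:void()")
add_binding("Derived", "run", "void()", "Base.run:void()")  # inherited, not overridden
add_binding("Derived", "stop", "void()", "Derived.stop:void()")

inherited = resolve("Derived", "run", "void()")
missing = resolve("Base", "stop", "void()")
```

In this sketch `inherited` is `"Base.run:void()"` and `missing` is `None`.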
Annotation
Java annotation-related CPG definitions.
ANNOTATION
ANNOTATION_LITERAL
ANNOTATION_PARAMETER_ASSIGN
ARRAY_INITIALIZER
Base