The representation of transformations in the BIRD database is based on the SDMX information model (see section II, 13.2 Model – Inheritance View, 13.2.1 Class Diagram).
According to the VTL model, a Transformation scheme is an ordered list of transformations. Therefore such a transformation scheme contains one or many transformations (i.e. one line of valid VTL code). The SDMX model specifies that transformations can contain one or many transformation nodes (i.e. the components of this line of valid VTL code). Therefore a transformation element is either a constant, an operation or a BIRD model object (i.e. a variable, cube, etc.). In case the transformation element represents an operation such a transformation element itself can have a relation to one or many transformation elements.
The BIRD database contains the complete information about transformation schemes in the sense that not only the decomposition of each transformation scheme into its transformations but also the decomposition of transformations into its transformation nodes according to the SDMX information model is stored in the database.
The below sections “Example” and “Representation in the database” explain the relation between transformation schemes, transformations and transformation nodes and their representation in the database.
Conventions used in the BIRD transformation rules / deviations from the VTL specification
special left joins
The reason for not using the left_join operator as defined in the VTL specification is that if we would do so we would have to handle instruments in different ways, depending on their relationship to other objects, e.g. protections. Take as an example a loan associated to a protection and another loan without a protection. If we would use the left_join operator we cannot (starting from the loan) generate a data set that comprises both loans including protection information (because the second loan has no protection). From the perspective of the aggregated output layer (e.g. FinRep) there is no distinction between a non-existing protection and a protection with value 0.
Please consider the following example:
We consider three loans:
Loan | ||
---|---|---|
loanId | Currency | value |
a | EUR | 13 |
b | USD | 17 |
C | EUR | 19 |
And some protections:
Protection | |
---|---|
protectionId | value |
1 | 3 |
2 | 5 |
3 | 7 |
Now two of these loans are linked to protections (i.e. a and c) while the third one is not:
instrumentProtection | |
---|---|
loanId | protectionId |
a | 1 |
a | 2 |
c | 1 |
A normal left_join would omit the loan that is not linked to a protection and because we don’t want to treat the loans differently (at least from a business perspective) we specify our own special left join operator (for further details about the operator please take a look at its specification in the Transformation package):
instrumentsAndTheirProtections := specialLeftJoin(loan, instrumentProtection); | |||
---|---|---|---|
loanId | currency | value | protectionId |
a | EUR | 6.50 | 1 |
a | EUR | 6.50 | 2 |
b | USD | 17.00 | NULL |
c | EUR | 19.00 | 1 |
This way we can describe the data production process without a technical distinction between loans with protections and loans without protections.
specialLeftJoin(instrumentsAndTheirProtections, protection); | ||||
---|---|---|---|---|
loanId | protectionId | currency | value | protection_value |
a | 1 | EUR | 6.50 | 3 |
a | 2 | EUR | 6.50 | 5 |
b | NULL | USD | 17.00 | NULL |
c | 1 | EUR | 19.00 | 3 |
keep with identifiers
Contrary to the VTL specification we allow identifiers to be listed as arguments of a keep operation. We also use the keep operator to extract a list with unique vales, e.g. by the expression result := someData[keep(identifierOfTheDataSet)]; we indicate that the result should be a dataset having only one column with unique values similiar to a SELECT DISTINCT identifierOfTheDataSet FROM someData.
calc used for casting to identifier
The reason why we use the calc operator for describing such an expression is that we didn’t find a cast operator in the VTL specification.
User defined operators with multiple expressions (not comprised in one expression)
According to the VTL specification only one output parameter per user defined operator is allowed, and theoretically it should be possible to merge multiple expressions into one expression (e.g. s := k * x and y := s + d -> y := (k * x) + d), however we do not comply with this constraint for two reasons:
(a) The official grammar is not capable of merging multiple expressions into one expression (at least not for all expressions), e.g. the expression max(someData[keep itsObservation] group by itsIdentifier)[rename itsObservation to maximumOfTheObservation]; is not covered by the grammar although the individual expressions are covered, i.e. firstStep := someData[keep itsObservation];, secondStep := max(firstStep group by itsIdentifier); and result := secondStep[rename itsObservation to maximumOfTheObservation]; is valid according to the grammar (and the Reference Manual).
(b) From a readability perspective it may prove disadvantageous to represent multiple expressions as only one statement / line.
Please note that, by convention, the resulting object in a multi-line user defined by the last object named RSLT.
Reusing names in a module / in the transformation schemes
According to the VTL specification names in a Transformation scheme need to be unique in order to build an oriented graph. For documentation purposes this constraint seems rather burdensome because it would force us to apply certain naming conventions on the transformations used in a Module (e.g. by adding an integer to every data set used in a Module). We believe that reusing names is a more flexible approach (think about an amendment of the first Transformation of a Module comprising multiple Transformations) and the fact that a Module comprises an order list of Transformations allows us to generate an oriented graph structures as well.
Reusing names for variables (in an expression)Reusing names for variables (in an expression)
The argumentation for this inconsistency with the VTL specification is similar to the one described above.
Omitting the Perspective identifier and the Reference date in join operations (except in cases where they are relevant for business purposes)
Because the Perspective identifier and the Reference date act as Dimensions (i.e. being part of the Primary Key) for most of the Cubes in the BIRD Input Layer we omit using them in the Transformations except in cases where these variables are relevant for business purposes, e.g. when selecting a specific Perspective or a particular Reference date. The rationale underlying to this convention is that there is no additional value added in listing them in the Transformations explicitly.
Amendments to the grammar
Please take a look at the grammar in our github page to get an overview of all amendments made to the official grammar.
(a) comments allowed for individual objects, e.g. data sets, if operators, else operators
We added our own token for comments (so not to interfere with the official comment tokens) and
allowed in front of if, else operators and in front of variable identifiers (varID).
(b) added labels for parser rules
In order to process expressions in a more robust way we added labels for some parser rules (see for example Vtl.g4, line 24 expr parser rule).
(c) added a parser rule for calling user defined operators
Transformation package structure
The transformation package in the dictionary is comprised by the following objects:
(i) Modules
(ii) Transformations
(iii) Functions
(iv) Transformation nodes
(v) Transformation node relationships
(vi) Transformation schemes
A Module is an ordered set of Transformations specifying manipulation of data based on (a) business requirements and / or (b) technical necessities.
A business requirement may be represented by stating that the Type of amortisation for Reverse repurchase loans is Bullet (4) by default (see P_IMPLCT_RVRS_RPRCHS_LNS). A technical necessity may be expressed by the fact that in order to enrich a loan with it’s associated protection we first need to enrich it with the linking table that establishes the many-to-many relationship between a loan and a protection (“one loan might be linked to many protections while one protection may be used for multiple loans”).
Each Module is assigned to a phase in the BIRD process that allows us to identify it’s location between the Input Layer (IL) and the Reference / Non-reference Output Layer (ROL / NROL). Additionally each Module is assigned to a type specifying – in rather broad terms – its purpose.
A Transformation is a statement expressed in VTL; please note take a look at the Conventions used in the BIRD transformation rules / deviations from the VTL specification to get a clear picture about the utility of VLT in the BIRD. Each transformation has one output and one or many inputs which may be identified by exploring the Transformation nodes associated to a particular Transformation.
Each Transformation is decompsed into its Transformation nodes, i.e. the elementary objects used to describe the expression. E.g. the expression y := k * x + d consists of 7 Transformation nodes where y, k, x and d represent Data sets and :=, * and + represent operators. Each operator has a predefined structure, e.g. in the case of the assignment operator (i.e. :=) there exists a left hand side and a right hand side; the same for the multiplication (i.e. *) and addition (i.e. +) operator. Note that most operators have a more complex structure than binary operators. Please also note that the defined structure for each operator was choosen arbitratily by the person providing the parser for VTL syntax and is not part of the official VTL specification. Because of the structure assigned to each operator node the Transformation nodes associated with a Transformation may be represented in a tree structure (i.e. a parent-child relationship between Transformation nodes where each Transformation node has at most one parent). Such a tree structure (also called Abstract Syntax Tree structure) may prove useful for technology independent implementation.
The Transformation node relationships represent that relationships between nodes that are not covered by the Abstract Syntax Tree. These relationships reflect the association between the Transformation nodes of different Transformations, i.e. the input and output parameters of each Transformation. E.g. the Transformations (T1) s := k * x and (T2) y := s + d; the Transformation node relationship of the Transformation nodes comprised in these two Transformations would specify that the data set s in the first Transformation (let’s call it T1.s) is used in the second Transformation (let’s say T2.s), i.e. T1.s -> T2.s. In this way the Transformation package comprises the data lineage of a data set in the BIRD process.
A Transformation schemes can be considered as a VTL program, i.e. a set of Transformations that are run together. The Transformation schemes listed in the Transformation package are the result of of the oriented graph structure for each Output Cube. Each Transformation scheme comprises the Transformations of the involved Modules.
Example
The following example will try to clarify the current status of the representation of transformations in the BIRD database:
Let’s assume we have a (database-)table named “coordinates” containing the columns (i.e. variables) x and y which (clearly) relate to some coordinate system. Our transformation scheme’s goal is to derive a new variable distance for all records where x and y are greater than or equal to 0 defined in the following way:
distance = sqrt(x * x + y * y).
Using VTL syntax we would write the following lines:
/*extract all records from the (database-)table coordinates and store the result in a dataset named “coordinates”*/ coordinates := get(“coordinates”); /*extract all records from the dataset “coordinates” where x and y are greater or equal to zero, keep only x and y and store the result in a dataset named “result”*/ result := coordinates [filter (x >= 0 and y >= 0), keep (x, y)]; /*apply a calculation on the dataset “result” which takes the square root of the sum of x squared and y squared and stores the result in a new column (i.e. variable) named “distance”*/ finalResult := result [calc sqrt (x * x + y * y) as “distance”]; |
The tree structure with respect to the second line can be illustrated as follows:
result := coordinates [filter (x >= 0 and y >= 0), keep (x, y)];
The term written in brackets is the type of transformation element which can be used to identify constants and BIRD model objects (i.e. variables, cubes, etc.).
Please note that the Boolean condition applied to the filter operator (i.e. “x>=0 and y>=0”) is completely decomposed into its components (i.e. transformation elements) in a structured way in the sense that the Boolean condition can be reengineered from this tree structure.
The decomposition of transformation into its transformation elements supports specific implementations of these transformations. For example in case of a SQL implementation we could apply the following mappings:
- := CREATE VIEW _______ AS
- Filter WHERE
- Keep SELECT
Walk the tree and create the corresponding line of SQL code:
CREATE VIEW result AS SELECT x, y FROM coordinates WHERE x >= 0 AND y >= 0; |
Please note that – in order to generate such an SQL statement – one must additionally rearrange the nodes of the tree according to the SQL syntax. Please also note that the elements after each keyword (i.e. CREATE VIEW, SELECT, FROM, WHERE) are similar to the elements represented in the tree structure.
For the sake of completeness you find the tree representation of the first and second line here:
coordinates := get(“coordinates”);
finalResult := result [calc sqrt(x * x + y * y) as “distance”];
Representation in the database
The transformation scheme is stored in the table TRANSFORMATION_SCHEME:
SCHEME_ID | EXPRESSION | DESCRIPTION | NATURAL_LANGUAGE |
---|---|---|---|
TEST_SCHEME | /*extract all records from the (database-)table coordinates and store the result in a dataset named “coordinates”*/ coordinates := get(“coordinates”); /*extract all records from the dataset “coordinates” where x and y are greater or equal to zero, keep only x and y and store the result in a dataset named “result”*/ result := coordinates [filter (x >= 0 and y >= 0), keep (x, y)]; /*apply a calculation on the dataset “result” which takes the square root of the sum of x squared and y squared and stores the result in a new column (i.e. variable) named “distance”*/ finalResult := result [calc sqrt (x * x + y * y) as “distance”]; |
Transformation scheme example | Transformation scheme deriving the distance between x and y |
Please note that this is a reduced version of the original table, presented for illustrative purposes.
Each individual transformation is stored in the TRANSFORMATION table:
TRANSFORMATION_ID | EXPRESSION | SCHEME_ID | ORDER |
---|---|---|---|
156430 | /*extract all records from the (database-)table coordinates and store the result in a dataset named “coordinates”*/ coordinates := get(“coordinates”); |
TEST_SCHEME | 0 |
156435 | /*extract all records from the dataset “coordinates” where x and y are greater or equal to zero, keep only x and y and store the result in a dataset named “result”*/ result := coordinates [filter (x >= 0 and y >= 0), keep (x, y)]; |
TEST_SCHEME | 1 |
156451 | /*apply a calculation on the dataset “result” which takes the square root of the sum of x squared and y squared and stores the result in a new column (i.e. variable) named “distance”*/ finalResult := result [calc sqrt (x * x + y * y) as “distance”]; |
TEST_SCHEME | 2 |
Using the SCHEME_ID we can connect these transformations with the related transformation scheme (which is similar to stating that “these transformations are children of the transformation scheme with SCHEME_ID ”TEST_SCHEME”).
All Transformation elements are stored in the table TRANSFORMATION_NODE:
TRANSFORMATION_ID | NODE_ID | EXPRESSION | TYPE_OF_NODE | LEVEL | PARENT | ORDER |
---|---|---|---|---|---|---|
156430 | 156431 | := | OperatorNode | 0 | ||
156430 | 156432 | coordinates | ReferenceNode | 1 | 156431 | 0 |
156430 | 156433 | get | OperatorNode | 1 | 156431 | 1 |
156430 | 156434 | “coordinates” | ReferenceNode | 2 | 156433 | 0 |
156435 | 156436 | := | OperatorNode | 0 | ||
156435 | 156437 | result | ReferenceNode | 1 | 156436 | 0 |
156435 | 156438 | keep | OperatorNode | 1 | 156436 | 1 |
156435 | 156439 | filter | OperatorNode | 2 | 156438 | 0 |
156435 | 156440 | coordinates | ReferenceNode | 3 | 156439 | 0 |
156435 | 156441 | and | OperatorNode | 3 | 156439 | 1 |
156435 | 156442 | >= | OperatorNode | 4 | 156441 | 0 |
156435 | 156443 | x | ReferenceNode | 5 | 156442 | 0 |
156435 | 156444 | 0 | ConstantNode | 5 | 156442 | 1 |
156435 | 156445 | >= | OperatorNode | 4 | 156441 | 1 |
156435 | 156446 | y | ReferenceNode | 5 | 156445 | 0 |
156435 | 156447 | 0 | ConstantNode | 5 | 156445 | 1 |
156435 | 156448 | Parameters | OperatorNode | 2 | 156438 | 1 |
156435 | 156449 | x | ReferenceNode | 3 | 156448 | 0 |
156435 | 156450 | y | ReferenceNode | 3 | 156448 | 1 |
156451 | 156452 | := | OperatorNode | 0 | ||
156451 | 156453 | finalResult | ReferenceNode | 1 | 156452 | 0 |
156451 | 156454 | calc | OperatorNode | 1 | 156452 | 1 |
156451 | 156455 | result | ReferenceNode | 2 | 156454 | 0 |
156451 | 156456 | as | OperatorNode | 2 | 156454 | 1 |
156451 | 156457 | sqrt | OperatorNode | 3 | 156456 | 0 |
156451 | 156458 | + | OperatorNode | 4 | 156457 | 0 |
156451 | 156459 | * | OperatorNode | 5 | 156458 | 0 |
156451 | 156460 | x | ReferenceNode | 6 | 156459 | 0 |
156451 | 156461 | x | ReferenceNode | 6 | 156459 | 1 |
156451 | 156462 | * | OperatorNode | 5 | 156458 | 1 |
156451 | 156463 | y | ReferenceNode | 6 | 156462 | 0 |
156451 | 156464 | y | ReferenceNode | 6 | 156462 | 1 |
156451 | 156465 | “distance” | ConstantNode | 3 | 156456 | 1 |
This structure supports easy access to the components of each Transformation. If, for example, we are interested in the operators that are used in the second line (result := coordinates [filter (x >= 0 and y >= 0), keep (x, y)]; compare FIG 1) we simply select all rows where the TRANSFORMATION_ID equals 156435 and restrict the result to those records where the TYPE_OF_NODE equals OperatorNode. The result reflects the blue squares in FIG 1.
Please note that not only Transformations are represented in this structure but also Functions, Datasets and Procedures.
Transformation parser
We developed a parser for VTL in order to visualize tree structures and support the production of the output data model required for the BIRD database. The parser is written in Java, and available in GitHub.
Dependencies of transformation schemes
Due to the fact that most of the transformation schemes depend on other schemes in the sense that they use datasets that are generated in other schemes we provide a graphical illustration of such dependencies for each transformation scheme following the header “Scheme dependencies” in the “Natural language” section. These dependencies can be computed from the transformation content in the database (i.e. the tables TRANSFORMATION_SCHEME, TRANSFORMATION, TRANSFORMATION_NODE).
Please note that we implemented some restrictions to the transformation schemes that are taken into account when computing these dependencies; first we do not include validation and put schemes in the dependency tree and second, such a dependency tree does not contain any duplication of related schemes (although multiple schemes in such a tree may depend on the same transformation scheme).