Our last article explored the syntax and structure of migration files. Today, we are diving deeper into the most important part of a migration: the process pipeline. This determines how source data is processed and transformed to match the expected destination structure. We will learn how to configure and chain process plugins, how to set subfields and deltas for multi-value fields, and how to work with source constants and pseudo-fields.

Let’s get started.

From source to destination

The process section in a migration is responsible for transforming data as extracted from the source into a format that the destination expects. The collection of all those data transformations is known as the migration process pipeline. The Migrate API is a generic ETL framework. This means the source data can come from different types of sources like a database table; a CSV, JSON, or XML file; a remote API using JSON:API or GraphQL; or something else. The destination can be just as diverse, including databases, text files, and remote APIs. Because the series focuses on migrating from Drupal 7 to 10, most of our discussion will revolve around reading the Drupal 7 database and writing Drupal 10 entities.
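Those ETL stages map directly to the source, process, and destination sections of a migration file. As a minimal sketch, using the core d7_node source plugin and entity:node destination plugin (the id, label, and content type are hypothetical):

id: example_d7_article
label: 'Example Drupal 7 article migration'
# Extract: read articles from the Drupal 7 database.
source:
  plugin: d7_node
  node_type: article
# Transform: copy the node title verbatim.
process:
  title: title
# Load: create Drupal 10 node (content) entities.
destination:
  plugin: entity:node
  default_bundle: article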

Most of the source plugins we are going to see during the series extend the SqlBase abstract class, which requires implementing a query method. When migrating from Drupal 7, source plugins will use this method to fetch columns from one or multiple tables. For example, the d7_node source plugin fetches the latest node revision data from the node and node_revision Drupal 7 tables. It also handles node translations and exposes a node_type configuration option to limit which content types to fetch nodes from. The columns that the source plugins return become available in the process section. Source plugins can also implement a fields method to document which fields are available. You can use the drush migrate:fields-source [MIGRATION_ID] command to see a list of field names and their descriptions for a given migration ID.
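For instance, to list the fields exposed by the hypothetical migration sketched above:

drush migrate:fields-source example_d7_article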

Note: The drush migrate:fields-source command calls the fields method, not the query method. It is possible that extra data is fetched even if it is not reported by the command. To get the complete picture of what is available, you can use the Migrate Devel module or debug the migration import process. We are going to learn about debugging migrations in a future article.

Source plugins can optionally implement a prepareRow method to add additional data to each row being processed. In many cases, this comes from fetching data from other Drupal 7 tables. This is how the d7_node source plugin gets field API data attached to the content type. Field attachments are not reported by the migrate:fields-source drush command. The prepareRow method can also be used to skip a row entirely based on a specified condition.

This brings up a very important point about the migration process. Each element to import is fetched, transformed, and stored one at a time. In the case of nodes, each individual node is fetched from Drupal 7, the data transformed by the process pipeline, and the results sent to the destination plugin to create a Drupal 10 node (content) entity.

What if there is data you want or need to exclude from the migration? Fortunately, there are multiple methods to filter out data from a migration. A condition on the query method is ideal because the records are filtered out before any data related to them is fetched. If that is not possible, try prepareRow. In this case, Drupal 7 will be queried to get source data, but the record is skipped before sending it to the process pipeline. You can also use process plugins to skip records, as shown below. Doing so as late as the destination plugin is also possible, but generally not necessary. When working through the examples, we will see multiple ways of filtering out Drupal 7 data.
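As a sketch of the process plugin approach, the core skip_on_empty plugin with its row method discards the whole record when the configured source field is empty. The field names here are hypothetical:

process:
  field_date:
    -
      plugin: skip_on_empty
      # The row method skips the entire record, not just this field.
      method: row
      source: field_published_date
      message: 'Skipped: the source record has no published date'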

Once the data is fetched from Drupal 7, it has to be transformed into a format that destination plugins can use to create Drupal 10 configuration and content entities. This is where we are going to spend most of the time when writing migrations. In some cases, we can copy data verbatim, like a node title. Sometimes changes are necessary because the underlying storage model changed. For example, a date that was stored as a UNIX timestamp in Drupal 7 might now use a string representation in the Drupal 10 database. When making changes to the content model, we might have to figure out a way to map old data to the new one. This can happen at various levels: within a single field, within a single entity, or across multiple entities. Throughout the series we will present multiple examples of these types of transformations.
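As a concrete sketch of the timestamp case, the core format_date process plugin can perform that conversion. The field names are hypothetical, and the to_format value shown corresponds to how datetime fields store date and time:

process:
  field_event_date/value:
    plugin: format_date
    # U indicates a UNIX timestamp as the input format.
    from_format: U
    to_format: 'Y-m-d\TH:i:s'
    source: field_event_date/0/value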

It is important to highlight that during the process pipeline, Drupal does not yet know if the entity can be created or not. This is especially true when entity validation is enabled in the destination plugin for content entities. Our job is to apply the necessary transformations to produce a valid representation of an entity. That is why it is imperative to have a good understanding of entities, their base properties, and the fields attached to them. We talked about this in article 7.

After all the transformations, the destination plugin grabs the data sent by the process pipeline and performs an import operation. In most cases, this means forwarding the data to the Entity API for it to create a configuration or content entity. As a result, we are not writing to Drupal 10 database tables directly but letting Drupal handle the operation based on existing APIs. Additionally, any hook or event that responds to entity actions will be triggered and executed.
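For reference, a typical destination section for nodes looks like the snippet below. The validate option enables the entity validation mentioned earlier, and default_bundle indicates which content type to create when the process pipeline does not provide one:

destination:
  plugin: entity:node
  default_bundle: article
  validate: true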

While highly discouraged, it is possible to use destination plugins that write directly to database tables. We are not going to cover this as part of the series.

Process plugins

To make the necessary transformations, you need to apply one or more process plugins. Drupal core comes with many general-purpose process plugins. It also ships with plugins aimed at assisting with Drupal 7 to 10 migrations. The Migrate Plus module also offers very useful process plugins, which we will use throughout the series.

Other modules that provide process plugins include migrate_conditions, migrate_process_extra, and migration_tools (not to be confused with the Migrate Tools module). It is important to know what transformations are possible with the process plugins that already exist, so take time to review those resources.

That being said, it is quite common to write custom process plugins during Drupal 7 to 10 migrations. This approach will also be covered later in the series.

Note: Explaining what each process plugin does is beyond the scope of this series. To learn more, refer to the official documentation or other guides like the 31 days of migration series. When working on the examples, we will provide context and explanation for the plugins used in customizing the migrations.

Migration code snippets

Now let's review examples to understand how process plugins are configured.

process:
  nid:
    -
      plugin: get
      source: tnid

In this snippet, we are using the get process plugin, which expects a configuration option named source. The get plugin is used to make a verbatim copy of a source field as fetched by the source plugin. Here we are copying the tnid source field into the nid destination property. As we learned in the previous article, a dash (-) in YAML is used to represent an element of an array. It is possible, and common, to use multiple process plugins in order to make the necessary transformations on a single source field. When only a single transformation is needed, this can be simplified by removing the dash:

process:
  nid:
    plugin: get
    source: tnid

The plugin configuration is added as a direct child of the destination property. Because get is a very common operation, the Migrate API considers it the default process plugin to apply. The above snippet can be further simplified to:

process:
  nid: tnid

This is syntactic sugar provided by the Migrate API: alternative ways to write a pipeline with the objective of making it easier to read. When writing migrations, prioritize readability as that will help others, including your future self, to understand what the migration is doing. This extends to how the transformations are applied. If you need to use multiple process plugins and the pipeline gets long or hard to understand, consider writing a custom process plugin instead.

Each plugin offers a different set of configuration options. Some might be required while others are optional.

process:
  langcode:
    plugin: default_value
    source: language
    default_value: und
    strict: TRUE

The default_value process plugin checks if the specified source property contains a value. If not, whatever is indicated in the default_value configuration option is returned. An optional strict configuration option can be provided. When set to TRUE, the default is applied only when the input value is NULL. When set to FALSE, the default is applied whenever the input contains a falsy PHP value, such as NULL, FALSE, zero, or an empty string. If not specified, strict is considered to be FALSE.

The two examples above populate entity base properties, which store a single value. When working with fields attached to a content entity, it is possible for a field to store more than one piece of data. Let's see how we can migrate a Text (formatted, long) field:

process:
  field_description/value:
    -
      plugin: get
      source: description
  field_description/format:
    -
      plugin: get
      source: format

A Text (formatted, long) field stores two pieces of data: the text to show (the value property) and the text format to apply when rendering it (the format property). Each piece of data a field stores is referred to as a subfield. In this case we have two, but other field types can have more. For example, link fields have three subfields: uri, title, and options. The last one stores a serialized array of options for the link.

It is important not only to know which subfields are available, but also to understand in which format we need to provide the data. In the case of the options subfield, it has to be a PHP serialized array. It would be impractical to list all subfields for every possible field. Refer to this reference of subfields for some common field types.

Going back to our example, the syntax to use is [FIELD_NAME]/[SUB_FIELD_NAME] followed by the process pipeline necessary for the specified subfield. Note that some field types have a default subfield. When that is the case and you only want to set the default subfield, you can omit the [SUB_FIELD_NAME] part of the assignment:

process:
  field_description:
    -
      plugin: get
      source: description
  field_description/format:
    -
      plugin: get
      source: format

The above is a contrived example equivalent to the previous code snippet. Again, favor readability and clarity. If the above seems confusing, it is better to be verbose and indicate the subfield even when mapping the default one.

Technical note: The list of available subfields corresponds to the database columns used by the field type. They are listed in the schema method of the class that implements the field type. The default subfield is provided by the mainPropertyName method of the same class. For some field types, like double_field, where there is no clear primary subfield, the mainPropertyName method will return NULL.

Another thing to consider with fields attached to content entities is the possibility of them having multiple values. Let's consider the following example, which combines multiple link fields in Drupal 7 into a single multi-value link field in Drupal 10:

process:
  field_online_profiles/0/uri: field_drupal/0/url
  field_online_profiles/0/title:
    plugin: default_value
    default_value: 'Drupal.org profile'
  field_online_profiles/1/uri: field_gitlab/0/url
  field_online_profiles/1/title:
    plugin: default_value
    default_value: 'GitLab profile'
  field_online_profiles/2/uri: field_github/0/url
  field_online_profiles/2/title:
    plugin: default_value
    default_value: 'GitHub profile'

The syntax for mapping multi-value fields is [FIELD_NAME]/[DELTA]/[SUB_FIELD_NAME] followed by the process pipeline, where [DELTA] is an integer starting at zero indicating the position of the element in the multi-value collection. Notice that the same syntax applies to extracting data from Drupal 7 source fields and to assigning data to Drupal 10 destination properties. If no subfield is specified, the default subfield is assumed, as explained before. In the previous example, if there is no need to set the link title, the transformation can be simplified to:

process:
  field_online_profiles/0/uri: field_drupal/0/url
  field_online_profiles/1/uri: field_gitlab/0/url
  field_online_profiles/2/uri: field_github/0/url

Having to specify the delta manually like this is not very flexible. The sub_process process plugin is often used for handling the assignment of multi-value fields, optionally setting more than one subfield in the same operation. For example:

process:
  field_image:
    -
      plugin: sub_process
      source: field_image
      process:
        target_id: fid
        alt: alt
        title: title
        width: width
        height: height
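Note that mapping target_id directly from fid only works when file IDs are preserved across the migration. A common variation, sketched here assuming a hypothetical file migration with the ID upgrade_d7_file, uses the core migration_lookup plugin inside sub_process to translate old identifiers into new ones:

process:
  field_image:
    -
      plugin: sub_process
      source: field_image
      process:
        # Look up the Drupal 10 file ID created by the file migration.
        target_id:
          plugin: migration_lookup
          migration: upgrade_d7_file
          source: fid
        alt: alt
        title: title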

Process plugin chains

In the code snippets presented so far, only one process plugin has been used for each destination property. Some transformations are more complex than others and using multiple plugins in a chain is a common practice. Chaining of process plugins works similarly to Unix pipelines in that the output of one process plugin becomes the input of the next one in the chain. When the last plugin in the chain completes its transformation, the return value is assigned to the destination property.

process:
  source_full_path:
    -
      plugin: concat
      delimiter: /
      source:
        - constants/source_base_path
        - filepath
    -
      plugin: urlencode

In this example, the return value of the concat process plugin becomes the source for the urlencode plugin. Then, the output of urlencode is assigned to the source_full_path destination property. Most of the time, only the first plugin in the chain will have a source configured. For the rest, it is assumed that the output of the previous plugin should be the source. That being said, the Migrate API will honor a source configuration in subsequent plugins in a process chain. An example would be using the skip_on_empty process plugin to check for the presence of a field, but then performing the transformation based on a different source field.

process:
  field_home_address:
    -
      plugin: skip_on_empty
      source: field_is_home_address/0/value
      method: process
    -
      plugin: addressfield
      source: field_address

Here we are checking if a boolean field_is_home_address is set to a non-empty value. In Drupal 7, a TRUE value will be stored as the integer 1, while a FALSE value will be saved as the integer 0. When the source field is TRUE, skip_on_empty allows the pipeline to continue, passing the 1 value to addressfield. But then, addressfield reads a different source field named field_address, and the result of that transformation is assigned to the field_home_address destination property.

Source constants and pseudo-fields

In the Migrate API, source constants are arbitrary values defined in the source section. Pseudo-fields are named variables defined in the process section to store the result of a plugin chain. In both cases, they can be used in the process pipeline after being defined. Let's look at an example:

source:
  key: migrate
  plugin: d7_file
  scheme: public
  constants:
    source_base_path: 'http://ddev-migration-drupal7-web/'
process:
  source_full_path:
    -
      plugin: concat
      delimiter: /
      source:
        - constants/source_base_path
        - filepath
    -
      plugin: urlencode
  uri:
    -
      plugin: file_copy
      source:
        - '@source_full_path'
        - uri

In this case, we are creating a source constant named source_base_path. This is used as one of the source arguments sent to the concat plugin. In the example, source_full_path is neither an entity base property nor a field attached to the entity bundle. Therefore, it is considered a pseudo-field, which stores the output of the process plugin chain composed of the concat and urlencode plugins. To reference the value of a pseudo-field, you prefix its name with an at sign (@) and enclose it in single quotes (').

When expanding the source configuration for a process plugin, if the name starts with @, it is read from the process section. Otherwise, it is read from the source section. A pseudo-field must appear in the process pipeline before it can be used. This is similar to how the rewrite results feature works in Views. Like pseudo-fields, you can also reference other destination properties using the @ prefix, as long as they have been defined earlier in the process pipeline. In the snippet below, we reuse the mail property to assign the value of the init property.

process:
  mail:
    -
      plugin: get
      source: field_email/0/email
  init: '@mail'

As we have explored in this article, the Migrate API is flexible and powerful. Out of the box, you are able to write complex data transformations with process plugins. And when needed, writing custom process plugins is a straightforward process. However, remember to strive to write process pipelines that are easy to read, understand, and maintain. Stay tuned as we build on these foundations and continue to explore the migration process.

