Handling HTML With Drupal's Migrate API

Benji Fisher // July 2019

In Drupal 8, we use the core Migrate API for:

  • Upgrading Drupal 6 and Drupal 7 sites
  • Migrating sites from other systems to Drupal
  • Recurring imports from external systems (feeds)

It is a robust, flexible tool.

Drupal works best with structured data, and the Migrate API supports this: file attachments, related taxonomy terms, references to authors or other nodes, and so on. Along with the structured data, we also have to deal with blocks of text, and these blocks often contain HTML markup.

Until now, the Migrate API has supported basic processing of text fields using regular expressions. Marco Villegas and I contributed some plugins to the Migrate Plus module to support proper HTML parsing. These plugins are easier to use and more reliable than regular expressions.

We originally wrote these plugins while working for Isovera on a project for Pega Systems. Both Isovera and Pega have supported sharing these plugins with the Drupal community. I hope other developers will use them and give back some of their own plugins that use the same approach.

Parsing HTML versus processing with regular expressions

Suppose you need to extract the URL from a bit of HTML markup like this:

<a href="https://www.drupal.org">Drupal home page</a>

Using regular expressions

If you are familiar with regular expressions, then it is pretty easy to come up with something like /<a href="([^"]+)">/ and use it with built-in PHP functions like preg_match().

Unfortunately, it is more complicated than that. For example:

  • The HTML tags are case-insensitive: you have to match a or A.
  • There might be other attributes, such as class, id, or name, before or after the href attribute.
  • The URL (value of the href attribute) might be enclosed in single quotes instead of double quotes.
  • There might be newlines within the HTML element.
  • Are you sure that an escaped quote (like \") is not allowed in a URL?

Before you start researching that last question, remember the larger point: you should not spend your time reinventing the wheel.
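
To make the pitfalls above concrete, here is a small sketch that runs the naive pattern against a few variations of the same link. The sample strings are made up for illustration; each one is a link that a browser would handle without complaint.

```php
<?php
// The naive pattern from the text above.
$pattern = '/<a href="([^"]+)">/';

// Hypothetical variations on the same link, for illustration only.
$samples = [
  'baseline' => '<a href="https://www.drupal.org">Drupal home page</a>',
  'uppercase tag' => '<A HREF="https://www.drupal.org">Drupal home page</A>',
  'extra attribute' => '<a class="external" href="https://www.drupal.org">Drupal home page</a>',
  'single quotes' => "<a href='https://www.drupal.org'>Drupal home page</a>",
  'newline in tag' => "<a\n  href=\"https://www.drupal.org\">Drupal home page</a>",
];

// Record the captured URL, or NULL when the pattern does not match.
$results = [];
foreach ($samples as $label => $html) {
  $results[$label] = preg_match($pattern, $html, $matches) === 1 ? $matches[1] : NULL;
}

foreach ($results as $label => $url) {
  printf("%-16s %s\n", $label . ':', $url ?? '(no match)');
}
```

Only the baseline sample matches. You could patch the pattern for each of the other cases, but every patch makes it harder to read and there is always one more case.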

There is an amusing answer on Stack Overflow describing the dangers of trying to process HTML with regular expressions, and this practice has come to be known as Parsing Html The Cthulhu Way. The Stack Overflow answer ends with the suggestion,

Have you tried using an XML parser instead?

Using the DOMDocument class

In PHP, we can use the DOMDocument and related classes to parse HTML markup. These classes use an HTML parser in the background rather than regular expressions. There are some steps to set things up:

$document = new \DOMDocument();
$document->loadHTML($html_string);
$xpath = new \DOMXPath($document);

After this bit of boilerplate code, we can use the $xpath object to run any XPath query against the document and extract whatever attributes we need. For example, to find the href attribute of each <a> element in the source:

foreach ($xpath->query('//a') as $html_node) {
  $href = $html_node->getAttribute('href');
  // Your processing goes here.
}

Using XPath queries gives us a lot of flexibility: we can find <a> elements having a specific class, or we can select those that are nested inside some other HTML element. We did not even think about these possibilities when discussing regular expressions above.
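
Here is a minimal, self-contained sketch of those two kinds of query. The markup, the class name, and the variable names are assumptions made up for the example. The call to libxml_use_internal_errors() is my addition: it keeps the parser from printing warnings when real-world markup is less than perfect.

```php
<?php
// Hypothetical markup, for illustration only. Note the uppercase tag and
// the single quotes, which tripped up the regular expression.
$html_string = <<<HTML
<div class="content">
  <p>See the <A CLASS="external" HREF='https://www.drupal.org'>Drupal home page</A>.</p>
</div>
<ul class="menu">
  <li><a href="/about">About</a></li>
</ul>
HTML;

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(TRUE);
$document = new \DOMDocument();
$document->loadHTML($html_string);
libxml_clear_errors();
$xpath = new \DOMXPath($document);

// Links having a specific class.
$external = [];
foreach ($xpath->query('//a[@class="external"]') as $html_node) {
  $external[] = $html_node->getAttribute('href');
}

// Links nested inside a list item.
$in_list = [];
foreach ($xpath->query('//li//a') as $html_node) {
  $in_list[] = $html_node->getAttribute('href');
}

echo implode(', ', $external), "\n";
echo implode(', ', $in_list), "\n";
```

The first query finds only the drupal.org link and the second finds only /about. The parser normalizes the tag and attribute names to lowercase, so the XPath expressions do not have to worry about case.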

When you are finished processing your DOMDocument object, you can convert it back to a string:

$processed_html = $document->saveHTML();
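
One thing to watch for: called without an argument, saveHTML() returns a complete document, including the doctype and the <html> and <body> wrappers that the parser added around your fragment. The sketch below shows one way to get just the fragment back, by serializing each child of the <body> element. This is an illustration with a made-up input string, not necessarily how any particular module handles it.

```php
<?php
// Hypothetical input, for illustration only.
$html_string = '<p>Hello, <a href="https://www.drupal.org">Drupal</a>!</p>';

libxml_use_internal_errors(TRUE);
$document = new \DOMDocument();
$document->loadHTML($html_string);
libxml_clear_errors();

// With no argument, saveHTML() serializes the whole document, including
// the wrappers that the parser added.
$full = $document->saveHTML();

// Passing a node serializes just that node, so looping over the children
// of <body> recovers the fragment.
$fragment = '';
$body = $document->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $fragment .= $document->saveHTML($child);
}

echo $fragment, "\n";
```

For this input, $fragment comes back identical to $html_string, while $full has the extra wrappers.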

Migrate API and the ETL paradigm

In Drupal 8, the Migrate API follows the standard Extract, Transform, Load (ETL) structure, and we also keep the terminology from the contributed Migrate module in earlier versions of Drupal:

  • Extract (source plugin): read data from the source
  • Transform (process plugins): change data to match the site’s structure
  • Load (destination plugin): save the data

Each migration has a single source plugin and a single destination plugin, but each field uses at least one process plugin and may use several. I think this is the fun part: creating new, easy-to-configure process plugins is the best way to add reusable code to the framework.

The Transform/process phase is also the right place to handle HTML processing.

New process plugins for managing HTML

So far, Marco and I have contributed four process plugins to the Migrate Plus module. The goal of these plugins is to make it easy to process text fields with proper HTML parsing. The plugins create the required DOMDocument and related objects, so the person writing the migration only has to supply the XPath expression and other configuration.

The dom plugin

This plugin handles creating the DOMDocument object from a string, and then converting back to a string at the end. The other plugins go between these two steps, so they take a DOMDocument object as input, do some processing on it, and return the same object. This is what it looks like in practice:

process:
  'body/value':
    -
      plugin: dom
      method: import
      source: 'body/0/value'
    # Other plugins do their work here.
    -
      plugin: dom
      method: export

The dom_str_replace plugin

Suppose, as part of your site upgrade, you decide to change the subdomain. For example, you might decide to change documentation.example.com to help.example.com. If you have any links in your text fields, then you need to update them. You can do this with the dom_str_replace plugin:

    -
      plugin: dom_str_replace
      mode: attribute
      xpath: '//a'
      attribute_options:
        name: href
      search: 'documentation.example.com'
      replace: 'help.example.com'

Warning: The xpath key was called expression in version 8.x-4.2 of the Migrate Plus module. Use xpath starting with the recently released version 8.x-5.0-rc1.

Like the str_replace plugin that is already part of the Migrate Plus module, this plugin supports either basic string replacement, using the PHP str_replace() or str_ireplace() function, or regular expressions, using preg_replace().
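
For comparison, here is roughly what that configuration amounts to in plain PHP, using basic string replacement. This is a sketch of the idea with a made-up input string, not the plugin's implementation.

```php
<?php
// Hypothetical input: the old subdomain appears in an href attribute and
// also in the visible text.
$html_string = '<p>Read the <a href="https://documentation.example.com/install">install guide</a> at documentation.example.com.</p>';

libxml_use_internal_errors(TRUE);
$document = new \DOMDocument();
$document->loadHTML($html_string);
libxml_clear_errors();
$xpath = new \DOMXPath($document);

// Rough equivalent of mode: attribute, xpath: '//a', name: href.
foreach ($xpath->query('//a') as $html_node) {
  // Skip anchors that have no href, so that we do not add an empty one.
  if (!$html_node->hasAttribute('href')) {
    continue;
  }
  $href = $html_node->getAttribute('href');
  $html_node->setAttribute('href', str_replace('documentation.example.com', 'help.example.com', $href));
}

// Serialize only the children of <body>.
$processed_html = '';
$body = $document->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $processed_html .= $document->saveHTML($child);
}

echo $processed_html, "\n";
```

The href attribute now points to help.example.com, but the visible text still mentions the old subdomain. Targeting only the attribute is exactly the kind of precision that is awkward to get from a regular expression run over the whole string.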

The dom_apply_styles plugin

If you are using the Migrate API to import data from an external source, then you want the imported data to have formatting consistent with the rest of your site. Perhaps you have configured Drupal’s Editor module to add certain CSS classes from the Styles menu of the WYSIWYG editor, but you cannot add those classes to the external source.

This plugin lets you select HTML elements with an XPath expression and replace them with whatever is configured in the Editor module. For example,

    -
      plugin: dom_apply_styles
      format: full_html
      rules:
        -
          xpath: '//b'
          style: Bold

This will replace <b>...</b> with whatever style is labeled “Bold” in the Full HTML text format, perhaps <strong class="normal-size">...</strong>.
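
The underlying DOM manipulation looks something like the sketch below. The real plugin reads the replacement from the Editor configuration. Here I have hard-coded the <strong class="normal-size"> example from above, and the input string is made up, so treat it as an illustration of the technique rather than the plugin's code.

```php
<?php
// Hypothetical input, for illustration only.
$html_string = '<p>This is <b>very <i>important</i></b>.</p>';

libxml_use_internal_errors(TRUE);
$document = new \DOMDocument();
$document->loadHTML($html_string);
libxml_clear_errors();
$xpath = new \DOMXPath($document);

foreach ($xpath->query('//b') as $old_node) {
  // Hard-coded stand-in for the style labeled "Bold".
  $new_node = $document->createElement('strong');
  $new_node->setAttribute('class', 'normal-size');
  // Move the children (text and nested elements) to the new element.
  while ($old_node->firstChild) {
    $new_node->appendChild($old_node->firstChild);
  }
  // Swap the new element into the old one's place.
  $old_node->parentNode->replaceChild($new_node, $old_node);
}

// Serialize only the children of <body>.
$processed_html = '';
$body = $document->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $processed_html .= $document->saveHTML($child);
}

echo $processed_html, "\n";
```

The nested <i> element survives the move intact, which is something you get for free when working with a parsed document.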

The dom_migration_lookup plugin

If you are migrating from a Drupal 7 site, then perhaps node/123 on the old site becomes node/456 on the new site. If you have entity-reference fields, then you can update references like these using the migration_lookup plugin from the core Migrate module.

If those references are in links in a text field, then you can now use the dom_migration_lookup plugin:

    -
      plugin: dom_migration_lookup
      mode: attribute
      xpath: '//a'
      attribute_options:
        name: href
      search: '@/node/(\d+)@'
      replace: '/node/[mapped-id]'
      migrations:
        - article
        - page

If either the article or page migration has mapped 123 to 456, then this will replace /node/123 in any href attributes with /node/456.
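
Here is a plain-PHP sketch of the same idea. The $id_map array is a hypothetical stand-in for the lookup against the article and page migrations, and leaving unmapped links untouched is a choice I made for this example. I am not claiming that the plugin behaves the same way in that case.

```php
<?php
// Hypothetical stand-in for the migration map: source ID => destination ID.
$id_map = [123 => 456];

// Hypothetical input: one link that was migrated and one that was not.
$html_string = '<p><a href="/node/123">Migrated</a> and <a href="/node/999">not migrated</a></p>';

libxml_use_internal_errors(TRUE);
$document = new \DOMDocument();
$document->loadHTML($html_string);
libxml_clear_errors();
$xpath = new \DOMXPath($document);

foreach ($xpath->query('//a[@href]') as $html_node) {
  $href = preg_replace_callback(
    '@/node/(\d+)@',
    function (array $matches) use ($id_map) {
      $source_id = (int) $matches[1];
      // In this sketch, leave the link alone when no mapping exists.
      return isset($id_map[$source_id]) ? '/node/' . $id_map[$source_id] : $matches[0];
    },
    $html_node->getAttribute('href')
  );
  $html_node->setAttribute('href', $href);
}

// Serialize only the children of <body>.
$processed_html = '';
$body = $document->getElementsByTagName('body')->item(0);
foreach ($body->childNodes as $child) {
  $processed_html .= $document->saveHTML($child);
}

echo $processed_html, "\n";
```

With this map, /node/123 becomes /node/456 and /node/999 is left as it was.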

Like the core migration_lookup plugin, this one violates the strict ETL paradigm, since a process plugin (i.e., code in the Transform stage) has to “peek” at the destination database. Ditto for the dom_apply_styles plugin, which reads configuration from the destination database.

References