Importing an Atom Feed with the Drupal 8 Migrate API and Paragraphs (Part 2)

Benji Fisher // January 2018

I recently worked on a Drupal 8 site where we had to create blog posts (nodes) from an Atom feed that updates periodically. In earlier versions of Drupal, this would have been a job for the Feeds module, but in Drupal 8 we use the core Migrate API instead.

Here are some of the features of this project.

  • Handle an Atom feed with the Migrate framework.
  • Download images and create File and Media entities in Drupal.
  • Skip some entries based on a custom field.
  • Split the main body into separate image and text paragraphs.

I covered the first three items in Part 1 of this post. This post is about the last point, which is the most fun! The same technique can also be useful when migrating from Drupal 7 or another legacy system.

I decided to use a proper HTML parser for the job. Everyone knows that parsing HTML with regular expressions is a bad idea, but too often we do it anyway.

There are several code snippets in this post. The full code is available at https://github.com/isovera/atom_migrate.

Parsing HTML and creating Paragraph entities

The feed we are consuming has the main body in a single HTML blob. We have structured the blog content type using the Paragraphs module, so we need to split that blob up into pieces. This gives us more control over how the articles are formatted: it is much easier to lay out the content responsively if we have more structure than a blob.

Luckily, the source feed for this project is pretty consistent. The images are contained in <p> elements with no other text, and those <p> elements are not contained in any other HTML tag. I think the tools I am using could be adapted to more complex markup, but I am happy to start with something simple.

Since I am creating multiple entities from each “row” of the input, this is not a good candidate for a preliminary migration, which is how I handled the “featured image” in Part 1. Instead, I use a custom process plugin that takes an HTML blob as input, creates one or more Paragraph entities, and returns an array of “Paragraph references” as output.

Why did I put “Paragraph references” in quotes? I am glad you asked!

Paragraphs are full-fledged Drupal entities, but a “Paragraph reference” field on a node (or other entity) is not a standard entity reference. It is an entity reference revision (ERR), provided by the Entity Reference Revisions module.

As far as the migration is concerned, all this means is that we need two numbers for each field item. An entity reference just needs a single target_id, but an ERR needs an array with two elements, keyed by target_id and target_revision_id.
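As a minimal sketch of the difference, here are the two item shapes written as plain PHP arrays. The ID values are made up for illustration.

```php
<?php
// A plain entity reference item carries a single ID.
$entity_reference_item = ['target_id' => 42];

// An entity reference revisions (ERR) item carries two.
$err_item = [
  'target_id' => 42,
  'target_revision_id' => 57,
];

// A multi-value Paragraph field is a list of ERR items, which is the shape
// the process plugin has to return.
$field_paragraph = [
  ['target_id' => 42, 'target_revision_id' => 57],
  ['target_id' => 43, 'target_revision_id' => 58],
];
```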

The other point worth making is that I use the DOMDocument and related classes to parse the HTML blob. This uses a true HTML parser, so it is more reliable than regular expressions for splitting the blob up into chunks.
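To illustrate why the parser is more reliable, here is a small sketch that can run outside Drupal. The markup and the regular expression are hypothetical, and the libxml flags are an assumption standing in for the DOM_OPTIONS constant, which may be defined differently in the repository.

```php
<?php
// Hypothetical markup: alt comes before src, which a naive pattern does not
// expect.
$html = '<p>Intro text.</p><p><img alt="A chart" src="https://example.com/chart.png"></p>';

// A naive pattern that assumes src is the first attribute finds nothing.
$regex_found = preg_match('/<img src="([^"]*)"/', $html, $matches);

// The parser does not care about attribute order.
$doc = new DOMDocument();
// Assumption: these flags only silence libxml warnings.
$doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
$img = $doc->getElementsByTagName('img')->item(0);
$src = $img->getAttribute('src');
$alt = $img->getAttribute('alt');
```

A more careful regular expression could handle this particular case, but each new variation in the markup needs another patch, while the parser handles them uniformly.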

Defining a migration process plugin

As I said above, the custom plugin takes a string (HTML blob) as input, creates Paragraph entities, and returns an array of arrays. I reference it in my migration like this:

File atom_migrate/config/install/migrate_plus.migration.blog_node.yml (excerpt)

process:
  field_paragraph:
    plugin: split_into_paragraphs
    source: description

The plugin ID split_into_paragraphs is defined in the plugin’s docblock, following the pattern for the Drupal 8 plugin system. Here is the full docblock, including documentation and a usage example:

File atom_migrate/src/Plugin/migrate/process/SplitIntoParagraphs.php (excerpt)

/**
 * Split an HTML blob into paragraphs.
 *
 * This version assumes that the HTML is a sequence of <p> elements. If the <p>
 * tag wraps a single <img> element, then create a Media paragraph. Otherwise,
 * create a Text Area paragraph.
 *
 * Return an array of arrays. The inner arrays are keyed by 'target_id' and
 * 'target_revision_id', suitable for passing into a Paragraph field.
 *
 * Example:
 *
 * @code
 * process:
 *   field_paragraph:
 *     plugin: split_into_paragraphs
 *     source: html_blob
 * @endcode
 *
 * @MigrateProcessPlugin(
 *   id = "split_into_paragraphs"
 * )
 */
class SplitIntoParagraphs extends ProcessPluginBase {
  // ...
}

 

The main loop for splitting up the blob

The only method that a process plugin has to implement is transform(). The Migrate API will pass the input field (or intermediate results from the process pipeline) to this method.

This is where we initialize the DOMDocument object and then run a simple loop to split up the input. All the hard work is deferred to other methods (see below).

File atom_migrate/src/Plugin/migrate/process/SplitIntoParagraphs.php (excerpt)

  /**
   * {@inheritdoc}
   */
  public function transform($value, MigrateExecutableInterface $migrate_executable, Row $row, $destination_property) {
    $paragraphs = [];
    $post = new DOMDocument();
    // I hope that $value is always a string.
    $post->loadHTML($value, static::DOM_OPTIONS);

    $html = $post->getElementsByTagName('body')->item(0);
    $current = '';
    foreach ($html->childNodes as $child) {
      $fragment = $post->saveHTML($child);
      // If the current node contains an image tag, then push $current onto the
      // output. Assume there is nothing else interesting in the current node,
      // and push the image onto the output.
      if (!static::hasDescendantTag($fragment, 'img')) {
        $current .= $fragment;
      }
      else {
        if (strlen($current)) {
          $paragraphs[] = static::createTextParagraph($current);
          $current = '';
        }
        $paragraphs[] = static::createMediaParagraph($fragment);
      }
    }
    if (strlen($current)) {
      $paragraphs[] = static::createTextParagraph($current);
    }

    return $paragraphs;
  }

Again, this is simple because the markup from our feed is simple. Each direct child of the <body> element is either text or an image, so we create either a text paragraph or a Media paragraph. The DOMDocument parser can handle more complicated markup, but this is a good place to start.
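The splitting logic can be tried out without Drupal. The sketch below mirrors the loop in transform() but returns plain arrays instead of creating Paragraph entities. The function name splitBlob(), the 'type' and 'html' keys, the libxml flags, and the sample markup are all assumptions made for illustration. It also checks each node directly rather than re-parsing its HTML, which is a variation on the code above.

```php
<?php
/**
 * Split an HTML blob into text and media chunks (no Drupal required).
 *
 * Returns a list of arrays keyed by 'type' ('text' or 'media') and 'html'.
 */
function splitBlob(string $html): array {
  $doc = new DOMDocument();
  // Assumption: these flags only silence libxml warnings.
  $doc->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
  $body = $doc->getElementsByTagName('body')->item(0);
  $chunks = [];
  if ($body === NULL) {
    return $chunks;
  }

  $current = '';
  foreach ($body->childNodes as $child) {
    $fragment = $doc->saveHTML($child);
    // Treat the node as an image if it is, or contains, an <img> element.
    $is_image = $child instanceof DOMElement
      && ($child->nodeName === 'img' || $child->getElementsByTagName('img')->length > 0);
    if (!$is_image) {
      // Accumulate consecutive text nodes into one chunk.
      $current .= $fragment;
      continue;
    }
    // Flush any pending text before the image.
    if ($current !== '') {
      $chunks[] = ['type' => 'text', 'html' => $current];
      $current = '';
    }
    $chunks[] = ['type' => 'media', 'html' => $fragment];
  }
  if ($current !== '') {
    $chunks[] = ['type' => 'text', 'html' => $current];
  }

  return $chunks;
}

$chunks = splitBlob('<p>One.</p><p>Two.</p><p><img src="https://example.com/a.png" alt="A"></p><p>Three.</p>');
$types = array_column($chunks, 'type');
```

With this input, the two leading text paragraphs end up in a single text chunk, followed by one media chunk and a final text chunk.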

The main loop uses this helper function:

File atom_migrate/src/Plugin/migrate/process/SplitIntoParagraphs.php (excerpt)

  /**
   * Check whether an HTML blob includes a given tag.
   *
   * @param string $blob
   *   The HTML string to check.
   * @param string $tag
   *   The tag name.
   *
   * @return bool
   *   TRUE if the blob contains the tag $tag, FALSE otherwise.
   */
  protected static function hasDescendantTag($blob, $tag) {
    // Even when $blob comes from a DOMNode object, I could not get this to work
    // using DOMXpath::query().
    $dom = new DOMDocument();
    $dom->loadHTML($blob, static::DOM_OPTIONS);
    $nodes = $dom->getElementsByTagName($tag);
    return ($nodes->length != 0);
  }
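For readers who want to try the XPath route, here is a hedged sketch of one way it may work: pass the node as the context argument of DOMXPath::query() and use a relative expression. An absolute expression such as //img searches the whole document regardless of the context node, which is a common stumbling block, though not necessarily the one encountered here. The function name nodeHasTag(), the libxml flags, and the sample markup are assumptions.

```php
<?php
/**
 * Check whether a DOM node is, or contains, an element with the given tag.
 */
function nodeHasTag(DOMXPath $xpath, DOMNode $node, string $tag): bool {
  // The expression is evaluated relative to $node.
  $matches = $xpath->query('descendant-or-self::' . $tag, $node);
  return $matches !== FALSE && $matches->length > 0;
}

$doc = new DOMDocument();
// Assumption: these flags only silence libxml warnings.
$doc->loadHTML(
  '<p>Just text.</p><p><img src="https://example.com/a.png" alt="A"></p>',
  LIBXML_NOERROR | LIBXML_NOWARNING
);
$xpath = new DOMXPath($doc);
$body = $doc->getElementsByTagName('body')->item(0);

$results = [];
foreach ($body->childNodes as $child) {
  $results[] = nodeHasTag($xpath, $child, 'img');
}
```

Because this works on the existing node, it avoids serializing each child to a string and parsing it a second time.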

Creating a text paragraph

Since Paragraphs are standard Drupal entities, we can add new ones using the create() and save() methods. This code uses the type (“text_area”) and field name (“field_text_area”) configured for this site.

File atom_migrate/src/Plugin/migrate/process/SplitIntoParagraphs.php (excerpt)

  /**
   * Create a Text Area paragraph.
   *
   * @param string $blob
   *   The HTML string to use as the main text field.
   *
   * @return int[]
   *   An array of entity/revision IDs keyed by 'target_id' and
   *   'target_revision_id'.
   */
  protected static function createTextParagraph($blob) {
    $paragraph = Paragraph::create([
      'type' => 'text_area',
      'field_text_area' => [
        'value' => $blob,
        'format' => 'basic_html',
      ],
    ]);
    $paragraph->save();

    return [
      'target_id' => $paragraph->id(),
      'target_revision_id' => $paragraph->getRevisionId(),
    ];
  }

Note the return value: the array with two entries is what the paragraph field needs.

Creating a Media paragraph

The outline is similar to creating a text paragraph, but there are some extra steps:

  • Extract the image URL from the input HTML string.
  • Download the remote image.
  • Create a File entity.
  • Create a Media entity.

Then use the ID of the Media entity when creating the paragraph.

There should be a way to access the image URL using the existing DOMDocument object, but I could not get it to work. So I create a new one and extract the URL (and the alt text).
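As a sketch of how the attributes might be read from a node of the existing document, without a second parse: the function name extractImage(), the returned array keys, the libxml flags, and the sample markup are assumptions for illustration.

```php
<?php
/**
 * Return the src and alt of the first <img> in (or at) a DOM node.
 *
 * Returns NULL if the node is not an element or contains no <img>.
 */
function extractImage(DOMNode $node): ?array {
  if (!$node instanceof DOMElement) {
    return NULL;
  }
  // DOMElement::getElementsByTagName() searches descendants only, so handle
  // the case where the node itself is the <img>.
  $img = $node->nodeName === 'img'
    ? $node
    : $node->getElementsByTagName('img')->item(0);
  if ($img === NULL) {
    return NULL;
  }
  return [
    'src' => $img->getAttribute('src'),
    'alt' => $img->getAttribute('alt'),
  ];
}

$doc = new DOMDocument();
// Assumption: these flags only silence libxml warnings.
$doc->loadHTML(
  '<p><img src="https://example.com/a.png" alt="A photo"></p>',
  LIBXML_NOERROR | LIBXML_NOWARNING
);
$p = $doc->getElementsByTagName('body')->item(0)->firstChild;
$image = extractImage($p);
```

Using this inside the plugin would mean passing the child node, rather than its serialized HTML, from the main loop.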

I found a handy Drupal function that downloads the image, creates a File entity, and returns a File object: system_retrieve_file(). For Drupal 8, this function was rewritten to use the Guzzle library instead of the old, problematic drupal_http_request().

File atom_migrate/src/Plugin/migrate/process/SplitIntoParagraphs.php (excerpt)

  /**
   * Create a Media paragraph.
   *
   * Save the file locally, creating a File entity of type image.
   * Create a Media entity of type image referencing that File entity.
   * Create a Media paragraph referencing that Media entity.
   *
   * @param string $blob
   *   The HTML string containing the image. It should contain an <img>
   *   element. If there is more than one, only the first is used.
   *
   * @return int[]
   *   An array of entity/revision IDs keyed by 'target_id' and
   *   'target_revision_id'.
   */
  protected static function createMediaParagraph($blob) {
    $dom = new DOMDocument();
    $dom->loadHTML($blob, static::DOM_OPTIONS);
    $node = $dom->getElementsByTagName('img')->item(0);
    $src = $node->getAttribute('src');
    $alt_text = $node->getAttribute('alt');

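    // Download and save the file. Note that system_retrieve_file() returns
    // FALSE on failure, and this code does not yet check for that.
    $destination = 'public://' . date('Y-m');
    $file = system_retrieve_file($src, $destination, TRUE);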

    // Create the Media entity.
    $media = Media::create([
      'bundle' => 'image',
      'uid' => '9',
      'field_media_image' => [
        'target_id' => $file->id(),
        'alt' => $alt_text,
      ],
    ]);
    $media->save();

    // Create a Media paragraph entity.
    $paragraph = Paragraph::create([
      'type' => 'media',
      'field_media' => $media->id(),
    ]);
    $paragraph->save();

    return [
      'target_id' => $paragraph->id(),
      'target_revision_id' => $paragraph->getRevisionId(),
    ];
  }

 

Summary

That is most of the code for this plugin. To see it all in context, with a few more pieces (the namespace and use statements, and the DOM_OPTIONS constant), see the git repository.

Room for improvement

It is really satisfying to put all the pieces together, run the migration, and see the nodes added to the site. Still, I see a bunch of things that should be changed to make this more reusable:

  • I should add some error checking and logging.
  • I hard-coded the names of the Paragraph types (text_area and media).
  • I hard-coded the field names (field_text_area and field_media).
  • I hard-coded the user ID, the text format, and the file path.
  • This is only the second time I have worked with the DOMDocument class, and surely some of what I did can be simplified.

Some of those hard-coded values can be determined programmatically. For the others, I wonder if we can add a feature to the Migrate API that pulls settings from other configuration. Then we could provide a standard configuration form for the site administrator and use it to customize the migration.
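One option that needs no new Migrate API feature is to pass the values as extra keys on the plugin in the migration YAML. Drupal makes those keys available to a process plugin in $this->configuration. The sketch below shows only the merging step, as plain PHP. The function name resolveSettings() and all of the key names are hypothetical.

```php
<?php
/**
 * Merge plugin configuration with default settings.
 *
 * Unknown keys (such as 'plugin' or 'source') are ignored.
 */
function resolveSettings(array $configuration): array {
  // Defaults taken from the values hard-coded in this post.
  $defaults = [
    'text_bundle' => 'text_area',
    'text_field' => 'field_text_area',
    'text_format' => 'basic_html',
    'media_bundle' => 'media',
    'media_field' => 'field_media',
    'uid' => 9,
  ];
  // Keep only recognized keys, then fill in whatever is missing.
  return array_intersect_key($configuration, $defaults) + $defaults;
}

// Roughly what a plugin might receive if the migration YAML set text_format.
$settings = resolveSettings([
  'plugin' => 'split_into_paragraphs',
  'source' => 'description',
  'text_format' => 'full_html',
]);
```

The helper methods shown earlier are static, so they would have to become instance methods, or take the settings as a parameter, before they could use values like these.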

I have not tested what happens when rolling back the migration. A rollback will delete the created nodes: the Migrate API keeps track of them and makes sure they are removed. I think that the Paragraph entities attached to the deleted nodes are also removed, but what about the Media and File entities? If they are not removed, then I need some extra code to clean them up.

Reviewing the documentation for MigrateProcessInterface, I think I should override the multiple() method, since transform() returns a list of field items rather than a single value.

References

In the spirit of Open Source, I have borrowed heavily from blog posts, Slack messages, documentation, and other forms of support. Here are links to some of the sources I found helpful for both parts of this post.

I mentioned it at the top but here again is the link to the full code for this post: https://github.com/isovera/atom_migrate.

