Migrate Drupal WYSIWYG to Paragraphs

In November 2022, the Drupal community and the Drupal Security Team will end their support for Drupal 7. By that time, all Drupal websites will need to be on Drupal 9 to continue receiving updates and security fixes from the community. The jump from Drupal 7 to 9 is a tricky migration. It often requires complex transformations to move content stuck in old systems into Drupal’s new paradigm. If you are new to Drupal migrations, you can read the official Drupal Migrate API documentation, follow Mauricio Dinarte’s 31 Days of Drupal Migrations starter series, or watch Redfin Solutions’ own Chris Wells give a crash course training session. This blog series covers more advanced topics such as niche migration tools, content restructuring, and various custom code solutions. To catch up, read the previous blog posts Drupal Migration Basic Fields to Entity References and Migration Custom Source Plugin.

So the Drupal 7 website you’ve been tasked with upgrading to Drupal 9 has WYSIWYG fields filled with various images, videos, iframes, and tables, all with inconsistent formatting. You want to take advantage of this upgrade by switching to a more structured content editing system like Paragraphs, so all those special cases will have a consistent editor experience and display. There is too much content to do this manually, so you need an automated migration. But to impose all this structure, you need to intelligently divide ambiguous content into your specific destination paragraphs. This is a difficult migration task for a few reasons.

There are multiple paragraph types, so they can’t share one migration.
The original WYSIWYG content can be broken into several paragraphs of several different types, but the Drupal migration API wants one source entity to become one destination entity, as discussed in this blog post.
The destination node needs to reference the paragraphs in the exact order as the original content.

These are tricky situations without a widely agreed-upon solution. There are some custom migration modules that could help, like Migrate HTML to Paragraphs. However, they may not fit your exact guidelines, especially if your paragraphs reference further entities like media or ECK entities. So how do you handle this?

There is no perfect solution. That said, one recipe for success is to write a custom process plugin that breaks up the WYSIWYG content, creates the correct paragraphs on the fly (as well as any file/media/ECK entities), and returns the paragraph IDs in the correct order for the destination node to reference. This simplifies the whole problem to one custom plugin. But beware: paragraphs created this way can’t be rolled back or updated through the migration pipeline. This makes testing more difficult because a migration update is no longer idempotent: every run creates a fresh set of paragraphs, so running it multiple times in a row will stuff the database with orphaned paragraphs. Credit for the starting point code belongs to Benji Fisher. Keep in mind there are two shortcomings with this code that we will resolve later on.
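To make the "returns the paragraph IDs in the correct order" part of that recipe concrete, here is a minimal sketch in plain PHP. It is not Benji Fisher's code. The function name, the chunk array shape, and the $saveParagraph callable are all assumptions made for illustration; the callable stands in for whatever actually creates and saves a paragraph.

```php
<?php

/**
 * Turns ordered content chunks into ordered paragraph references.
 *
 * Hypothetical helper: $saveParagraph stands in for the code that creates
 * and saves one paragraph and returns its [id, revision id] pair.
 *
 * @param array $chunks
 *   Chunks in source order, each ['type' => string, 'value' => string].
 * @param callable $saveParagraph
 *   Receives one chunk and returns [int $id, int $revisionId].
 *
 * @return array
 *   One reference item per saved chunk, in the same order as the input.
 */
function buildParagraphReferences(array $chunks, callable $saveParagraph): array {
  $references = [];
  foreach ($chunks as $chunk) {
    // Skip whitespace-only chunks so no empty paragraphs get created.
    if (trim($chunk['value']) === '') {
      continue;
    }
    [$id, $revisionId] = $saveParagraph($chunk);
    // Paragraph reference fields store both the ID and the revision ID.
    $references[] = [
      'target_id' => $id,
      'target_revision_id' => $revisionId,
    ];
  }
  return $references;
}

// Stand-in for real entity storage: hands out sequential IDs.
$nextId = 1;
$fakeSave = function (array $chunk) use (&$nextId): array {
  $id = $nextId++;
  return [$id, $id];
};

$references = buildParagraphReferences([
  ['type' => 'text', 'value' => '<p>Intro</p>'],
  ['type' => 'image', 'value' => 'public://photo.jpg'],
  ['type' => 'text', 'value' => '   '],
  ['type' => 'text', 'value' => '<p>Outro</p>'],
], $fakeSave);
```

In a real process plugin the callable would presumably wrap Paragraph::create(), save(), id(), and getRevisionId(). Every call to it writes to the database, which is exactly why repeated runs leave orphaned paragraphs behind.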

Let’s break it all down. The first step is to import the WYSIWYG data into a DOMDocument in order to programmatically analyze the HTML. The DOMDocument is a data tree where each HTML tag is represented as a DOMNode that references any tags inside of it as children. You want to split up this DOMDocument so that each piece of content gets neatly mapped into the best fitting paragraph type. Since most of the content is simple text, the text paragraph type will be the default. So you need to test each piece of content to see if it matches a different paragraph type like image, video, or table. If it doesn’t match any of them, then you can safely drop it into the default text paragraph. You should start this process by iterating through all the top-level DOMNodes. For example:

<div class="wysiwyg-container">
  <div class="top-level">
    <p class="text"></p>
  </div>
  <div class="top-level">
    <img class="image" />
  </div>
  <div class="top-level">
    <p class="text"></p>
  </div>
  <div class="top-level">
    <table class="table"></table>
  </div>
</div>

When you start with the wysiwyg-container div, you will see four direct children with the top-level class. Starting with the first child, test for each special case. One of the advantages of using a DOMDocument is that you can call the getElementsByTagName method on any element to ask whether it has descendant tags like img, table, or iframe. On a match, turn all the content in the branch into a new paragraph. Otherwise, put each chunk of consecutive text into a new text paragraph. Here lies the first problem with Benji’s template: branches with mixed content.
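As a rough sketch of that loop, the snippet below loads a small fragment, walks the top-level children, and groups consecutive text into one chunk. The guessParagraphType() helper, the tag-to-type map, and the chunk array shape are assumptions for illustration, not part of the original template.

```php
<?php

/**
 * Picks a paragraph type for one top-level element.
 *
 * The tag-to-type map and the type names are assumptions for illustration.
 */
function guessParagraphType(DOMElement $element): string {
  $map = ['img' => 'image', 'table' => 'table', 'iframe' => 'video'];
  foreach ($map as $tag => $type) {
    // Match the element itself or any descendant with this tag.
    if ($element->tagName === $tag
      || $element->getElementsByTagName($tag)->length > 0) {
      return $type;
    }
  }
  return 'text';
}

$html = '<div class="wysiwyg-container">'
  . '<div class="top-level"><p class="text">One</p></div>'
  . '<div class="top-level"><p class="text">Two</p></div>'
  . '<div class="top-level"><img class="image" src="a.png"></div>'
  . '<div class="top-level"><table class="table"><tr><td>x</td></tr></table></div>'
  . '</div>';

$document = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(TRUE);
$document->loadHTML($html);
libxml_clear_errors();

// The first div in document order is the wysiwyg-container wrapper.
$container = $document->getElementsByTagName('div')->item(0);

$chunks = [];
foreach ($container->childNodes as $child) {
  $type = $child instanceof DOMElement ? guessParagraphType($child) : 'text';
  $markup = $document->saveHTML($child);
  $last = count($chunks) - 1;
  // Merge consecutive text so each run becomes a single text paragraph.
  if ($type === 'text' && $last >= 0 && $chunks[$last]['type'] === 'text') {
    $chunks[$last]['value'] .= $markup;
  }
  else {
    $chunks[] = ['type' => $type, 'value' => $markup];
  }
}
```

Note that the test claims the whole branch for the matching type, even when that branch mixes several kinds of content.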

Most of the time this isn’t an issue. If there’s a table inside a div, you probably want the entire div for your table paragraph. However, if there’s an image inside an anchor tag, your image paragraph will grab the image data but ignore the link, losing data along the way.

<!-- Top level tag -->
<div class="top-level">
  <!-- Link around image -->
  <a>
    <!-- Image -->
    <img />
  </a>
</div>

This may be an acceptable loss, or an issue that can be flagged for manual intervention. Otherwise, you’ll need a method that can recursively traverse an entire branch of the DOMDocument, cherry-pick the desired HTML element, and put the rest into the default bucket.

For instance:

/**
 * Recursively navigates a DOM branch to pull out a specific tag.
 *
 * @param \DOMDocument $post
 *   The full page DOMDocument.
 * @param \DOMNode $parent
 *   The parent DOM node.
 * @param string $tag
 *   The tag name.
 * @param string $current
 *   The running text content to append to, passed by reference.
 *
 * @return \DOMElement[]
 *   The DOM nodes of type $tag, in document order.
 */
protected static function recursiveTagFinder(\DOMDocument $post, \DOMNode $parent, $tag, &$current) {
  $tagChildren = [];
  // Iterate through direct children.
  foreach ($parent->childNodes as $child) {
    // Text nodes, comments, and other non-element nodes are leaves on the
    // DOM tree that can't be processed any further.
    if (!$child instanceof \DOMElement) {
      $current .= $post->saveHTML($child);
      continue;
    }
    // If the child has descendants of $tag, recursively find them. Use
    // array_merge() rather than +=, which would silently drop results that
    // share a numeric key with ones already collected.
    if ($child->getElementsByTagName($tag)->length != 0) {
      $tagChildren = array_merge($tagChildren, static::recursiveTagFinder($post, $child, $tag, $current));
    }
    // If the child is a desired tag, grab it.
    elseif ($child->tagName == $tag) {
      $tagChildren[] = $child;
    }
    // Otherwise, convert the child to HTML and add it to the running text.
    else {
      $current .= $post->saveHTML($child);
    }
  }
  return $tagChildren;
}

If a top level DOMNode indicates that it has an img tag, then this method will search all the DOMNode children to find the img tag and push everything else into the default text paragraph, the $current variable. There will likely be some disjointed effects when pulling an element out of a nested situation: a table may lose special formatting or an image may not be clickable as a link anymore. Some issues can be fixed in the plugin. In the migration, I checked if any img tags had an adjacent div with the caption class and stored that in the caption field on the paragraph. Again, others may need to be manually adjusted, like reordering paragraphs or adjusting a table. Remember that it’s much faster to tweak flagged migration content than to find and fix missing data manually.
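The caption check could look something like the sketch below. It assumes the caption is the next element after the image and is a div whose class attribute includes "caption"; the findImageCaption() name is hypothetical, and your markup may well differ.

```php
<?php

/**
 * Returns the caption text next to an image, or NULL if there is none.
 *
 * Assumes the caption is the next element sibling and is a div whose class
 * attribute includes "caption". Adjust this to match the real markup.
 */
function findImageCaption(DOMElement $image): ?string {
  $sibling = $image->nextSibling;
  // Skip text nodes and comments sitting between the image and the div.
  while ($sibling !== NULL && !($sibling instanceof DOMElement)) {
    $sibling = $sibling->nextSibling;
  }
  if ($sibling === NULL || $sibling->tagName !== 'div') {
    return NULL;
  }
  // The class attribute may hold several space-separated class names.
  $classes = preg_split('/\s+/', trim($sibling->getAttribute('class')));
  return in_array('caption', $classes, TRUE) ? trim($sibling->textContent) : NULL;
}

$document = new DOMDocument();
// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(TRUE);
$document->loadHTML(
  '<div><img src="a.png"> <div class="caption small">A red barn</div></div>'
  . '<div><img src="b.png"><p>Not a caption</p></div>'
);
libxml_clear_errors();

$images = $document->getElementsByTagName('img');
$first = findImageCaption($images->item(0));
$second = findImageCaption($images->item(1));
```

When a caption is found, remember to keep the caption div out of the running text as well, or it will show up twice on the migrated page.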

On to the next issue: embedded media in Drupal 7. The tricky aspect here is that the media embed is not stored in the database as an image tag, but as a JSON object inside the HTML. This requires a whole new set of tests to parse it out. To start, check each DOMNode for the opening brackets of the object [[{ and the closing brackets }]]. If it only contains one or the other, then there isn’t enough information to do anything. If it contains both, then get the substring from open to close and run json_decode with its second argument set to true. This will return either an array of data from the JSON object or null if it’s not valid JSON. The data in this array should contain an fid key that corresponds to the file ID of the embedded image. That file ID can then be used to grab the image from the migration database’s file_managed table and create an image paragraph.
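Here is one hedged way to write that test. As a small variation on the steps above, this sketch decodes only the inner {...} object rather than the full bracketed token, so the fid key sits at the top level of the result. The extractMediaFid() name is hypothetical, and the sketch assumes at most one token per string.

```php
<?php

/**
 * Extracts the file ID from a Drupal 7 media embed token, if one is present.
 *
 * Assumes tokens look like [[{"fid":"12","type":"media"}]] and that at most
 * one token appears in the string. The helper name is hypothetical.
 */
function extractMediaFid(string $html): ?int {
  $open = strpos($html, '[[{');
  if ($open === FALSE) {
    return NULL;
  }
  $close = strpos($html, '}]]', $open);
  if ($close === FALSE) {
    // An opening marker without a closing one is not enough to work with.
    return NULL;
  }
  // Keep only the {...} object, dropping the surrounding [[ and ]].
  $json = substr($html, $open + 2, $close - $open - 1);
  // With TRUE as the second argument, json_decode() returns an associative
  // array, or NULL when the string is not valid JSON.
  $data = json_decode($json, TRUE);
  if (!is_array($data) || !isset($data['fid']) || !is_numeric($data['fid'])) {
    return NULL;
  }
  return (int) $data['fid'];
}

$fid = extractMediaFid('<p>[[{"fid":"42","view_mode":"default","type":"media"}]]</p>');
$broken = extractMediaFid('<p>[[{not json}]]</p>');
```

The returned ID is what you would look up in the file_managed table. A NULL result is a good candidate for flagging the content for manual review.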

Those are the main gaps in Benji Fisher’s custom process plugin. Of course, each implementation requires even more tweaks and data manipulations to get the exact migration to work correctly. Some content simply will not transfer cleanly, so remember to test thoroughly and stay in tight communication with your client in order to turn WYSIWYG chaos into neat paragraphs.

If you found this migration series useful, share it with your friends, and don’t forget to tag us @redfinsolutions on Facebook, LinkedIn, and Twitter.