./axel.leroy.sh

Developer, tech nerd and photographer

How to create a WXR/XML import file: migrating from a static site generator to Wordpress

19 Apr 2020

Many people on the Internet will instruct you how to migrate your website from Wordpress to a static website generator, but what happens if you have to go the other way around? Well, that’s what I had to do last year at Wedoogift, and I have compiled a few tips to help you out.

# The context

Up until last year, Wedoogift’s homepage was a static website built with Middleman. It was super fast to load (obviously, since it was static and served through AWS S3 and CloudFront) but it was a pain to update: only a developer could edit and deploy its content, and even worse, nobody on my team knew exactly how it worked, as the developer who created the website had left the company long before I joined.

Up until 2019, it wasn’t much of an issue: the website’s content had to be updated less than once a month. But the company had been growing and set up a Marketing team which had as its first mission to entirely revitalize the website. That meant giving the Marketing department more autonomy and control over the homepage, and obviously static website generators were not going to cut it. Multiple solutions were studied but ultimately the company settled on using Wordpress.

I’ll spare you the details, but a contractor put together a Wordpress site with some plugins and a custom theme, and helped the Marketing department create brand-new pages. Two things were then left to be migrated over to the new homepage: the blog’s articles and the list of shops where you can use our vouchers.

Migrating the articles was fairly straightforward as the blog was already powered by Wordpress: create an XML export containing the articles, import it into the new Wordpress instance (which automatically downloads the attached pictures), fix a few things using SQL queries if needed, and you’re done.

Migrating the shops, on the other hand… was not that straightforward. Let me explain why:

# The problem

For starters, the shops were stored in a single JSON file and I had to create a new page in Wordpress for each of them. To add complexity, the pages were using custom post types and had additional data to be filled in, like the shop category or a link to a help page.

Instinctively, I thought that I would just have to create an XML import file from JSON and my job would be done… but in retrospect I got a bit ahead of myself.

Why, you ask? Well, one would think that WXR (Wordpress eXtended RSS), the format of the XML import file, would be widely documented, right? I could not have been more wrong: after hours of research, I only found a single page documenting it, and it was neither very useful nor complete. Other pages I found just straight-up suggested reverse-engineering the parsing from the source code.

Well, I was certainly not willing to read hundreds of lines of PHP, and it would certainly not have handled the additional data I needed to add.

Instead of reading the source code, I set out to understand the structure from an export file generated from pages on the target Wordpress install.


# The findings

Disclaimer: Do not consider the following as exactly true. These are my observations, which may be wrong: some attributes may be ignored by the importer, or I may have made a wrong guess on the meaning of other attributes. Always do a backup before importing and if you can, try first on a test environment.

From there, I found some interesting bits: the first being that Wordpress import files are actually RSS 2.0 with Wordpress-specific namespaces!

The second being that every media file (picture, video, etc.) is a Wordpress post with a type and metadata specific to media files. And since these posts have IDs, you can then reference them in posts or pages for stuff like thumbnails.
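To see this for yourself, you can count the items of an export by post type. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the `SAMPLE` string and the `count_post_types` helper are made up for illustration and stand in for a real export file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace URI as it appears in the header of a WXR 1.2 export.
NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hand-written sample standing in for a real export file (illustration only).
SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/" version="2.0">
  <channel>
    <item><title>thumbnail</title><wp:post_type><![CDATA[attachment]]></wp:post_type></item>
    <item><title>My Page</title><wp:post_type><![CDATA[page]]></wp:post_type></item>
    <item><title>My Article</title><wp:post_type><![CDATA[post]]></wp:post_type></item>
  </channel>
</rss>"""


def count_post_types(wxr_text):
    """Count the <item> elements of a WXR document, grouped by wp:post_type."""
    root = ET.fromstring(wxr_text)
    return Counter(
        item.findtext("wp:post_type", default="", namespaces=NS)
        for item in root.iter("item")
    )


counts = count_post_types(SAMPLE)
for post_type, count in sorted(counts.items()):
    print(post_type, count)
```

Media files show up in the same list as pages and posts, only with `attachment` as their type.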

Without further ado, let me break down how Wordpress import files are built:

# Namespaces and website description

First, the WXR is initialized with the RSS, Dublin Core and Wordpress namespaces:

<?xml version='1.0' encoding='UTF-8'?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:excerpt="http://wordpress.org/export/1.2/excerpt/"
     xmlns:wp="http://wordpress.org/export/1.2/"
     version="2.0">

Then, a <channel> is created containing information on the website:

<channel>
    <title>My website title</title>
    <link>https://domain.tld</link>
    <language>fr_FR</language>
    <wp:wxr_version>1.2</wp:wxr_version>
    <wp:base_site_url>https://domain.tld</wp:base_site_url>
    <wp:base_blog_url>https://domain.tld</wp:base_blog_url>
    <wp:author>
      <wp:author_id>1</wp:author_id>
      <wp:author_login><![CDATA[username]]></wp:author_login>
      <wp:author_email><![CDATA[user@domain.tld]]></wp:author_email>
      <wp:author_display_name><![CDATA[User Name]]></wp:author_display_name>
      <wp:author_first_name><![CDATA[User]]></wp:author_first_name>
      <wp:author_last_name><![CDATA[Name]]></wp:author_last_name>
    </wp:author>
    <generator>https://wordpress.org/?v=5.2</generator>
    ...
</channel>
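If you prefer to generate this header from code, here is a rough sketch with Python's standard `xml.etree.ElementTree` (not the lxml-based library presented later). The `build_channel` helper is a name I made up for this example, and the author and generator nodes are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace URIs copied from the <rss> header above.
NAMESPACES = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "excerpt": "http://wordpress.org/export/1.2/excerpt/",
    "wp": "http://wordpress.org/export/1.2/",
}
# Ask ElementTree to use the usual prefixes instead of ns0, ns1...
for prefix, uri in NAMESPACES.items():
    ET.register_namespace(prefix, uri)

WP = "{%s}" % NAMESPACES["wp"]


def build_channel(title, site_url, language):
    """Hypothetical helper: create <rss> and a <channel> describing the site."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    nodes = [
        ("title", title),
        ("link", site_url),
        ("language", language),
        (WP + "wxr_version", "1.2"),
        (WP + "base_site_url", site_url),
        (WP + "base_blog_url", site_url),
    ]
    for tag, text in nodes:
        ET.SubElement(channel, tag).text = text
    return rss, channel


rss, channel = build_channel("My website title", "https://domain.tld", "fr_FR")
# The standard library only declares the namespaces that are actually used.
xml_text = ET.tostring(rss, encoding="unicode")
print(xml_text)
```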

A few things to note here:

# Post content

Let’s now dissect a post and its thumbnail.

<item>
  <title>thumbnail</title>
  <link>https://domain.tld/assets/thumbnail.png</link>
  <dc:creator><![CDATA[username]]></dc:creator>
  <description></description>
  <wp:post_id>1241</wp:post_id>
  <wp:post_date><![CDATA[2019-05-27 10:08:23]]></wp:post_date>
  <wp:post_date_gmt><![CDATA[2019-05-27 08:08:23]]></wp:post_date_gmt>
  <wp:comment_status><![CDATA[closed]]></wp:comment_status>
  <wp:ping_status><![CDATA[closed]]></wp:ping_status>
  <wp:post_name><![CDATA[thumbnail]]></wp:post_name>
  <wp:status><![CDATA[publish]]></wp:status>
  <wp:post_parent>0</wp:post_parent>
  <wp:menu_order>0</wp:menu_order>
  <wp:post_type><![CDATA[attachment]]></wp:post_type>
  <wp:post_password><![CDATA[]]></wp:post_password>
  <wp:is_sticky>0</wp:is_sticky>
  <wp:attachment_url><![CDATA[https://domain.tld/assets/thumbnail.png]]></wp:attachment_url>
  <wp:postmeta>
    <wp:meta_key><![CDATA[_wp_attached_file]]></wp:meta_key>
    <wp:meta_value><![CDATA[2019/05/thumbnail]]></wp:meta_value>
  </wp:postmeta>
  <guid isPermaLink="false">https://domain.tld/assets/thumbnail.png</guid>
</item>
<item>
  <title>My Page</title>
  <link>https://domain.tld/my-page</link>
  <dc:creator><![CDATA[username]]></dc:creator>
  <description></description>
  <wp:post_id>1242</wp:post_id>
  <wp:post_date><![CDATA[2019-05-27 10:08:23]]></wp:post_date>
  <wp:post_date_gmt><![CDATA[2019-05-27 08:08:23]]></wp:post_date_gmt>
  <wp:comment_status><![CDATA[closed]]></wp:comment_status>
  <wp:ping_status><![CDATA[closed]]></wp:ping_status>
  <wp:post_name><![CDATA[my-page]]></wp:post_name>
  <wp:status><![CDATA[publish]]></wp:status>
  <wp:post_parent>0</wp:post_parent>
  <wp:menu_order>0</wp:menu_order>
  <wp:post_type><![CDATA[page]]></wp:post_type>
  <wp:post_password><![CDATA[]]></wp:post_password>
  <wp:is_sticky>0</wp:is_sticky>
  <content:encoded><![CDATA[
<h1>Some title</h1>
<p>Lorem Ipsum</p>
]]></content:encoded>
  <excerpt:encoded><![CDATA[<p>Lorem Ipsum</p><p>Sin dolor amet</p>]]></excerpt:encoded>
  <category domain="univers" nicename="mode-beaute"><![CDATA[Mode & beauté]]></category>
  <wp:postmeta>
    <wp:meta_key><![CDATA[_thumbnail_id]]></wp:meta_key>
    <wp:meta_value><![CDATA[1241]]></wp:meta_value>
  </wp:postmeta>
</item>
<item>
  <title>My Article</title>
  <link>https://domain.tld/my-article</link>
  <dc:creator><![CDATA[username]]></dc:creator>
  <description></description>
  <wp:post_id>1243</wp:post_id>
  <wp:post_date><![CDATA[2019-05-27 10:08:23]]></wp:post_date>
  <wp:post_date_gmt><![CDATA[2019-05-27 08:08:23]]></wp:post_date_gmt>
  <wp:comment_status><![CDATA[closed]]></wp:comment_status>
  <wp:ping_status><![CDATA[closed]]></wp:ping_status>
  <wp:post_name><![CDATA[my-article]]></wp:post_name>
  <wp:status><![CDATA[publish]]></wp:status>
  <wp:post_parent>0</wp:post_parent>
  <wp:menu_order>0</wp:menu_order>
  <wp:post_type><![CDATA[post]]></wp:post_type>
  <wp:post_password><![CDATA[]]></wp:post_password>
  <wp:is_sticky>0</wp:is_sticky>
  <content:encoded><![CDATA[
<h1>Some title</h1>
<p>Lorem Ipsum</p>
]]></content:encoded>
  <excerpt:encoded><![CDATA[<p>Lorem Ipsum</p><p>Sin dolor amet</p>]]></excerpt:encoded>
  <category domain="category" nicename="blog-posts"><![CDATA[Blog posts]]></category>
  <category domain="post_tag" nicename="articles"><![CDATA[articles]]></category>
  <category domain="post_tag" nicename="hello-world"><![CDATA[hello world]]></category>
  <wp:comment>
    <wp:comment_id>1</wp:comment_id>
    <wp:comment_author><![CDATA[Some Visitor]]></wp:comment_author>
    <wp:comment_author_email><![CDATA[visitor@wordpress.example]]></wp:comment_author_email>
    <wp:comment_author_url>https://wordpress.org/</wp:comment_author_url>
    <wp:comment_author_IP><![CDATA[127.0.0.1]]></wp:comment_author_IP>
    <wp:comment_date><![CDATA[2020-04-19 17:57:25]]></wp:comment_date>
    <wp:comment_date_gmt><![CDATA[2020-04-19 15:57:25]]></wp:comment_date_gmt>
    <wp:comment_content><![CDATA[<p>Comment HTML</p>]]></wp:comment_content>
    <wp:comment_approved><![CDATA[1]]></wp:comment_approved>
    <wp:comment_type><![CDATA[]]></wp:comment_type>
    <wp:comment_parent>0</wp:comment_parent>
    <wp:comment_user_id>0</wp:comment_user_id>
  </wp:comment>
  <wp:postmeta>
    <wp:meta_key><![CDATA[_thumbnail_id]]></wp:meta_key>
    <wp:meta_value><![CDATA[1241]]></wp:meta_value>
  </wp:postmeta>
</item>
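As an illustration of how such items can be produced from JSON records, here is a trimmed-down sketch using the standard library only. The record keys (`name`, `slug`, `description`), the `shop` post type and the `shop_to_item` helper are all assumptions made up for this example, and the fixed UTC+2 offset simply mirrors the dates used above. The standard library escapes the HTML content instead of wrapping it in CDATA, which is equivalent XML.

```python
import json
import xml.etree.ElementTree as ET
from datetime import datetime, timedelta, timezone

WP_NS = "http://wordpress.org/export/1.2/"
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"
ET.register_namespace("wp", WP_NS)
ET.register_namespace("content", CONTENT_NS)
WP = "{%s}" % WP_NS
CONTENT = "{%s}" % CONTENT_NS
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def shop_to_item(channel, shop, post_id, site_url, created_at):
    """Hypothetical helper: append one <item> built from a JSON record."""
    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = shop["name"]
    ET.SubElement(item, "link").text = "{0}/{1}".format(site_url, shop["slug"])
    wp_fields = [
        ("post_id", str(post_id)),
        # Local time first, then the same instant converted to GMT.
        ("post_date", created_at.strftime(DATE_FORMAT)),
        ("post_date_gmt", created_at.astimezone(timezone.utc).strftime(DATE_FORMAT)),
        ("post_name", shop["slug"]),
        ("status", "publish"),
        ("post_type", "shop"),  # assumed custom post type name
    ]
    for tag, text in wp_fields:
        ET.SubElement(item, WP + tag).text = text
    ET.SubElement(item, CONTENT + "encoded").text = shop["description"]
    return item


# Made-up record standing in for the real JSON file.
shops = json.loads(
    '[{"name": "Some Shop", "slug": "some-shop", "description": "<p>Lorem Ipsum</p>"}]'
)
channel = ET.Element("channel")
created_at = datetime(2019, 5, 27, 10, 8, 23, tzinfo=timezone(timedelta(hours=2)))
for post_id, shop in enumerate(shops, start=2000):
    shop_to_item(channel, shop, post_id, "https://domain.tld", created_at)

print(ET.tostring(channel, encoding="unicode"))
```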

# Attachments

For attachments, you can ask Wordpress to download them when importing by checking “Download and import file attachments”. Do not forget to set <wp:attachment_url>!

You can also choose where the file will be stored by setting the following postmeta:

<wp:postmeta>
  <wp:meta_key><![CDATA[_wp_attached_file]]></wp:meta_key>
  <wp:meta_value><![CDATA[2019/05/thumbnail]]></wp:meta_value>
</wp:postmeta>

Finally, you can use the attachment as a post or page thumbnail by setting the _thumbnail_id meta to the attachment’s ID.
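Since the thumbnail link is just an ID, it is easy to reference an attachment that is not in the file. Here is a small sanity check you could run before importing, sketched with the standard library; `find_dangling_thumbnails` and the `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

NS = {"wp": "http://wordpress.org/export/1.2/"}

# Hand-written sample: the page points at a real attachment, the post does not.
SAMPLE = """<rss xmlns:wp="http://wordpress.org/export/1.2/" version="2.0"><channel>
<item><wp:post_id>1241</wp:post_id><wp:post_type>attachment</wp:post_type></item>
<item><wp:post_id>1242</wp:post_id><wp:post_type>page</wp:post_type>
  <wp:postmeta><wp:meta_key>_thumbnail_id</wp:meta_key><wp:meta_value>1241</wp:meta_value></wp:postmeta>
</item>
<item><wp:post_id>1243</wp:post_id><wp:post_type>post</wp:post_type>
  <wp:postmeta><wp:meta_key>_thumbnail_id</wp:meta_key><wp:meta_value>9999</wp:meta_value></wp:postmeta>
</item>
</channel></rss>"""


def find_dangling_thumbnails(root):
    """Return (post_id, thumbnail_id) pairs whose thumbnail is not an attachment in the file."""
    items = list(root.iter("item"))
    attachment_ids = {
        item.findtext("wp:post_id", namespaces=NS)
        for item in items
        if item.findtext("wp:post_type", namespaces=NS) == "attachment"
    }
    dangling = []
    for item in items:
        for meta in item.findall("wp:postmeta", NS):
            if meta.findtext("wp:meta_key", namespaces=NS) != "_thumbnail_id":
                continue
            thumbnail_id = meta.findtext("wp:meta_value", namespaces=NS)
            if thumbnail_id not in attachment_ids:
                dangling.append((item.findtext("wp:post_id", namespaces=NS), thumbnail_id))
    return dangling


dangling = find_dangling_thumbnails(ET.fromstring(SAMPLE))
print(dangling)
```

Note that this only checks IDs within the file: a thumbnail may legitimately point at an attachment that already exists on the target install.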

# Custom fields

If the destination Wordpress install uses Advanced Custom Fields (ACF), you can fill Custom Fields, but this is a bit trickier.

Basically, you would fill it by adding the following postmeta where my-custom-field is your field’s name (not to be confused with the field’s title):

<wp:postmeta>
  <wp:meta_key><![CDATA[my-custom-field]]></wp:meta_key>
  <wp:meta_value><![CDATA[Custom field data]]></wp:meta_value>
</wp:postmeta>

But in order for it to work, you have to link this field to ACF’s field by adding another postmeta with your custom field’s name prefixed by an underscore as the key:

<wp:postmeta>
  <wp:meta_key><![CDATA[_my-custom-field]]></wp:meta_key>
  <wp:meta_value><![CDATA[field_xxxxxxxx]]></wp:meta_value>
</wp:postmeta>

You will notice that I filled field_xxxxxxxx as the meta’s value. That’s because we need to find the ID under which ACF saved the field in the database. Luckily, ACF’s custom fields are plain posts with acf-field as their type.

The following SQL query will give us every custom field saved in Wordpress: you just have to pick the post_name that matches your field!

select post_title, post_excerpt, post_name from wp_posts where post_type = 'acf-field';
+-----------------+-----------------+---------------------+
| post_title      | post_excerpt    | post_name           |
+-----------------+-----------------+---------------------+
| My Custom Field | my_custom_field | field_5e9d99af9f6b7 |
+-----------------+-----------------+---------------------+
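Once you have these rows, writing both postmeta entries for each field can be automated. The sketch below assumes you copied the `post_excerpt` and `post_name` columns into a list of tuples by hand; `add_post_meta` and `add_acf_field` are helper names I made up, built on the standard library rather than on the library presented below.

```python
import xml.etree.ElementTree as ET

WP_NS = "http://wordpress.org/export/1.2/"
ET.register_namespace("wp", WP_NS)
WP = "{%s}" % WP_NS


def add_post_meta(item, key, value):
    """Append a <wp:postmeta> key/value pair to an <item>."""
    meta = ET.SubElement(item, WP + "postmeta")
    ET.SubElement(meta, WP + "meta_key").text = key
    ET.SubElement(meta, WP + "meta_value").text = value
    return meta


def add_acf_field(item, field_keys, name, value):
    """Write the value postmeta, plus the underscore-prefixed one linking it to ACF."""
    field_key = field_keys[name]  # raises KeyError if the field is unknown
    add_post_meta(item, name, value)
    add_post_meta(item, "_" + name, field_key)


# (post_excerpt, post_name) pairs copied from the query result above.
rows = [("my_custom_field", "field_5e9d99af9f6b7")]
field_keys = dict(rows)

item = ET.Element("item")
add_acf_field(item, field_keys, "my_custom_field", "Custom field data")
print(ET.tostring(item, encoding="unicode"))
```

Failing with a `KeyError` on an unknown field name is deliberate here: a value imported without its field key is exactly the case that does not work.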

# A Python library for automating the file generation

To make my life easier, I developed the following Python library:

This library has a few useful features:

And here is an example of their use:

# Import LXML to manipulate XML files
from lxml import etree as ET
from lxml.etree import CDATA

# Import wordpress specific nodes
from wxr_utils import WP
from wxr_utils import CONTENT
from wxr_utils import EXCERPT

from wxr_utils import create_root_node
from wxr_utils import create_channel_node
from wxr_utils import create_item_node
from wxr_utils import create_text_node
from wxr_utils import create_post_meta_node
from wxr_utils import serialize_array
from wxr_utils import write_xml

WEBSITE_ROOT = "https://domain.tld"
FILENAME = "out/export.xml"

# Creates the <rss> root node
root = create_root_node()
# Creates the <channel> node and fills the website's information
channel = create_channel_node(root, 'My awesome website', WEBSITE_ROOT, 'fr_FR')

# Adding a picture
logo_url = "https://domain.tld/path/to/picture.jpg"
logo_item = create_item_node(
        parent=channel,
        post_id="{0}".format(10),
        title="logo",
        link=logo_url,
        post_name="logo",
        status="publish",
        post_type="attachment")

logo_path = "{0}/{1}/{2}".format('2020', '04', "picture")
create_text_node(logo_item, WP + "attachment_url", CDATA(logo_url))
create_post_meta_node(logo_item, "_wp_attached_file", logo_path)
guid = ET.SubElement(logo_item, "guid", isPermaLink="false")
guid.text = logo_url

# Adding a post
slug = "my-article"
item = create_item_node(
        parent=channel,
        post_id="{0}".format(11),
        title="My article",
        link="{0}/{1}".format(WEBSITE_ROOT, slug),
        post_name=slug,
        status="publish",
        post_type="post")
create_text_node(item, CONTENT + "encoded", CDATA("<p>Article content</p>"))
create_text_node(item, EXCERPT + "encoded", CDATA("<p>Article excerpt</p>"))

# Adding a category to the post
# NB: you can add more than one category
cat_node = ET.SubElement(item, 'category')
cat_node.set("domain", "category")
cat_node.set("nicename", "category-slug")
cat_node.text = CDATA("Category name")

# Adding a tag to the post
# NB: you can add more than one tag
cat_node = ET.SubElement(item, 'category')
cat_node.set("domain", "post_tag")
cat_node.set("nicename", "tag-slug")
cat_node.text = CDATA("Tag name")

# Add the picture as thumbnail
create_post_meta_node(item, "_thumbnail_id", "{0}".format(10))

# Fill a text custom field
create_post_meta_node(item, "my-custom-field", "Some text")
create_post_meta_node(item, "_my-custom-field", "field_5e9d99af9f6b7")

# Fill a link custom field
cta_array = ['title', 'Click Me!', 'url', 'https://domain.tld/link/of/cta', 'target', '_blank']
create_post_meta_node(item, "link-cta", serialize_array(cta_array))
create_post_meta_node(item, "_link-cta", "field_5cb2a93cddf5b")

# Save files
write_xml(root, FILENAME)
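A word on the link custom field: Wordpress stores array values in postmeta using PHP's `serialize()` format, so that is presumably what a helper like `serialize_array` has to produce. The sketch below is not the library's code, just my own illustration of that format for a flat list of alternating string keys and values; `php_serialize_pairs` is a made-up name.

```python
def php_serialize_pairs(flat):
    """Serialize [key1, value1, key2, value2, ...] as a PHP array of strings."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of items (key/value pairs)")

    def php_string(text):
        # PHP string lengths are counted in bytes, not in characters.
        return 's:{0}:"{1}";'.format(len(text.encode("utf-8")), text)

    pairs = list(zip(flat[0::2], flat[1::2]))
    body = "".join(php_string(key) + php_string(value) for key, value in pairs)
    return "a:{0}:{{{1}}}".format(len(pairs), body)


print(php_serialize_pairs(["title", "Click Me!"]))
# prints a:1:{s:5:"title";s:9:"Click Me!";}
```

Counting bytes rather than characters matters as soon as a value contains accented characters, which is common on a French website.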

Feel free to modify my library to set fields I didn’t need, such as post_date, is_sticky or post_parent, or even to add functions that automate the creation of nodes such as <category>. Good luck migrating to Wordpress!