Transcript: Unraveling the ETL Data Migration Process: Understanding Extract

This is an edited transcript. For the blog post and video, see Unraveling the ETL Data Migration Process- Part 1: Understanding Extract.

[00:00:06] Janez Urevc: Welcome to Tag1 Team Talks brought to you by Tag1 Consulting. With Drupal 7 and Drupal 9 rapidly approaching end of life, we are hearing people talk about migrating and upgrading more than ever before.

[00:00:20] Janez Urevc: And anyone who's ever been involved with a large scale migration, migrating a large site or application, from one technology stack to another, will tell you that it's complex, time consuming, and it demands expertise. That's why we are bringing you the series of talks diving deep into the world of Drupal migrations, and who better to guide us than Tag1's very own Drupal migration experts.

[00:00:47] Janez Urevc: From the masterminds and maintainers of Drupal's migration tooling, to the individuals behind the most groundbreaking Drupal migrations, we've got an all star lineup who'll cover everything you need to know about [00:01:00] every aspect of migrating large scale applications. This team talk is part,

[00:01:07] Janez Urevc: of the three part series about ETL. Extract Transform, and Load process, which is used by many enterprise migration systems, Drupal's Migrate included. In today's episode, we're going to talk about how to use Drupal Migrate system to extract data from a Drupal 7 database or any other third party source.

[00:01:30] Janez Urevc: Be sure to stick around to the end because we are also going to announce next few talks in our series. Let's dive in.

[00:01:39] Janez Urevc: I'm Janez Urevc, Senior Engineer here at Tag1, and a longtime contributor to Drupal. I'm joined today by well known top contributors to Drupal, Benji Fisher, one of the five current Drupal Migrate core subsystem maintainers, and Mike Ryan, co creator of Migrate.

[00:01:56] Janez Urevc: Welcome. Thank you for joining me. [00:02:00] Um, we are glad that you are here.

[00:02:07] Mike Ryan: Present.

[00:02:07] Janez Urevc: Okay. For the beginning, let's, uh, let's explain what ETL stands for. Um, like where, where does this concept come from? And, um, What are other ETL systems besides Drupal Migrate? Mike, I'm sure that you have plenty of experience in this area.

[00:02:31] Mike Ryan: Yeah. Um, I think the ETL concept really came from mainframe data migrations, uh, back in the day. I have little doubt that many of them were coded in COBOL. Uh, my first introduction actually was before Drupal. I worked in, uh, HR, human resources information services for a hospital in Boston, and we were switching HR systems.[00:03:00]

[00:03:00] Mike Ryan: So, um, I was, uh, Primary person responsible for doing that migration. And that was a classic Extract Transform Load. We would extract the data from the original system into CSV files, and I loaded them into Microsoft Access, where I created queries to transform the exported CSV format into the format for importing into the new system.

[00:03:38] Mike Ryan: And then, you know, we'd export those CSV files and import them or load them into the destination system. So that was my first introduction to ETL.

[00:03:51] Benji Fisher: So access wasn't the final destination. That was just your, what you used for the transform stage.

[00:03:57] Mike Ryan: Yes. Um, we, we didn't [00:04:00] have. You know, uh, database server to play with or anything like that in between.

[00:04:08] Mike Ryan: We're, we're basically limited to, you know, Microsoft Office tools, because this is, we weren't really a data center. We weren't engineers. Or it wasn't an engineering environment. I was a software engineer, previously, but, um, that I learned a lot. And then, uh, around this time at a Drupal meetup in Boston, Moshe Weitzman was also there and people were talking about trying to import raw HTML into Drupal and, um, you know, Moshe looked across the table at me and said, you know, there's There's an opportunity here, which is when we got together to, [00:05:00] uh, start developing a migration system for Drupal.

[00:05:05] Mike Ryan: And of course, since we wanted to, and I'm getting ahead of ourselves a little bit, but, um, the original concept was importing raw HTML, but of course, you also want to be able to extract from databases or CSVs and so on, which led to the idea of a pluggable system. Uh, in specifically. Relevant to today's talk, uh, pluggable extract, um, plugins.

[00:05:40] Janez Urevc: Did you, did you have like a client or a website where you needed to use this? Or was just like, uh, you saw the opportunity and you decided that you want to build a system?

[00:05:56] Mike Ryan: Yeah, we, we started, um, [00:06:00] building a system. I'm trying to remember what the first, was the Economist the first client? Yeah. I think there was, there was one before that we had, uh, this is, feels like ancient history, like what, 15 years ago.

[00:06:18] Mike Ryan: Um, yeah, but we, we saw the need. We saw that sometimes, um, the Drupal upgrade process could be difficult. And sometimes it might be easier to treat it as a migration rather than, uh, with update hooks as was the traditional method at the time.

[00:06:43] Janez Urevc: Yeah, that, that discussion, I already remember, I've been around for that.

[00:06:47] Janez Urevc: That happened like when we were figuring out how people were going to be upgrading their Drupal 7 sites to Drupal 8, because it was so different.

[00:06:58] Mike Ryan: Yes, the, the [00:07:00] update method of it prevented making that sort of massive change to Drupal really, because there's no way you could really completely refactor the database. Um, schema using update hooks.

[00:07:17] Janez Urevc: Yes. Um.

[00:07:21] Janez Urevc: So

[00:07:23] Janez Urevc: so ETL stands for Extract, which is about pulling the data from the legacy source, basically, Transform, which is about processing the data along the way and then Load, um, which is basically storing it into the destination system. And today's topic is E, extract. So let's, let's dive into that. Um, what is being done as part of the extract phase and more importantly, how is that done in Drupal, Benji?[00:08:00]

[00:08:00] Benji Fisher: So there are a variety of plugins. Um, yeah, typically when, or typically we're doing a migration from Drupal six or seven and we have a database. So we have, so we know a lot about the structure of the data and there are. Uh, what we call source plugins built into Drupal. Um, Drupal Migrate API has its own terminology, so the extract phase is called source plugins.

[00:08:27] Benji Fisher: Um, but the source can be lots of other things. If you're not talking about migrating from Drupal 6 or 7, you might have a CSV source, just a spreadsheet, or a spreadsheet in some other format like Excel. Uh, you could have JSON, it might be JSON API, um, XML or SOAP. Um, someone suggested that you might want to import emails into Drupal and then you'd have an IMAP formatted file.

[00:08:58] Benji Fisher: I've never seen that, but it's [00:09:00] certainly something that could be done. And, um...

[00:09:04] Benji Fisher: I do want to point out that migrate system can be used not just for creating a Drupal site out of some legacy system. It can also be used for recurring migrations. And typically when, when you're doing that, the source is going to be any of those formats served over an API from somewhere else. And you might be importing articles or events.

[00:09:28] Benji Fisher: Users, um, just in any sort of content that you have, um, in any external system and you want to get it into Drupal.

[00:09:40] Janez Urevc: Yeah. Or you could use it with commerce, for example, and import, um, products on an ongoing basis from some other system where that is the main source of truth for your products.

[00:09:57] Benji Fisher: Yeah, I've, uh, I've worked on [00:10:00] projects that were using QuickBooks and we wanted to get information going both ways.

[00:10:06] Benji Fisher: We wanted to get, um, I guess, prices out of QuickBooks into Drupal and we wanted to get orders out of Drupal into Commerce.

[00:10:18] Benji Fisher: And I think we'll talk about that a little later, um, or maybe in, uh, in one of the, the other episodes. The funny thing is that we ended up using the Migrate API for both directions.

[00:10:34] Janez Urevc: Which, uh, proves how well architected it is, in my opinion. Like the fact that you can, that you can come up with use cases.

[00:10:43] Janez Urevc: That are definitely not obvious, um, rules, how powerful it is. And I think that the main or not the main, but one of the reasons why it's so powerful, it's that it's, uh, [00:11:00] architected as a pluggable system, um, where you have different plugin types and then you can reuse code by implementing plugins and reusing them on different migrations.

[00:11:12] Janez Urevc: Um, and this is something I would like to ask you, Mike. Um, because this is so core of the migrate nature. Um, did you, did you know from the beginning it should be a pluggable system or how, how did you decide that you want to architect it this way? Um, and maybe, maybe you can briefly talk about which plugin types do we have.

[00:11:39] Janez Urevc: Um, and yeah, why, why, it was done this way in the first place.

[00:11:45] Mike Ryan: Uh, well, I, I did allude to that earlier, and that is, um, the original concept we were talking about importing HTML in, um, raw HTML pages into Drupal. But of course, once you're using [00:12:00] migration for, um, Drupal upgrades or side grades, which actually that was a common use case we saw where someone wanted to re architect their site on the same version of Drupal.

[00:12:15] Mike Ryan: And transform their, um, content types. So anyway, um, at the time this was before Drupal had a general purpose entity system and pretty much every module that had data had its own tables with their own schema. So you, we needed to have distinct plugins, distinct sources for each of those, as well as to handle, um, occasionally you'd need to import a CSV file into a Drupal table or, um, again, HTML.

[00:12:51] Mike Ryan: And once, once you start seeing things as pluggable, it becomes natural to just make [00:13:00] everything pluggable, all the more so, uh, since Drupal 8, which is, um, Entirely built based on the plugin concept.

[00:13:12] Janez Urevc: Yeah. Like the idea of plugins was around even before I remember, like you had to like C tools provided plugins in Drupal seven world and.

[00:13:25] Mike Ryan: Even six, I think, uh, I think the Views module had plugins,

[00:13:32] Janez Urevc: Right? Yeah, it could be that even the, the thing that was in C tools came from Views and was then extracted into the C tools because C tools were a dependency for Views, right?

[00:13:44] Mike Ryan: Yeah. Earl extracted, uh, C tools from Views. As I, if I recall correctly.

[00:13:50] Janez Urevc: Yeah, and then in Drupal 8, we got this, it was basically completely re, re architected plugin system and [00:14:00] then everything moved to that. And, um, yeah, it's quite pleasant to work with in my opinion.

[00:14:06] Benji Fisher: Right. Even plugins have plugins, you can have annotations or YAML, or I think there's a third way of specifying the plugin structure.

[00:14:18] Janez Urevc: You can always get very crazy about, you know, we, like when I was working on the media initiative, um, we also use plugins a lot and made everything pluggable, like sometimes when I, when I look back at it, maybe we even made everything too abstract and I think this. Drupal community loves to make everything abstract and everything pluggable and, you know, not opinion and that, that's very powerful, but sometimes it ends up seeing very complex systems.

[00:14:53] Janez Urevc: And now looking back, like I would, I would do some [00:15:00] things in a more simple way. Um, in the media ecosystem, but, um, yeah, Migrate totally made sense. Every, all the time to me, like I never had issues understanding how it works. So great job, Mike. Um, so we already mentioned that going from Drupal 7 to Drupal 8, uh, Migrate became part of core, which is another proof.

[00:15:27] Janez Urevc: Uh, how powerful and how solid it is, um, which means that, um, there is some basic functionality that ships with Drupal Core, but then, um, we also have a huge number of modules or plugins in the contributed space, um, which can help you run more esoteric migrations. Uh, Benji, could you tell us what, what ships with Core and [00:16:00] what is that mostly used for? And what would we have to turn to Contrib for?

[00:16:08] Benji Fisher: Right. So in terms of source plugins for the Extract phase. Um, Core provides the the API that everything is built on. So there will be some, um, sort of base classes.

[00:16:22] Benji Fisher: There's a base class for a SQL source. Um, there's the content entity, um, content entity and config entity base classes. Um, but other than the basic things needed for the API. The criterion for being in core is that, um, the migrate system was the way to upgrade a Drupal 7 site to Drupal 8 or later. And so, um, [00:17:00] the core migrations are almost entirely focused on that use case.

[00:17:05] Benji Fisher: So, in terms of source plugins, uh, we have Drupal 6 sources and Drupal 7 sources. That's basically the only thing that is supported in core. So, for your migration projects, getting your pre Drupal 8 databases migrated into modern Drupal, um, that's what's in core. Anything else is going to be in contrib. Um, The only exception I can think of is the, the content entity, which is using the current database as a source.

[00:17:41] Benji Fisher: Um, so Mike was talking earlier about side, what did you call it? Side migrations? Side grades. Um,

[00:17:48] Mike Ryan: side grades.

[00:17:50] Benji Fisher: Right, so you, you can do that in the current site. Using the content entity source [00:18:00] and just loop through all of your badly architected foo content types and creating bar content types out of them.

[00:18:09] Benji Fisher: Um, so for any other source plugins, we have to turn to contrib. There is the Migrate Plus module, which is maintained by, um. By Lucas Hedding, one of the, uh, one of the other maintainers, current maintainers of the Migrate API. So that's sort of considered core adjacent, um, things that seem generally useful but don't meet the criterion for being in core go into there.

[00:18:41] Benji Fisher: So there is, um, a source plugin there for SQL table. And then, uh, really generic, there is a URL source plugin, so anytime you're getting something from an API, you're, you're going to be using the URL plugin, [00:19:00] and then the Migrate Plus module introduces its own plugins. So we, we know what the API is, um, we need, typically some sort of authentication, so there are authentication plugins.

[00:19:13] Benji Fisher: Um, we need to, um, fetch the data and then parse it. So there are three systems of plugins tied into the URI. I'm sorry, the URL plugin from Migrate Plus. And then there are, uh, a bunch of... Uh, more specialized contrib modules, um, there's one for CSV files, one for other spreadsheets like Excel, um, there's a module devoted to, um, Drupal Commerce and, uh, both earlier versions of Commerce and I think some other systems like Magento are, um, are handled by that.

[00:19:58] Benji Fisher: There's the WordPress Migrate [00:20:00] module, um, which is a specialized XML source. The idea there is you export your WordPress site to XML, which WordPress provides that functionality, and then we import the XML into Drupal. Um, there's... Um,

[00:20:15] Mike Ryan: Just to note, there are other systems that, um, Implement the WordPress XML schema.

[00:20:22] Mike Ryan: So not just WordPress, but there are certain other, I can't remember which off the top of my head that use the same, uh, XML schema. So you could use the WordPress migrate to import from those.

[00:20:37] Benji Fisher: Cool. I didn't know that.

[00:20:39] Janez Urevc: So no reason to use WordPress anymore, right?

[00:20:44] Benji Fisher: There there's a Views Migration Module, and that's something that we just decided Drupal Core was too hard. There are too many variations contrib modules can make their own views plugins. And we just didn't think there would. [00:21:00] a reliable way to, to migrate views from Drupal 7 into modern Drupal.

[00:21:07] Benji Fisher: There is a contrib module that attempts to do that. Use it with care because it is a hard problem. Don't expect it to be the final answer to migrating your views, but it can at least give you a start. Um, and I, I just saw yesterday looking at the list there. There's one that migrates playlists from YouTube.

[00:21:30] Benji Fisher: Um, so, so a lot, there are a lot of these specialized source plugins in contrib space.

[00:21:38] Janez Urevc: The YouTube one could be used for, uh, like incremental migrations. That we mentioned earlier, right? Like if you have a YouTube channel and you want to pull the videos that you upload there into your Drupal site automatically, you could use that.

[00:21:57] Benji Fisher: Exactly. Yeah. I don't, I [00:22:00] suppose there might be some sites that would want to download the actual videos and they could do that.

[00:22:05] Benji Fisher: Typically we leave the videos on YouTube, let YouTube do what it does well. And, and migrate the metadata of the playlists into the Drupal site, and that would be recurring as you make new playlists as you add to the playlists. You'd want to import the new data.

[00:22:21] Mike Ryan: Yeah, you could have a, um. Uh, cron job that runs the migration.

[00:22:28] Janez Urevc: It's amazing, like how thriving the contrib is. And, you know, that's usually the case in Drupal, but you know, it amazes me over and over again, how many things developers come up with. And, you know, the ideas that people have just mind blowing, like sometimes you, you run into things that, that you would never think about and yet somebody thought about it and have need for it.

[00:22:53] Janez Urevc: Um, it's, it's really, truly amazing. Um, so I [00:23:00] personally have done a few migrations in my career, but they were mostly, uh, migrations from third party CMSs. Um, I worked a lot with media companies and a lot of times media companies would have sometimes even internally developed CMSs and, uh, in one of the projects that I've been involved with, they realized that that is not sustainable.

[00:23:28] Janez Urevc: So they decided to, first they decided that they want to go with something open source, and then they decided to go to Drupal. Um, so we had to migrate from that and we were basically using the equal. source plugin, loading data from the third party legacy database, and then importing into Drupal. But I'm pretty sure that Mike Ryan, you have, uh, more, more interesting use cases that you've experienced, uh, throughout your career.

[00:23:59] Janez Urevc: So [00:24:00] can you talk about that?

[00:24:03] Mike Ryan: Um, really would have to do some archeology on my memory to dig that up. Um, you know, one interesting use case, I can't remember exactly what project it was, but where there was associated data that the primary content was in the database, but there was associated content which needed to be.

[00:24:33] Mike Ryan: Um, pulled from a web API. So the, um, source plugin, the source plugin has hooks that you can implement to add to what the raw source plugin does. the source plugin and produces one row of data equivalent to an SQL row at a time. And you can implement the [00:25:00] process row hook to, to add to it.

[00:25:04] Mike Ryan: Within the source plugin, so you can read the additional data from an API or CSV or so on and add it to that row before it goes into the transform part of the pipeline.

[00:25:18] Janez Urevc: Interesting. How did that affect performance? Because I assume that if you are calling APIs on every row, that can slow the migration down quite significantly.

[00:25:30] Mike Ryan: Yeah, it's, I mean, it really depends on the source. Yeah, sure. Yeah, if you're pulling from a local file, not so bad, or, but if it's a web API, then that will slow things down.

[00:25:52] Benji Fisher: As long as we're mentioning the process row step, it's worth mentioning that in [00:26:00] a Drupal to Drupal migration. Um, there are a lot of fieldable entities in Drupal six or Drupal seven, um, content types of fields on them, for example.

[00:26:10] Benji Fisher: And we, we structure it the same way. We have the base table that gives us the, say the node IDs if you're migrating nodes and it produces, uh, you know, a bunch of rows. And then we use that, that same process, row stage, to attach all, all of the fields to that row, one, one row at a time.

[00:26:34] Mike Ryan: Yes. And although I said that it's equivalent to an SQL row, but it's, um, a logical row in an SQL row, you're going to have, um, scalar, uh, values.

[00:26:55] Mike Ryan: Um, while the row, the, um, [00:27:00] row object that we're passing through we can have multi value fields in there. So in the prepare row, the, um, if there are associated fields with multiple values, those can be extracted in the prepare row and plugged into an array. As a member of the row, that's going to go through the pipeline.

[00:27:27] Mike Ryan: And then the rest of the pipeline knows how to handle. Uh, arrays and how to process each member of that array within the value.

[00:27:39] Janez Urevc: Benji, I heard that you migrated a single page into several nodes in the past. Can you tell us about that?

[00:27:47] Benji Fisher: Yeah, this, this was kind of funny. It was part of, uh, uh, a migration from Drupal 7 to Drupal 9, I think. And, uh, and most of it was pretty routine, but they also had. One [00:28:00] page, which looked like it was a views listing of some person content type and each person had, uh, several publications and they had a picture and so on and so forth.

[00:28:15] Benji Fisher: Um, but in fact, it, it wasn't from structured data. It was one Drupal 7 basic page. Um, with 10 or 20 of these, um, things that looked like, like, like nodes, um, just pasted into the body field. And, uh, we decided we were going to restructure it and make it what it looks like and have a person content type, but with structured data on it.

[00:28:45] Benji Fisher: Uh, but we had to extract it from this page since it was only 10 or 20. And this is the funny part. I said, you know, get an intern, have them copy and paste. You will spend as much of the intern's time as you would spend of [00:29:00] my time. Um, And, and the insurance would be a lot cheaper, but they said, no, no, no, we want to do it automatically.

[00:29:07] Benji Fisher: I said, okay, this will be fun. And, uh, as far as the source was concerned, um, I, I just downloaded that page once and put it into the, uh, the directory for, for the custom module we had for migration. Um, I ran HTML Tidy on it to make sure it was well formed XHTML. And, uh, and then I used the URL plugin that I was talking about from Migrate Plus and, uh, and one of the XML data parsers from Migrate Plus.

[00:29:45] Benji Fisher: And, and I guess the important thing to mention is that, um, even though it was not technically structured data, it was just one page of HTML, it did have consistent structure. So every section of the page started off with an H3. [00:30:00] Um, and directly beneath that there was, uh, an image, um, so there, there was enough structure there that I could reliably pick it apart and create fields on the, uh, the person nodes.

[00:30:13] Benji Fisher: That I was creating.

[00:30:15] Janez Urevc: That's very interesting. And the interesting part is that you've been able to use contrib and not have needs for custom code with that. Did you, did you use XPath, uh, to, to parse that markup or something else?

[00:30:31] Benji Fisher: Yeah, and anytime that you're parsing XML, and I, as I said, I created XHTML and so anytime you're parsing XML, you want to use XPath, which, um, I'll have more to say about in the episode on the transform or process plugins.

[00:30:48] Janez Urevc: Which, uh, which will be the next episode in this mini series. Um, so definitely join us again for [00:31:00] the "T" part of ETL, which is Transform. Um, besides this mini series, we have some great talking coming up. Um, our goal is to put one episode per week over the next few months, um, to support the community in the migration process.

[00:31:18] Janez Urevc: Um, we will talk about performance because performance is something that we care deeply about at Tag1 and it applies to migration as well. Because when you are handling really large data sets, a full data migration can easily take like 12 hours or sometimes even days. Um, so we'll do a handful of talks on this topic, including on how to profile and tune a migration.

[00:31:46] Janez Urevc: Including long running migrations and, uh, we'll also talk on incremental migrations, which we touched on in today's episode already. Um, how you can include or exclude things and run [00:32:00] a migration on a subset of data to make it faster. Uh, then, um, every project owner wants their migration to be a success and we will dedicate an episode to discuss the most important factors for a successful Drupal 7 to 10 migration in order to help you successfully navigate your migration project.

[00:32:19] Janez Urevc: And other topics that we will cover in the future include porting custom code from Drupal 7 to Drupal 10, the future of migrate tooling, how to port a team. And so much more. We hope that you'll tune in and enjoy our upcoming team talks.

[00:32:38] Janez Urevc: A huge thank you to our Tag1 team, Benji Fisher and Mike Ryan. Thank you for joining me.

[00:32:45] Janez Urevc: Make sure you check out the other segments in this series. Uh, there will be links to them in the show notes along with all the other links that we mentioned today. If you like this talk, please remember to upvote, subscribe, and share it. [00:33:00] Um, you can check our past talks at tag1. com slash TTT. That's three T's for Tag1 Team Talks.

[00:33:09] Janez Urevc: Um, as always, we'd love to hear your feedback and any topic suggestions. You can write us at ttt@tag1.com. Again, that's three T's for Tag1 TeamTalks. A big thank you to both our guests and to everyone who tuned in.

[00:33:29] Janez Urevc: Thank you for joining us.

[00:33:31]

Drupal Migration Series

Transcript: Unraveling the ETL Data Migration Process: Understanding Extract