This is a transcript. For the video, see Using Goose to troubleshoot Quick Edit.
Michael Meyers: [00:00:00] Hello, and welcome to Tag1 Team Talks, the podcast and blog of Tag1 Consulting. Today, we're going to be talking more about finding and fixing problems with Goose, an open source load testing framework that was inspired by Locust and written in Rust. It's the most scalable framework available.
It is the easiest to scale. You can check out all of our talks on Goose at Tag1.com/Goose. Today's talk is a really great example of why you should load tests over long periods of time and not just short runs, as well as why you should be performing edit actions on, you know, creating content and performing editing actions on the content that you create.
During our last talk on Goose, we explored the practical side of load testing, and we walked through how to analyze your load testing results, using New Relic so that you can find problems that are impacting your website and address them. We load tested against the Drupal framework and CMS with the default umami install and in the process, we did discover a problem.
Uh, the problem is with Quick Edit, Quick Edit allows you to quickly edit your content without having to leave the context of the page and going into full edit mode. What we found in running a load test over a long period of time and creating and editing content is that when you have a lot of revisions, load times get insanely slow for end-users.
We're going to walk through how we use the local container to track down the bug, fix it, and then confirm the fix with follow up load tests.
I'm Michael Meyers, managing director of Tag1 Consulting. And joining me today are two performance and scalability experts. Jeremy Andrews, the founder and CEO of Tag1 and the original creator of Goose and Fabian Franz our VP of technology. Who's been a major contributor to Goose, uh, and a really well known performance and scalability expert.
Uh, Fabian is responsible for a lot of the performance and scalability in Drupal. Jeremy, Fabian, thank you guys so much for joining. I'm excited [00:02:00] that we, uh, found a bug in our, in our load test. Not that there is a bug, but that, you know, we set out to do a load test and sure enough, uh, we found one, uh, and I thought, you know, I learned so much in how to track things down with New Relic.
It's, it's this amazing tool. Uh, Fabian, could you tell us, you know, uh, just rewind a little bit, tell folks how we discovered the bug, you know, what jumped out and what the next steps were.
Fabian Franz: So essentially what we did, um, on New Relic, we, um, tested our beautiful Umami install and did that for quite awhile.
So we run it for a few hours, both for testing the load test. Um, so that we had good demos and, um, then later just to demonstrate it and we found that over time it gets slower. Um, but we also found, um, that we had several slow responses always there. Things like 27 seconds. And then we found that all those responses, especially, um, [00:03:00] when we looked at a partial trace and New Relic and we found this one function, quick edit entity view all the time.
Um, so what apparently quick edit did something here. So, uh, let me share my screen. I'm gonna, uh, gonna start here? This is Umami. I'm logged in as an admin and just to show real quick what Quick Edit is. It's so I can just edit this and then save it if I want to, or I can just start my changes.
That's quick edit in a nutshell. Um, the bug we found actually is pretty well-known. Um, it's a regression from 2018, was created 17, March, 2018, and, uh, it still needs work. Uh, it's still not fixed in Drupal 9.3.x and it's a critical part of the entity system. And here we can already see some people have already debugged something, but we are not going to look at that because when I started debugging, I didn't [00:04:00] have that.
So let's assume you don't. So, um, the first thing I did was, I ran a Goose load test against my umami instance, but surprise, the problem that was so apparent on New Relic was just not to be found. So, um, then I thought maybe it's something that has to do, that just happens after a while. And so I did leave a Goose running overnight and, uh, came back the next morning, and voila, the problem was there.
Then I looked what has changed essentially. And what has changed is that whenever within a standard Umami profile an article is edited. Um, you create a new revision of it. So, here we have, several revisions created by admin, uh, which are created by the load test and one originally as part of this profile.
Okay. So it had [00:05:00] something to do with revisions. That was a good point. So, what I then did was, I created kind of a little, um, a shell script. There was just, deprecating one revision entry, um, because I wanted to really analyse the problem. Turns out 100,000 revisions it's way too much. We'll later see, what's that about, but essentially in the end I've - I want to have, um, 500, uh, revisions.
So what we can do is we could count the number of revisions from a node revision. Um, group by node id. So then, and we see, um, many, you just have little ones, so this is no longer the installation I let run Goose overnight, but this is a new installation. And then just input like around 6,000 revisions, um, just to -------- the problem.[00:06:00]
Um, so, this is how is the revision table looks.And then we can essentially start with two things. Uh, we can start from the code level and just look at what's calling Quick Edit into View Alter. What does set calling? What is that calling, what's that calling and go there, or we can now that we control our local, uh, essentially get a trace.
Um, and for that we start a normal load test. Um, just like normal. So we have our Goose here, admin username, password is already in the environment and we are just starting it up and, and, um, because I have so many revisions, Goose, should be able to show me the problem. So while we're waiting for Goose to do that, um, I've also put us a little, uh, nice tool, uh, here, um, into our installation [00:07:00] and we're using, for this whole installation here we're using DDev um, that means again, for example, the DDev Drush and execute Drush commands on this installation, or I could quickly get, uh, get a database connection, via drush sql-cli.
Um, and, um, So DDev is the core installation used. And, uh, but what we really would like is kind of like a local New Relic so that we can, um, essentially record all those, um, runs that are doing. And this is what I did. I installed a tool called XH graphical user interface. And, um, it's a little bit old, but still working at users among the DB container.
And it uses, um, a web container to then show me all the recent ones. And here you can see whenever Goose is visiting a page. I am seeing here my, um, [00:08:00] my traces, but it's really, really nice. And, uh, despite New Relic, which obviously also needs to work this production instances. And we did test on a kind of production instance - our load test last time.
Uh, We only get partial traces unless we configure, especially, um, here I get full traces because I don't care as much about the performance and more about the relative performance. So we can already see, do we have something with the longest waittime? Yes, we have. So here we have one that took 22 second, 22nd, 19 seconds, 18 seconds.
17 seconds. Okay. So, uh, always the same article, funnily enough. Uh, okay. So, probably one that gets ended frequently. Um, we could now, uh, look into one of those ones and see already, oh, it's something with the database that takes so long, the remainder is kind of, not [00:09:00] really. And you can see this graphical user interface a lot more raw than New Relic in that, but we can still get a, get a pretty nice idea.
So, um, there's two ways we can now, essentially get to, oh there's some nice graphs. Um, who's using what memory, et cetera. Um, and it could essentially have two ways to go for what now I can, can go to, who's calling the statement for, by execute and taking too long. Then I can look up to the query, can look up to the execute and goes through that all.
Or I can do something a little bit more smart because I, know this is just one, one function that is, um, is so, um, hard. So it should be able to sort by it, in theory at least.
Um, but apparently I can't . Okay. Um, then we need [00:10:00] to go, uh, just click through.
So we just need to always look at where the largest results and that doesn't take long and function execute again.
The parent function of that. And now we see this query function is not called by much. There's just as good default revision ID from content moderation. Then they assess moderation state five item. Then there's, get latest revision idea and get latest translation, get latest translation affected revision in here, but it's kind of interesting, but we see already our bugger is here, get later to revision ID.
So, um, and this is called by entity type is revisionable. Um, and, um, this takes very long apparently. Okay.
So,now we wanna want to look [00:11:00] at, um, Oh, and this is it's even called by this parent functions in certain contexts, but it can also be called itself. So this would be the view from internal. How do you get this nice graphical user interface, uh, on here? Um, right now this is just a pull request, um, in the DDEV contrib, uh, repository, it's a pull 128.
Um, and this is essentially, it has a nice recipe of how you can set it up. The first thing you need to do is in your DDEV config YAML file. You need to add this web image, extra packages for each path, but I've been told that in future DDEV versions, um, xhprof will be available by default, similar to XD back.
So that's a great win ready for open source, because there's something else proprietary supported, out of the box, also. [00:12:00] Um, then we copied yourself a folders to our data folder. Essentially you just follow this recipe and then for the application, um, I've used the WordPress way because some of the, uh, Drupal way, the Composer way didn't work for me.
So just download this PHP profile here and required both those files and there I was up and running. So let's take a look at how this looks. Um, this is essentially just two lines of code edit here and that's it. By the way, I did the same mistake as we originally did with the old New Relic thing. Um, I didn't had Redis enabled first.
Um, so I, uh, quickly mitigate that by adding Redis here as well. Um, we have a standard Pantheon information, Uh, just because, uh, before it was kind of screwing up my database statistics because was having way more load the database than I expected. So I'm [00:13:00] not going to do that. The other way, how we could've ...
Jeremy Andrews: That was setting PHP, correct? That you were just looking at?
Fabian Franz: uh, penalty questionnaire again?
Jeremy Andrews: That was settings.php that you were looking at.
Fabian Franz: Yeah, this is settings PHP. So we are looking at, um, site's default settings, PHP, then settings page from Drupal. And this is just where I am. Um, just before DDEV includes its settings. I put this auto load here.
I still hope, um, that this in DDEV itself we can do some little tricks of either using pre-load. Or adding it automatically to settings DDEV PHP, if you do, xhprofs on and off like that, you can enable it by default. And, um, then we can have some little, um, helper script. Um, that is not a full-blown graphical user interface if all you want out on traces, essentially.
Um, so, [00:14:00] um, but this is a great solution that works very well with Goose. Our Goose Attack is still running, um, as also printing statistics from time to time. Um, but not, we have not yet seen the, um, the maximum, so I'm gonna stop this here now. And, uh, once Goose has stopped, we can see clearly see here we have an anon request, and it took 23 seconds.
So despite in theory, um, this thing should only be affecting, um, um, authenticated users because cricket is only there. Apparently we have traces where, um, this is affecting anonymous users as well. So that's pretty interesting and makes a critical bug, even more critical. Um, this is something that Goose found this didn't show up in that trace as clearly in, in New Relic.
Um, and, [00:15:00] um, uh, when we are testing on Pantheon not on a local, we also have the CDN, which is pushing a lot of load off. So it wasn't as prevalent here, but Goose can clearly show us, um, that the authenticated user here in this node called for posting the authentication, getting the authenticate, um, edit form, uh, there's all performance problems.
Yeah. So, um, this is, um, really, really nice of Goose of, of showing us all this and, uh, We can see that even that hour average is still, uh, pretty okay. Uh, for many things. Um, but it's a maximum where we have the real problems and, uh, Goose shows it directly. So, um, if you hadn't a nice tool like Goose, um, how would you go ahead in doing this?
So you would go to quick edit, entity view alter and, uh, you would need to know. And [00:16:00] if I remember correctly, but even saw this, as this latest revision in the New Relic, uh, screenshot. So, um, okay. I mean, now it's, it's this, this latest revision, what's this latest revision even about? There was something I asked and even questioned core about.
Core has the idea of forward revisions. So you have a published revision and any revision created afterwards is a so-called forward revision, which usually are drafts of the content, et cetera. But if you are in this mode, and apparently you want to quick edit. You never want to edit, um, the published version, but only the latest revision, essentially. Even if what you are seeing right now is the published version because it hasn't been published yet.
So, um, this is all a little bit complicated because Core, um, except if you use workspaces spaces, which you really [00:17:00] should, because then it's all clear of what belongs to what, because then if you change the workspace, It's clear you are editing in this workspace and even changed to publish. It's clear you are editing published, but here it's what would this edit essentially?
What is this edit the draft?
Jeremy Andrews: Clear to me?
What you just said was confusing. Are you saying that we're editing the currently published or editing? Not currently published right now?
Fabian Franz: The trick is whenever, um, we are looking at a live version, then this quick, quick edit link does not want to show, because it would be confusing if you edit it.
And then it's a completely different title in the edit mode then in this.
Jeremy Andrews: So quick edit. It's only going to show up if you're showing the, the last even unpublished, um, revision?. Is that what you're saying?
Fabian Franz: Yes. If you looking at the latest revision ID, essentially. [00:18:00] Just then will the quick edit link show up. But there could be editors that can't access drafts or a workflows, or like that.
I'm only seeing like the live version and so for those this link just should not show up. So Core has , this idea of a latest revision ID is essentially the latest draft. And, um, the latest revision ID is essentially the maximum revision ID. At least if you don't put translation into the picture, because then it gets even more complicated.
It's a maximum revision ID, uh, in the note revision table, grouped by node ID. So, uh, in theory its, as simple as, um, this. Its not the count, but it's a maximum of vid. So here [00:19:00] we see, um, what the, um, maximum revision ID, uh, is, uh, based on the node ID, maybe we should put the node ID in between, so, and because I've created a lot of revisions once and deleted them.
Uh, we have pretty high node IDs by now. Um, but yeah, this is essentially, uh, what Core expects to do. So what we can also do is we can write a little test, um, now that, uh, Goose is, is no longer here, um, we can write a little performance test of essentially just running this part. So, um, what I like to do is I called it scr.php.
And the reason is because I can just do DDEV SSH. And I go to the directory where I have set, and then I do Drush SCR, which is short for script. And then I [00:20:00] execute this little script and I can even put a time before it, and after sometime, uh, we'll be having a result of what the latest revision ID for node 16 is.
Uh, and it's correct. That's certainly correct. Um, we can see that here. Um, but we can also see it took 12 seconds and this one nicely enough is also showing up, uh, was the URL of slash um, this could be improved a little bit of like, for example, showing the command line parameters. If it's on the command line, we can see it took, um, 12 seconds here.
We can take a look at the one and, uh, nicely enough. Of course, it's data may statement for back again. Uh, what's making the problem and get latest revision ID as we already know.[00:21:00]
So, yeah. Um, so we found our troublemaker essentially. Um, now we need to find it in the code. Um, now that we've looked at it from all the stages. So such a script, it's very, very useful for testing this because you don't want to, especially with complex scenarios, you want to really just performance test,
This part that is so slow. You don't want to performance test, everything essentially. So, um, from Quick Edit, it, we would then, uh, searching for we would to go to its latest revision. Um, For the anti interface find that either via IDE or via grepping. Again, this is calling get latest revision ID here. And, um, and from there, there we go, go further to the front NT storage space, whereas the get latest to revision ID is implemented.[00:22:00]
Um, and this is essentially an entity query, which is having the special, latest revision part. And this is what introduced the problem essentially. Um, and, uh, for the get latest translation effective to revision ID. They have some other special key here, but they're essentially doing a range in there doing a sort, uh, for the revision.
Um, and the reason for that is in using all of the revisions instead of latest revision; in the original quick edit code, use this part as well. They also use this kind of sort, et cetera, and it was fast and it was nice. Just when Core introduced this generic interface, that is when things got really slow.
And the reason for that we see in the last part where the actual query is, and we see there's a left join, a self join on the table. Where we look at all the revisions being smaller and then being as null. And that prepared that for us here. [00:23:00] Um, and again, we can see this takes like a long time, uh, like five seconds here.
And the reason why this is so slow, um, this kind of query is, and, um, in this case, and explain doesn't even, even give us this information because it says, uh, using where, using index not exists even, um, is, um, because we cannot really optimize based on the idea that we have the node ID already and, uh, essentially its growth quadratic.
Based on the number of nodes and the number of revisions, because the self join on a table and 10 is null, is performance wise, not a good idea, but in problem is, uh, this max revision ID would also not work, uh, due to the fact, um, that we really, really, really want to filter this whole query, uh, based on that.
So the best compromise we [00:24:00] came up this essentially, instead of having all this, um, this, um, left join is to using, uh, instead, um, uh, just, um, an inquiry.
And then when we select, um, again, the max vid from node revision.
Limit leads. Okay. So, and, um, I do the, the mistake. Yeah. Sorry. oh, sorry. Yeah. Uh, and this needs again to be grouped by node ID, and now we have a [00:25:00] result and I'm gonna explain, um, says that at least for, not too many node IDs. Now we are based on the number of node IDs. Um, this view can even be materialized and using an index for group by, and that's already much, much better.
It's using a complexity of the number of nodes, but it's no longer a quadratic complexity. And that's really, really important. Um, the problem is, um, this is still not making any use of the fact that we know the node ID. So essentially, um, if, uh, we give this information to the sub query, which we cannot generically do, um, at, sorry, not base table, that's not now in here.
Um, then, uh, it can just use an index. It doesn't even need to use group buy anymore because, well, there's only one, one NID, so it can essentially change this and then select tables optimized the way it's just, uh, [00:26:00] materialized very directly. So, um, and this is how essentially we, um, a solve this, um, for the core issue.
Um, it's um, let me quickly to change to the branch or let's take a look at the diff.
Probably easier. So what we are doing for gate late to vision ideas, essentially, we put it back to what it was originally. We just get all the revisions. We put a range of 01 on it and then we sort by the revision ID in descending order. And that's fast because that's what the database can very easily do. It's as similar as just doing the max.
Um, the other part is instead of using this join and this wasn't my idea, but someone [00:27:00] else essentially had the idea that we access as a subquery. So we essentially say, uh, the revision fields should be in this max group by expression, which we just ran, uh, essentially this, uh, group by NID. And, um, this isn't optimal.
It would be much better to have a, have a nice mapping table, which you could join for where Core always tracking the latest revision ID of, of any revision. Um, but this is at least much, much better, uh, than we had before. Yeah. And this is not core patching in real time, but at least quite explaining, explaining your corporate and how we can use Goose to, uh, really help us streamline things, plan performance problems.
And, uh, yeah. Even prove that this is not only affecting our same authenticated users, but also anonymous users due to whatever, [00:28:00] but this is Umami standard. So, yeah.
Jeremy Andrews: Are you able to run the load test again with that same database, but with the patch applied so we can see that
it doesn't have that state anymore.
Fabian Franz: That's a very good idea. Let's do that.
Okay.
We have signature.
Michael Meyers: Um, I mean, it's really impressive to me and, you know, makes me understand why we have such a great business. You can talk about how you ran that load test and you saw that there was a problem, but it took a tremendous amount of creative problem solving to actually find the problem.
And then you need to know a ton about performance tuning code to fix the problem. Uh, so you know, it's a pretty multilayered, uh, challenge when you're, when you load testing here.
Fabian Franz: Yes, [00:29:00] it is. Um, performance tuning is not for the weak. It certainly is. It's it's, it's not the problem is you never know. Uh, if you find a problem.
Um, if you can fix it, if what kind of, um, um, structure you need and, um, with many of those parts, um, yeah, performance problems can be really, really tricky to find and exploit, but these are just tools that can help make it easier, et cetera. And then even Core had the right idea. Um, but, there was feedback that the group bike ferry wouldn't be better and we need something much, much better.
And so a good solution was, um, uh, delayed due to a perfect solution, not yet being there. Um, and that's also the little bit, and here you need really need to understand, um, if you're still studying, [00:30:00] study your big 0s, they're so important to know if something is quadratic complexity or linear complexity, it makes a world of difference.
So, yeah, and hopefully this is fast now. So let's see, um, what a longest wait time is. Yep. So longest wait time by now is 2.4 seconds. And we've been running already for quite a while, so we can assume, uh, this is fixed.
Jeremy Andrews: So then if you also canceled the Goose load test, um, you shouldn't see any of those really long.
Fabian Franz: Yep. Probably.
Jeremy Andrews: You can also tell that into the controller and do metrics to not stop it if you want.
Fabian Franz: Yeah. Sorry, exited already, but yeah, there's a
Jeremy Andrews: very cool.
Fabian Franz: Normal max 23 seconds.
Michael Meyers: Is that pretty common, you know, that this sort of [00:31:00] pattern, you know, the complexity of, of finding, um, I'm sure it's hard to say, but is it like 50/50, 80/20? Like what percentage of the time is a problem? Like really obvious and easy to fix and, you know.
Fabian Franz: Most time you find a root cause at some point, um, the trickiest performance problems is if nothing is slow, um, but just the whole thing is slow because it's just for example, too many layers deep.
Um, then the best thing you can do is to add really nice caching, um, because, but if there's something that is really slow, then it usually can be fixed in some way or another.
Michael Meyers: Awesome. Fabian, Jeremy. That was really cool. Thank you for walking us through that. For the folks listening, please be sure to check out our other Goose talks at Tag1.com/Goose. We have lots of blog posts on how to use [00:32:00] the tool, how to find performance problems and how to fix it.
Like today's talk, uh, if you have questions about Goose, please post them to the GitHub issue queues. If you'd like this talk, please remember to subscribe and share it out. You can check out our other Tag1 talks at tag1.com/tagteamtalks. And as always, we'd love your feedback on this episode, as well as topic ideas for future episodes, you can write to us at ttt@tag1.com.
That's a Tag1 Team Talks at tag1.com. And Fabian, thank you so much for joining us. Uh, Thank you to Pantheon for providing hardware to test these Drupal instances and thank you to everyone who tuned in really appreciate you joining us.