Real World Serverless with theburningmonk

#31: Stop forum spammers at the edge with CloudFlare Workers

September 30, 2020 Yan Cui Season 1 Episode 31

For more stories about real-world use of serverless technologies, please follow us on Twitter as @RealWorldSls and subscribe to this podcast.


Opening theme song:
Cheery Monday by Kevin MacLeod
Link: https://incompetech.filmmusic.io/song/3495-cheery-monday
License: http://creativecommons.org/licenses/by/4.0

Yan Cui: 00:13  

Hi, welcome back to another episode of Real World Serverless, a podcast where I speak with real world practitioners and get their stories from the trenches. Today I'm joined by Paul from StopForumSpam. Hi, welcome to the show.


Paul: 00:26  

Great, thanks for having me. 


Yan Cui: 00:28  

So Paul, you've got some really interesting stories to tell about CloudFlare Workers. But before we get into that, can you tell the audience about your work with StopForumSpam, what it is and what you guys are doing?


Paul: 00:42  

Sure. The project started as a learning-to-code experience about 14 years ago now. It started off providing a very simple API to half a dozen or a dozen forums, so they could pull information about email addresses and IP addresses that had been spamming the forum, and the information shared could then be used by other forums to block registrations. It slowly grew from 5,000 requests a day to its current rate of about 350 requests a second over the period of 14 years, and there have been some fairly serious architectural changes along the way, as was needed. We just couldn't handle the load coming through. At the moment we have 300,000 clients, and they are welcome to crowdsource the spammers that are hitting them, for the benefit of all the other clients. It's been getting very busy as we've moved through each engineering phase. We've come to the stage now where we've migrated the entire lot from running on VPS servers throughout Europe and America to running entirely on the CloudFlare Workers serverless platform, and it's been really exciting to see that pick up and go without any issue as well.


Yan Cui: 02:18  

Okay. Very cool. So, can you maybe tell us what the architecture looked like before? You had servers running in the US regions as well as in the European regions. What else did you have in terms of databases, load balancers and so on?


Paul: 02:35  

Sure. If we go back a couple of years to its previous iteration, which had been running for about five years before the migration, there was one main server in Europe. That was running Percona MySQL 5.6 for several years, fronted by a load balancer, really just for fault tolerance, to a couple of nginx servers running PHP, all hand-coded PHP and no framework at all, because at the time we started writing it there wasn't much in the way of frameworks that I was happy with anyway, so it was just pure PHP. Data was initially coming out of MySQL. However, as requests started increasing and more and more people started using it, MySQL just couldn't handle the load. So we pushed it into a caching layer with memcached, and eventually that couldn't handle the load either, so we pushed it into Redis. At that point we were quite happy with Redis, because it allowed us to bring up a couple more VPS nodes in Europe and the US and then use the internal Redis replication to push the API data around in real time. With some work on Redis we were able to get the memory usage down from its initial 800 megabytes to 160 to 180 megabytes per node, which brought the VPS cost down quite a lot, so you could run an API node on a $5 VPS, which is really quite handy. It's a free service, we're not funded by anyone, it's all just out of my own pocket, so keeping the cost down was, you know, very beneficial to me as well. We used a custom version of Redis, which added a couple of commands for doing IP space searches. And that's how it ran for about five years. Then the infrastructure started to creak around the edges and I had a couple of servers fall over. I put the whole lot behind CloudFlare after being hammered with denial of service attacks, and used [??], from what is now a GoDaddy service, to do geographic load balancing, and they really kept the system very stable for quite a while, but there was a lot of hopping between networks and I was just not quite happy with it. So when CloudFlare announced serverless I thought, well, this seems like a good platform, it runs on the network edge, close to your users. And with clients hitting the server quite frequently, latency is a real issue, so if I can keep the latency down then I'm not going to be impacting users. If StopForumSpam went down, I could see hundreds of thousands of sites having issues, and I really don't want that to happen. So I ploughed ahead and ported the PHP code that was running the API to JavaScript, and started tinkering around with learning JavaScript and learning serverless, the way that it worked, the issues with it, the pros and cons, and then moved on to the data migration. There were issues with data sharding, what's the best strategy for moving data from where it sits in MySQL to where it sits in the key value store in CloudFlare, and I came to a nice balance of keeping the traffic down between the nodes and the data store, while still keeping the latencies right down. After a couple of months of tinkering with the code I was happy enough to give a guy a look at it. He came up with some comments, we worked on some code revisions, and then deployed. No one was the wiser, there were no reports of issues or API inconsistencies. I was really quite happy and it's been running very well.
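For readers who want to picture the ported API, here is a minimal sketch of a CloudFlare Worker serving a lookup out of KV, in the Service Worker style used at the time. The namespace binding SFS_KV, the key scheme and the response shape are illustrative assumptions, not StopForumSpam's actual code:

```javascript
// Minimal Workers API handler backed by KV. SFS_KV is an assumed KV
// namespace binding; the key format and JSON shape are hypothetical.
addEventListener('fetch', event => {
  event.respondWith(handleRequest(event.request))
})

async function handleRequest(request) {
  const url = new URL(request.url)
  const ip = url.searchParams.get('ip')
  if (!ip) {
    return new Response(JSON.stringify({ success: 0, error: 'no ip supplied' }), {
      status: 400,
      headers: { 'content-type': 'application/json' }
    })
  }
  // KV reads are served from the local edge cache when possible;
  // a miss is fetched from the central store.
  const record = await SFS_KV.get(`ip:${ip}`, 'json')
  const body = {
    success: 1,
    ip: { value: ip, appears: record ? 1 : 0, ...(record || {}) }
  }
  return new Response(JSON.stringify(body), {
    headers: { 'content-type': 'application/json' }
  })
}
```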


Yan Cui: 6:23  

That's such an interesting story of migration, from using more traditional, I guess, serverful implementations to now running entirely on edge-based workers with a key value store. For the listeners who are not familiar with how CloudFlare Workers work and how KV, the key value store, works, can you maybe give us a quick explanation of what that looks like in terms of the architecture, and also how you write the code and deploy it?


Paul: 6:52  

Sure. CloudFlare serverless deploys your code to run on the edge network at any of the CloudFlare points of presence, and there are coming on 170 of those, I believe. So if you're in Los Angeles and you make an API query, it's going to be served from the Los Angeles data centre. Same for Singapore, or Auckland in New Zealand. The code runs locally; however, the data store isn't local, the data store is in the US. So if something isn't cached in the local data centre, the worker has to request it from the main store in the US, and that can add latency if you're requesting a lot of data. That's why I spent such a long time sharding the data in such a way that your small request at a data centre in Auckland isn't going to result in two or three megabytes of data being hauled down from Los Angeles. At the moment I've kept it down to about two kilobytes, which with compression is a single packet over the Argo tunnel between the sites. If data isn't cached, then it's pulled from Los Angeles and cached, and I'm also doing in-process caching, trying to keep everything in the data centre as long as I can before it's expunged and the next request has to go back to Los Angeles. Unfortunately you can't really see anything in the CloudFlare metrics to tell how much data is actually being served locally, in process or in the data centre, versus being pulled from the US. So it really was just trial and error: pull a large shard, lots of data, and see what the latency was; then reduce the shard size but increase the number of shards stored in KV, until you reach this nice balance of, you know, this works well and isn't going to be a manageability issue. If I had just done one key per record, there would be millions sitting in KV, and then you end up with an issue with being unable to populate KV, because the CloudFlare API allows you to set 10,000 keys in a request, with a 50 megabyte request limit; it would take four or five weeks to process that many keys. So there's a nice balance there at the moment and I'm quite happy with it. For deploys, this is where GitHub Actions comes into effect. MySQL pushes all the data it needs, on a cron basis, out as JSON, which triggers a CloudFlare worker, which validates the request and then pushes it to GitHub, and then GitHub runs an action which pushes the JSON into the codebase and deploys the API with the updated JSON, which includes the global identifiers like the whitelists and the deny lists for domains, IP addresses and ranges that aren't dynamically updated in the API. As with every anti-spam service, data that is an hour out of date is stale data. The data is refreshed every 10 seconds in the KV, just to make sure that the moment a bad email address is listed, it's available as quickly as possible to the customers, or to the clients anyway. So the GitHub workers, sorry, the GitHub Actions have really been invaluable in pushing static data out hourly to CloudFlare Workers. It just triggers an action, GitHub compiles the application and then deploys it using the CloudFlare API. It's really quite seamless and very easy to get done. I'm not a developer, well, I don't do it professionally, so this is a bit of a pet project as well, so it was all learning, but it was just done so easily. It really was.
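To make the sharding trade-off concrete, here is a hypothetical sketch of bucketing records into a fixed number of small shards, so a lookup pulls one roughly two-kilobyte KV value rather than megabytes, and avoids the one-key-per-record extreme that made bulk population impractical. The hash function and shard count are assumptions for illustration:

```javascript
// Hypothetical sharding scheme: records are grouped into SHARD_COUNT
// buckets, each stored as one small JSON object in KV. Tuning SHARD_COUNT
// trades shard size against the number of keys to populate.
const SHARD_COUNT = 4096

function shardKeyFor(value) {
  // Cheap stable string hash to pick a shard; any uniform hash works.
  let h = 0
  for (const c of value) h = (h * 31 + c.charCodeAt(0)) >>> 0
  return `shard:${h % SHARD_COUNT}`
}

async function lookup(value) {
  // One small KV read per request; the shard maps value -> record.
  const shard = await SFS_KV.get(shardKeyFor(value), 'json')
  return shard ? shard[value] || null : null
}
```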


Yan Cui: 10:40  

Okay. And do you have any idea how quickly those CloudFlare Workers running in all those different edge locations get updated, once you deploy an update to them?


Paul: 10:53  

With KV, if you make an update, it's non-atomic, which is something I also had to work around in the code. It takes about 10 to 20 seconds from the moment it hits MySQL to the moment it's available in the KV store. The code itself takes about a minute to compile, from the moment the action gets triggered to the moment it's deployed, and then it's available within the edge data centres within about 10 to 15 seconds, from what I can tell from timing and testing. It doesn't take too long.


Yan Cui: 11:29  

Okay, that's pretty good. That's really good compared to some of the other alternatives. Within AWS, for example, you have Lambda@Edge, and that takes a good few minutes before updates are reflected in all the different edge locations. Another question I've got about CloudFlare Workers is that you mentioned you're doing a lot of caching in process. I guess that implies the process that's running your worker code is long-lived. Do you have any sense of how long that process runs for?


Paul: 11:58  

I spent a long time trying to figure that one out, down to the millisecond. However, as you're hitting a load-balanced interface, you hit an endpoint within CloudFlare and it could go to any of the servers in the data centre. The best I can tell is that some of the processes running under load last minutes before they're refreshed. In the data centres that aren't running that much in the way of API requests, like here in New Zealand, or some of the data centres maybe in North Africa, they last about 30 seconds before the process is expunged. So by keeping everything in process where I can, I can cut the requests down to KV quite a lot.
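The in-process caching Paul describes relies on the fact that variables at module scope survive across requests for as long as the Worker isolate lives, anywhere from seconds to minutes depending on data centre load. A minimal sketch, with an assumed TTL matched to the 10-second refresh cadence mentioned earlier:

```javascript
// Module-scope state persists across requests within one Worker isolate,
// so hot shards can skip the KV read entirely. TTL is an assumption.
const cache = new Map()
const TTL_MS = 10 * 1000 // roughly match the ~10 second data refresh

async function cachedGet(key) {
  const hit = cache.get(key)
  if (hit && Date.now() - hit.at < TTL_MS) return hit.value
  const value = await SFS_KV.get(key, 'json') // fall back to KV on a miss
  cache.set(key, { value, at: Date.now() })
  return value
}
```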


Yan Cui: 12:45  

Okay. And what about the size of the code for the worker? Is there any limit on how big it can be, or how long your code can run for?


Paul: 12:54  

Yeah, there is, and I hit that trying to reduce KV lookups. I thought I'd be smart and put a Bloom filter in place before going to KV to get data; however, that pushed requests above 50 milliseconds, which is the current execution limit on the tier that I have with CloudFlare. Now, there is a higher tier on the enterprise plan and there is a lower tier on the free service that CloudFlare provide, but I was getting issues with code executing over the limit and not dying after 50 milliseconds of CPU time, and that's CPU time, not real time. So eventually I had to pull the Bloom filters out of the code and make do with a little bit more in-process caching. At the moment I'm running the 99.9th percentile request at 17 milliseconds, which I'm quite happy with, the 99th percentile being about four milliseconds. So with the current CloudFlare Workers, you have to be very wary of the CPU limits that you have at your tier. CloudFlare have announced the beta of the new CloudFlare Workers Unbound, which has a different costing model, based on per-millisecond execution, so you can run longer-running code on their platform. I've not put my code into it to look at it yet, but I'd be very interested; that way I might be able to reduce some of the KV queries, and it would expand on the next version of the API, which I'm currently working on at the moment. I'm also excited for it to be native to Workers, coded for Workers. It won't be a port of 13 years of nightmare code that I've been, you know, sneaking in there and wishing I hadn't.
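The Bloom filter pre-check Paul describes trying, and then removing because the hashing pushed CPU time past the 50 millisecond limit, might have looked roughly like this; the bit array, hash count and hash function are illustrative assumptions:

```javascript
// Bloom filter membership check ahead of a KV lookup. `bits` is an assumed
// Uint8Array built offline from the listed values; `numHashes` is the
// number of hash functions used when it was built. A "false" answer is
// definitive (skip KV); a "true" answer still needs the KV lookup. The
// per-value hashing loop is exactly the CPU cost that blew the budget.
function mightContain(bits, numHashes, value) {
  const m = bits.length * 8
  for (let i = 0; i < numHashes; i++) {
    let h = i + 1 // vary the seed per hash function
    for (const c of value) h = (h * 33 + c.charCodeAt(0)) >>> 0
    const bit = h % m
    if (!(bits[bit >> 3] & (1 << (bit & 7)))) return false // definitely absent
  }
  return true // maybe present: fall through to the KV lookup
}
```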


Yan Cui: 14:46  

Yeah, I do remember reading that announcement post, I guess a few weeks ago now, when they first announced Workers Unbound. It does look very cool. Still, though, a 99.9th percentile of 17 milliseconds, that's pretty good going.


Paul: 15:03  

And it's going to come down as well, as CloudFlare have merged, or integrated, SSL certificate negotiation with the worker startup process. Previously, you'd go through the whole TLS negotiation process and then your worker would be spun up. Now they are spinning the workers up as part of the TLS negotiation process. That was great, because it reduces latency quite a bit.


Yan Cui: 15:31  

Okay, that's good to know. Is that something that's out already, or are they still working on that?


Paul: 15:37  

No, I believe it's out now.


Yan Cui: 15:39  

Okay. Very cool. One thing I did have in mind as you were talking about your architecture and how requests flow through your system is how you go about debugging things. How do you go about, you know, tracing problems, checking your logs?


Paul: 15:52  

Yeah. Initially it started with using the debug functionality in the CloudFlare Workers editor, with debugging via console.log. It wasn't until CloudFlare Wrangler came along that you could spin up local development environments, which allows a little bit more in the way of Node debugging. I'm not a proficient Node coder, but I did get into the Node debugging to look at some things that were really starting to get on my nerves. But for the best part of it, I hate to say it, but console.log actually goes quite a long way.


Yan Cui: 16:39  

So do all the logs that you print out with console.log get centralised into one place in the CloudFlare, I guess, management console for you?


Paul: 16:49  

No, they just go to /dev/null. So I have very little in the way of debug logs. If the process throws an exception, then I get an HTTP push through to a log collector, where I can see a stack dump and a copy of the URL parameters, so that I can then backtrace it. But so far, other than people running pen tests and vulnerability scans on the API, which triggers pages and pages of logs, I'm not seeing anything in the way of errors coming through from the API. There are occasional exceptions thrown from the CloudFlare storage engine, but they are very infrequent and impossible to replicate.
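A sketch of the exception reporting Paul describes: catch errors in the handler and push a stack dump plus the request URL, with its parameters, to an external log collector over HTTP. The collector URL and payload shape are placeholder assumptions:

```javascript
// Wrap the API handler so uncaught exceptions are reported out-of-band.
addEventListener('fetch', event => {
  event.respondWith(handle(event))
})

async function handle(event) {
  try {
    return await handleRequest(event.request)
  } catch (err) {
    // waitUntil lets the report finish after the response is returned,
    // so error reporting adds no latency to the client.
    event.waitUntil(fetch('https://logs.example.com/collect', {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        message: err.message,
        stack: err.stack, // the stack dump
        url: event.request.url // includes the URL parameters for backtracing
      })
    }))
    return new Response('internal error', { status: 500 })
  }
}

async function handleRequest(request) {
  return new Response('ok') // stand-in for the real API handler
}
```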


Yan Cui: 17:33  

Okay, cool. I think that's everything I wanted to ask about CloudFlare Workers, but you mentioned that you've got a new project you're working on. Or is that a new codebase for StopForumSpam?


Paul: 17:44  

For the API, yes. At the moment the version one API is 13, 14 years old, and I've been adding things to it as people have been requesting them. But as with everything that grows organically, you wish you could go back and revisit it and rebuild it, and I just didn't have the time. So I thought, for API version two I'm just going to rebase the entire codebase. It's going to be JSON only, instead of some of the more colourful serialisation formats that API version one supports. I'm going to deprecate API version one, but I'm never going to remove it. There are just too many people out there that rely on it, and leaving them with an old, unsupported codebase, I think that would just be cruel of me. So version two is going to be entirely on CloudFlare Workers. People running version one need not worry about the API becoming unavailable; it's just not going to happen. It's going to remain free for everyone to use, for eternity. I'm looking at how API version two could use GraphQL, how we can personalise it more for a per-user experience, and whether there's functionality we can put into the API so that existing integrations, with software that you can't edit, could be extended. People are bouncing ideas off me all the time, and I'm chatting to people on the StopForumSpam site. We've got a great community there of people that volunteer their time as much as I volunteer mine, so it's all been a great experience. To think it has come a long way, from "how can I learn to code PHP and MySQL, let's play around with this" to something that's serving 300,000 clients, the best part of seven or eight billion requests a year, and a terabyte of traffic going out. It's come a long way and I want to see it grow even more.


Yan Cui: 19:38  

That sounds pretty exciting. And I guess I've not heard of anyone running GraphQL in a CloudFlare Worker, or any sort of edge worker for that matter, so I'm really quite interested to see how that project pans out. But I do want to ask you about the rollout plan for this new version, because 300,000 clients is a huge number of people that have to upgrade to a new API. Do you have some idea of how you are going to roll out this new version?


Paul: 20:09  

It's just going to be rolled out on a different endpoint, a different REST endpoint, for the moment. Version one is not going to change, and the codebase for version one isn't going to change. It's just going to use the CloudFlare Workers routing to route the new version's requests to a different worker, so the two run completely independently of each other, and if you don't want to upgrade from API version one, you won't have to; I'm not going to mandate that at all. I'm just hoping that version two is going to provide enough in the way of new and interesting features that people go, you know, that sounds like a really good thing, let me just change the endpoint and use JSON. And if you don't want to use the new features, I'm trying to make version two as backwards compatible as I can with version one, so you can just ignore them in the serialised output. As for GraphQL, I'm still playing with it. I had a friend tease me about implementing it, and I discounted it, but I can't help it, it's sitting there in the back of my mind now, so I'll probably end up deploying a GraphQL version of the endpoint as well, just to keep them happy and to stop them nagging me. The codebase is coming along slowly. I'm doing it in my spare time, and the whole project is a hobby as such, so I tinker around with the API code and the features in my spare time, on the weekends and in the evenings. So there's no real timescale, though certainly I'd want to have API version two deployed before the end of the year, or certainly available for people to test and provide feedback on, not so much functional testing as integration testing feedback about the features, testing them and seeing if they really work in the environments most people are going to be using it in, which is protecting forums and blogs from the scourge of criminal spam. It's not used exclusively for those, though; it's used by small companies to prevent fraud on their websites. You get these emails from small companies saying, "we implemented your API in this way", some way I never thought it would be used, and they say, "look, you've saved us hundreds of thousands of pounds in fraud every year". Thanks very much. It's things like that that really make me want to keep doing this.
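Paul is using CloudFlare's route configuration to send version one and version two to separate Workers; as a rough in-code illustration of the same split, here is a path-based dispatch sketch in a single Worker, with assumed endpoint paths and handler stubs:

```javascript
// Illustrative version split. In practice CloudFlare routes (e.g. a route
// pattern per version) map each path to its own Worker; the equivalent
// dispatch inside one Worker looks like this. Paths are assumptions.
addEventListener('fetch', event => {
  const { pathname } = new URL(event.request.url)
  if (pathname.startsWith('/api/v2')) {
    event.respondWith(handleV2(event.request)) // new JSON-only API
  } else {
    event.respondWith(handleV1(event.request)) // legacy API, left unchanged
  }
})

async function handleV1(request) {
  return new Response('v1 response') // stand-in for the untouched v1 code
}

async function handleV2(request) {
  return new Response(JSON.stringify({ version: 2 }), {
    headers: { 'content-type': 'application/json' }
  })
}
```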


Yan Cui: 22:39  

That's amazing. And also, I guess, good luck with the new version of your API. It sounds like there's going to be a lot of fun ahead for you, in terms of learning but also just building stuff. How about in terms of the cost? Because you're volunteering your time, but also, like you said, you're paying for these things yourself as well. What's the cost looking like for CloudFlare Workers so far?


Paul: 23:02  

Well, the whole costs are pretty good. [??] in the Netherlands provides me a server for free, and they have for several years now, so a big thank you to [??] and particularly Leslie, who keeps it running for me. Linode provided some VPS servers as well. And CloudFlare provided the Workers as part of their For the Greater Good project. At the moment it's serving 30 million requests a day, and on the costing model that is about 25 US dollars a day. So if there was financial backing and I had budget to work on whatever infrastructure I wanted, then, you know, CloudFlare serverless is just brilliant. I don't see why anyone wouldn't run an API on it if they're running a business. As a project for the greater good, done in my spare time, if CloudFlare hadn't stepped up and offered the service as they have, then it would still be running on VPS servers. You know, I managed to get the code running on $5 Linode VPS servers and they were rock solid, so I could have had five of those running in Europe and in the US and just kept Redis, and instead of $25 a day it would have been $25 a month. But with so many customers, so many people using it, I had to provide more in the way of high availability and redundancy and fault tolerance, and CloudFlare Workers does it. So I'm using it.


Yan Cui: 24:41  

Okay, very good to know, and really good of them to step up to the plate and help out the community by paying for the infrastructure for your project.


Paul: 24:50  

It's been phenomenally awesome. I couldn't have done it without them.


Yan Cui: 24:54  

All right, so I guess that's the end of all the questions I've had. Is there anything else that you'd like to tell the listeners before we go? Are there any personal projects that you want to mention?


Paul: 25:03  

This is pretty much what I'm working on in my spare time. I'm hoping to get more into the serverless environment and the serverless community here in New Zealand. Unfortunately, COVID has put a bit of a damper on that. I was shortlisted for a ten-minute lightning talk at the Serverless Days Auckland conference, which has been cancelled, or delayed until next year, so I'm hoping the playing around with CloudFlare serverless will give me a bit more experience, so I can go to that next year and provide a bit of knowledge to people who are interested in hearing about either the project or CloudFlare serverless. CloudFlare have a small market share next to the giants, Azure with Functions, AWS with Lambda, and Google, but, you know, there are other serverless providers out there that provide just as good a service. And the more I'm learning, the more I can compare. I want to have more of a look at AWS Lambda too; I'm just naturally curious about it all. It seems like a fantastic emerging technology, one that I'm interested in, so, you know, why not turn the personal interest into professional interest as well.


Yan Cui: 26:23   

Excellent. I'm looking forward to COVID going away and being able to see your talk at the Serverless Days conference.


Paul: 26:32  

Oh, it'll be online. I'll send you an invite. 


Yan Cui: 26:36 

Great. Sounds good. Perfect. And with that, thank you again for joining us today and sharing your story, and I hope you stay safe and hopefully see you sometime.


Paul: 26:27  

I look forward to COVID being over and getting back to Europe to see tech, do tech and talk tech as well, because I spent many years there and have a great love for the place. It'd be good to get out there and see and talk to some familiar faces. So, thank you.


Yan Cui: 27:05 

Okay. Bye. Bye. Take care. 


Paul: 27:07  

Thanks.


Yan Cui: 27:20  

That's it for another episode of Real World Serverless. To access the show notes and the transcript, please go to realworldserverless.com, and I'll see you guys next time.