By now, you have probably heard about a major bug in Intel (and ARM, and maybe AMD) processors. Since this vulnerability affects all processors types, you will probably need to update your phone, your tablet, your PC, and all of your servers in the coming weeks. Why and how does it affect everything?
What is this Bug About?
Modern (since the early 2000s) processors use what’s known as “out-of-order” execution. This is somewhat similar to the way SQL Server does read-ahead–in order to improve performance the CPU will execute a series of instructions before the first one is necessary complete. The way this bug works is that an attacker can pass code that will fail, but using some trickery and the magic of caches, build and retrieve (at up to 500 kb/s ) all of the kernel memory. This means things like passwords, that are normally secured in kernel memory can be stolen very easily by an attacker.
Should I Patch?
This is deadly serious bug, that is easy to exploit, with a ton of vectors. You should go through your normal test cycle with patch validation, but applying these patches should be a key priority on the coming weeks.
For a comprehensive listing of patches Allan Hirt (b|t) of SQLHA has put together the most comprehensive reference I’ve seen.
The biggest impact thing that I’ve gleaned from reading all of this stuff, is that if you are running VMWare ESX 5.5, the patches that VMWare has released for that version are not comprehensive, so you will need to likely upgrade to ESX 6 or higher.
Will This Impact My Performance?
Probably–especially If you are running on virtual hardware. For workloads on bare metal, the security risk is much lower, so Microsoft is offering a registry option to not include the microcode fixes. Longer term especially if you are audited, or allow application code to run on your database servers, you will need to enable the microcode options.
This will likely get better over time as software patches are released, that are better optimized to make fewer calls. Ultimately, this will need to fixed on the hardware side, and we will need a new generation of hardware to completely solve the security issue with a minimum impact.
What Do I Need to Patch?
Just about everything. If you are running in a Platform as a Service (Azure SQL DW/DB) you are lucky–it’s pretty much done for you. If you are running in a cloud provider, the hypervisor will be patched for you (it already has if you’re on Azure, surprise–hope you had availability sets configured), however you will still need to patch you guest VMs. You will also need to patch SQL Server–see this document from Microsoft. (also, note a whole bunch of new patches were released today.
What Other Things Do I Need to Think About?
This is just from a SQL Server perspective, but there are a few things that are a really big deal in terms of security now until you have everything patched:
- Linked Servers–If you have linked servers, you should assume an attacker can read the memory of a linked server by running code from the original box
- VMs–until everything is patched, an attacker can do all sorts of bad things from the hypervisor level
- CLR/Python/R–By using external code from SQL Server an attacker can potentially attack system memory.
In the last two days, I’ve been part of two discussions, one of which was about the need to run CHECKDB on modern storage (yes, the answer is always yes, and twice on Sundays), and then another about problems with third party backup utilities. Both of these discussions were born out of (at the end of the day) infrastructure teams wanting to treat database servers like web, application, and file servers, and have one tool to manage them all. Sadly, the world isn’t that simple, and database servers (I’m writing this generically, because I’ve seen this issue crop up with both Oracle and SQL Server). Here’s how it happens, invariably your infrastructure team has moved to a new storage product, or bought an add on for a virtualization platform, that will meet all of their needs. In one place, with one tool.
So what’s the problem with this? As a DBA you lose visibility into the solution. Your backups are no longer .bak and .trn files, instead, they are off in some mystical repository. You have to learn a new tool, to do the most critical part of your job (recovering data), and maybe you don’t have the control over your backups that you might otherwise have. Want to stripe backups across multiple files for speed? Can’t do that. Want to do more granular recovery? There’s no option in the tool for that. Or my favorite one—want to do page level recovery? Good luck getting that from a infrastructure backup tool. I’m not saying all 3rd party backup tools are bad—if you buy one from a database specific vendor like RedGate, Idera, or Quest, you can get value-added features in conjunction with your native ones.
BTW, just an FYI, if you are using a 3rd party backup tool, and something goes terribly sideways, don’t bother calling Microsoft CSS, as they will direct you to the vendor of that software, since they don’t have the wherewithal to support solutions they didn’t write.
Most of the bad tools I’m referring to, operate at the storage layer by taking VSS snapshots of the database after quickly freezing the I/O. In the best cases, this is non-consequential, Microsoft let’s you do it in Azure (in fact it’s a way to get instant file initialization on a transaction log file, I’ll write about that next week). However, in some cases these tools can have faults, or take too long to complete a snapshot, and that can do things like cause an availability group to failover, or in the worst case, corrupt the underlying database, while taking a backup, which is pretty ironic.
While snapshot backups can be a good thing for very large databases, in most cases with a good I/O subsystem, and backup tuning (using multiple files, increasing transfer size) you can backup very large databases in a normal window. I manage a system that backs up 20 TB every day with Ola Hallengren’s scripts, and not even storage magic. Not all of these storage based solutions are bad, but as your move to larger vendors who are further and further removed from what SQL Server or Oracle are, you will likely run into problems. So ask a lot of questions, and ask for plenty of testing time.
So if you don’t like the answers you get, or the results of your testing, what do you do? The place to make the arguments are to the business team for the applications you support. Don’t do this without merit to your argument—you don’t want to unnecessarily burn a bridge with the infrastructure folks, but at the end of the day your backups, and more importantly your recovery IS YOUR JOB AS A DBA, and you need a way to get the best outcome for your business. So make the argument to your business unit, that “Insert 3rd Party Snapshot Magic” here isn’t a good data protection solution and have them raise to the infrastructure management.
Something that has come from Microsoft in the last couple of months is the ability to provision an Azure Virtual Machine (VM) with fewer CPUs, than the machine has allocated to it. For example, by default a GS5 VM has 32 CPUs and 448 GB of memory. Let’s say you want to migrate your on-premises SQL Server that has 16 cores and 400 GB of RAM. Well, if you wanted to use a normal GS5, you would have to license an additional 16 cores of SQL Server (since you have all that RAM I’m assuming that you are using Enterprise Edition). Now, with this option, you can get a GS5 with only 16 (or 8) CPUs.
I can hear open source database professionals laughing at turning down CPUs, but this is a reality in many Oracle and SQL Server organizations, so Microsoft is doing us a favor. This doesn’t apply to all VM classes, and currently the pricing calculator does not show these options, however they are in the Azure portal when you select a new VM for creation. The costs of these VMs are the same as the fully allocated ones, though if you are renting your SQL Server licenses through Azure those costs are less.
This option is available on the D, E, G, and M series VMs, and mainly on the larger sizes in those series. If you would like to see cores reduced on other VM sizes, send feedback to Microsoft.
I did a precon at Philly Dot Net’s Code Camp, and I promised that I would share the resources I talked about during the event. I’d like to thank everyone who attended the audience was attentive and had lots of great questions. So anyway here goes in terms of resources that we talked about during the session:
1) Automated build scripts for installing SQL Server.
2) Solarwinds free Database Performance Analyzer
2) SentryOne Plan Explorer
3) Glenn Berry’s Diagnostic Scripts- (B|T) These were the queries I was using to look at things like Page Life Expectancy, Disk Latency on the plan cache, Glenn’s scripts are fantastic and he does and excellent job of keeping them up to date.
4) SP_WhoIsActive this script is from Adam Machanic (b|t) and will show you what’s going on a server at any point in time. You have the option of logging this to a table as part of an automated process.
5) Here’s the link to the Jim Gray white paper about storing files in a database. It’s a long read, but from one of the best.
Finally, my slides are here. Thanks everyone for attending and thanks to Philly Dot Net for a great conference.