Yes, I've noticed the spikes, and I'm aware that the load average shown in top or uptime isn't always an accurate representation. I also know the relationship between I/O load and CPU load, and I'm very familiar with tuning the Linux kernel, interrupts, and CPU affinity. I/O load is not an issue, especially now that my caching is working: I finally got the battery for the RAID controllers, so I now have battery-backed write cache, and my load has dropped to below one.

My monitoring utilities keep an eye on replication status and alert when it falls behind. When that happens, a script switches all the reporting functions over to the main database server instead of the slave. The replication server has actually gone down once (a bad drive for the root partition, which was an old SSD I found here and not part of the RAID array). When I got the server back up and restarted replication, it was over 10000 seconds behind. It caught up in a matter of minutes, however, and reporting was then switched back to the slave.

Error 1062 (duplicate entry) can safely be skipped on the slave, and it seems to be the source of most of the headaches in getting the slave back up without doing another dump and restore and starting replication again from scratch. Also, you do not always have to take such an extreme measure to get replication running again. If you review the binlogs with mysqlbinlog on both the master and slave, you can find the transaction that failed and is causing replication to stop, then resume replication from a different position. It might have to play catch-up, but it will, and replication will resume.
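To make the monitoring and failback logic a bit more concrete, here is a rough Python sketch of the decision part only. It is not my actual script. It assumes you have already fetched one row of SHOW SLAVE STATUS into a dict; the function names, the 60-second lag threshold, and the host names are all made up for illustration.

```python
# Hypothetical sketch: classify a slave from one SHOW SLAVE STATUS row
# and pick which server reporting should use. Names and thresholds are
# illustrative assumptions, not taken from any real deployment.

DUPLICATE_KEY_ERRNO = 1062  # MySQL "duplicate entry" error


def classify_slave(status, max_lag_seconds=60):
    """Return 'healthy', 'lagging', 'stopped_duplicate_key' or 'stopped_other'.

    `status` is a dict keyed by SHOW SLAVE STATUS column names.
    """
    io_running = status.get("Slave_IO_Running") == "Yes"
    sql_running = status.get("Slave_SQL_Running") == "Yes"

    if io_running and sql_running:
        lag = status.get("Seconds_Behind_Master")
        # A NULL (None) lag means the value is unknown, so treat it as lagging.
        if lag is None or int(lag) > max_lag_seconds:
            return "lagging"
        return "healthy"

    # A replication thread has stopped; check whether it was a duplicate key.
    if int(status.get("Last_SQL_Errno") or 0) == DUPLICATE_KEY_ERRNO:
        return "stopped_duplicate_key"
    return "stopped_other"


def reporting_host(state, master="db-master.example", slave="db-slave.example"):
    """Reports go to the slave only while it is healthy, otherwise the master."""
    return slave if state == "healthy" else master


if __name__ == "__main__":
    row = {
        "Slave_IO_Running": "Yes",
        "Slave_SQL_Running": "Yes",
        "Seconds_Behind_Master": 10000,
    }
    state = classify_slave(row)
    print(state, reporting_host(state))
```

With classic (non-GTID) replication, the "stopped_duplicate_key" case is where you would typically run SET GLOBAL sql_slave_skip_counter = 1 followed by START SLAVE, while "stopped_other" is the case where reviewing the binlogs with mysqlbinlog is worth the time.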
Also, I'm not sure what type of replication you are using, but the replication process itself does not lock tables as you described. It makes use of binary logs (binlogs), which are written copies of the actual SQL transactions on the master; these are read and recreated as relay logs on the slave, where they are then replayed. The copying is done by an I/O thread that runs continuously over the master/slave connection. This page explains the process quite well:
https://www.percona.com/blog/2013/01/09 ... ally-work/

I'm still quite busy, so I haven't gotten around to writing a tutorial. I'm working two sysadmin jobs and recently started my own consulting company, but I will get one written up sometime. Maybe I'll post it on my company's site as well, although this is definitely one of the services we offer to clients and make considerable money from. Anyway, thanks for the feedback. I love this product, SIP technology in general, and the open source community you guys have here.