On one of my slaves I got this error:
write failed: No space left on device (28)
I also found that the slave had hit error 1062 (“Duplicate entry”) and stopped. I freed up some space by removing old logs. When I tried to restart replication with pt-slave-restart, I found that the IO_Thread downloaded binlogs from the master and used up all the free space again.
As a workaround, I decided to start just the SQL_Thread, let it process all the relay logs, and then start the IO_Thread again.
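The sequence described above could be sketched roughly as follows. This is only a sketch under a few assumptions: the mysql client can log in without extra options (for example via ~/.my.cnf), and "all relay logs processed" is detected by comparing the read and executed master positions in the "show slave status" output. The function names status_field, relay_logs_applied and drain_relay_logs are made up for illustration. The wait loop does not handle a SQL thread that has stopped on an error.

```shell
# Sketch only: assumes "mysql" connects without extra options.

# Print the value of one field from "show slave status\G" output read on stdin.
status_field() {
    awk -v key="$1:" '$1 == key { print $2; exit }'
}

# Succeed when the SQL thread has executed everything the IO thread has fetched.
relay_logs_applied() {
    local status read_file exec_file read_pos exec_pos
    status=$1
    read_file=$(printf '%s\n' "$status" | status_field Master_Log_File)
    exec_file=$(printf '%s\n' "$status" | status_field Relay_Master_Log_File)
    read_pos=$(printf '%s\n' "$status" | status_field Read_Master_Log_Pos)
    exec_pos=$(printf '%s\n' "$status" | status_field Exec_Master_Log_Pos)
    [ -n "$read_file" ] && [ "$read_file" = "$exec_file" ] && [ "$read_pos" = "$exec_pos" ]
}

# Start only the SQL thread, wait until the relay logs are applied,
# then resume downloading binlogs from the master.
drain_relay_logs() {
    mysql -e "start slave sql_thread;"
    until relay_logs_applied "$(mysql -e 'show slave status\G')"; do
        sleep 5
    done
    mysql -e "start slave io_thread;"
}
```

Calling drain_relay_logs once on the slave would run the whole sequence. The 5 second polling interval is an arbitrary choice.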
Here is the quick bash one-liner I created: it checks replication and, if a 1062 error is present, skips the offending event and starts the SQL_Thread again.
while true; do if mysql -e "show slave status\G" | grep -q "Last_SQL_Errno: 1062"; then mysql -e "set global sql_slave_skip_counter=1; start slave sql_thread;"; fi; sleep 1; done
This was enough to get the issue fixed.
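For anyone who wants to leave such a loop unattended, a slightly more defensive variant of the one-liner above might look like this. It is a sketch, not what I ran: it assumes the mysql client can log in without extra options, and it adds one behaviour the one-liner does not have, namely stopping when the SQL thread reports any error other than 1062, so that unrelated problems are not silently ignored. The function names last_sql_errno and skip_duplicate_entries are made up for illustration.

```shell
# Sketch only: assumes "mysql" connects without extra options.

# Print Last_SQL_Errno from "show slave status\G" output read on stdin.
last_sql_errno() {
    awk '$1 == "Last_SQL_Errno:" { print $2; exit }'
}

# Skip 1062 errors one event at a time; return 1 on any other SQL error.
# Optional first argument: polling interval in seconds (default 1).
skip_duplicate_entries() {
    local interval errno
    interval=${1:-1}
    while true; do
        errno=$(mysql -e 'show slave status\G' | last_sql_errno)
        case $errno in
            1062)
                mysql -e "set global sql_slave_skip_counter=1; start slave sql_thread;"
                ;;
            0|"")
                # No SQL error reported, keep polling.
                ;;
            *)
                echo "unexpected SQL error $errno, stopping" >&2
                return 1
                ;;
        esac
        sleep "$interval"
    done
}
```

Like the one-liner, this runs until interrupted with Ctrl+C unless it hits a different error. Skipping events this way means the slave may differ from the master, so it is worth checking the data afterwards.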
My main background in recent years has been as a MySQL DEV/DBA engineer, but I also take part in a wider range of tasks that could be called DataOps/DevOps. You can see my online CV here: http://catyellow.net