Database Crash on 3/6/2012
6 March 2012 at 11:17 pm | Posted in Service Updates
Another of our databases crashed at 11:30 am PST. The crash was caused by the same bug we encountered on 2/24, but on a different database instance. With assistance from Percona, we inspected the core file to identify the corrupt table, which we then dumped and dropped while in recovery mode and restored with the database in normal mode. Full operation was restored at 12:30 pm PST.
While the database was down, customers whose accounts reside on that instance were unable to log in or view their data. After the database recovered and data resumed flowing into our system, those customers may see a flat line across the outage window, because the agent held down-sampled data while it was unable to report.
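To illustrate why buffered data can show up as a flat line, here is a minimal sketch. It assumes the agent collapses the samples it could not report into a single average and that the chart draws that one value for every minute of the outage. The function names and numbers are hypothetical, and the real agent may aggregate differently.

```python
from statistics import mean


def downsample(buffered_samples):
    """Collapse samples buffered during an outage into a single average."""
    return mean(buffered_samples)


def chart_points(before, buffered, after):
    """Return the series a chart would plot once reporting resumes."""
    flat_value = downsample(buffered)
    # Every minute of the outage window is drawn with the same averaged value,
    # so the detail inside the window is lost and the line looks flat.
    return before + [flat_value] * len(buffered) + after


before = [120, 135]
buffered = [100, 300, 200, 400]  # hypothetical samples held while the database was down
after = [150]
series = chart_points(before, buffered, after)
print(series)
```

The individual values inside the window are gone, but their average is preserved, so totals over the window remain meaningful.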
After the crash on 2/24, Percona informed us of a workaround that would prevent the bug from occurring. However, enabling the workaround could have degraded the performance of DELETE operations. Because of that potential impact, we decided to test it thoroughly before rolling it out across all the database instances. Unfortunately, the bug appeared again before we could conclude testing.
Needless to say, the workaround has now been applied across all our databases, and we are eagerly awaiting Percona Server 5.5.21, which includes a permanent fix for this bug.
Shard outage on 02/24/2012
25 February 2012 at 12:21 am | Posted in Service Updates
Around 21:21 Pacific on 02/24/2012, one of our shards crashed. We were able to bring the database back online at 23:39. During the outage, customers on that shard were unable to access their data. We were also not collecting data on that shard during that time.
It appears that we hit MySQL bug 61104: http://bugs.mysql.com/bug.php?id=61104
We attempted to restart the database a couple of times, but it crashed repeatedly. Following the bug report, we applied a workaround by adding innodb_change_buffering = inserts to the my.cnf file. We then restarted MySQL with innodb_force_recovery = 4 and innodb_purge_threads = 0. Diagnosis showed that we had a corrupt index. We dropped the index, reloaded the table, and then recreated the index. When this was complete, we brought the shard back online.
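As a rough illustration of how one might confirm the workaround setting is present on an instance, the sketch below parses my.cnf-style text and reports whether innodb_change_buffering is set to inserts under [mysqld]. The helper name and the sample file contents are hypothetical, and the parsing is simplified: a real my.cnf can contain directives such as !includedir that this does not handle.

```python
import configparser

# The workaround setting named in the post; applies to the [mysqld] section.
REQUIRED = {"innodb_change_buffering": "inserts"}


def missing_workaround(my_cnf_text):
    """Return the required settings that are absent or set to another value."""
    parser = configparser.ConfigParser(
        allow_no_value=True,  # my.cnf allows bare flags such as skip-name-resolve
        strict=False,         # tolerate repeated sections or options
        interpolation=None,   # treat '%' in values literally
    )
    parser.read_string(my_cnf_text)
    if not parser.has_section("mysqld"):
        return dict(REQUIRED)
    missing = {}
    for key, expected in REQUIRED.items():
        actual = parser.get("mysqld", key, fallback=None)
        if actual != expected:
            missing[key] = expected
    return missing


# Hypothetical file contents for demonstration only.
sample = """
[mysqld]
skip-name-resolve
innodb_change_buffering = inserts
"""
print(missing_workaround(sample))
```

An empty result means the setting is in place. The recovery-only options (innodb_force_recovery and innodb_purge_threads) are deliberately left out of the check, since the post describes them as part of the restart rather than the standing workaround.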
We are running Percona Server 5.5.15. This bug is fixed in Percona 5.5.17. We will be testing this version of MySQL and will upgrade once thorough testing has been completed.
Thanks,
Bayard Carlin
Maintenance on 03/09/2012
24 February 2012 at 1:31 pm | Posted in Service Updates
We will be upgrading our database shards on Friday, March 9th, starting around 8 PM Pacific time. There will be an outage on each shard as it is upgraded. We anticipate no more than 30 minutes of downtime per shard. During the maintenance, customers will be unable to post or view data. We will send out additional reminders before March 9th.
Service issue on 02/15/2012 at 17:30 PST
16 February 2012 at 10:02 am | Posted in Service Updates
Yesterday evening around 5:30 PM Pacific we experienced an outage caused by a fiber cut at our co-location site. In theory the fiber cut should have had no effect, as it was a redundant loop. In practice it triggered a routing event that caused a complete outage for at least 15 minutes.
Our co-location provider is investigating the root cause of the routing event.
We are actively working on bringing additional bandwidth providers online and implementing redundant links so we can lessen the impact of future events with any one provider.
This is a priority for us; however, we do not have an exact ETA for completion yet, as we depend on ARIN to give us an ASN, IP assignments, etc. This can be time-consuming.
We will provide periodic updates as milestones on this project are met.
Thank you,
The New Relic Team.
Service Problems
1 February 2012 at 9:37 am | Posted in Service Updates
For 35 minutes yesterday (Jan 31), between 4:25 and 5:00 pm PST, we experienced sporadic problems with our service. The problems affected everything: our Real User Monitoring beacon lost some data during that time, and the collection tier handling application agent data was also impacted, although that is likely invisible to our customers because of the buffering that occurs in our agent. The UI was also unavailable off and on for portions of the time period.
A change was made to the production load balancers that caused both of them to use the same IP address. It took a few minutes to isolate the problem, at which point we assumed the change had been mis-applied and that the system had automatically failed over to the standby without the primary fully relinquishing control. Operating under that assumption, we rebooted the primary load balancer, thinking that would fix the problems, but it didn’t. We then completely turned off the primary and the site fully recovered.
The underlying cause was a misunderstanding of how the load balancer configuration was backed up, combined with an attempt to use the backup configuration as a mechanism for pushing changes. We use Git to back up the configuration by pushing changes from the load balancers to the Git repo. Unfortunately, the synchronization between the pair of load balancers also synchronized the “.git” directory, which left both load balancers in the pair pointed at the same repo. When the configuration was pulled and applied, the two ended up with identical configs instead of complementary pair configurations. This is what caused them both to have the same IP address.
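The failure mode described above can be sketched with a simple file-copy synchronization that leaves Git metadata behind. This is only an illustration under assumed conditions: the post does not say how the load balancers actually synchronize or how the issue was ultimately addressed, and the directory and file names here are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def sync_config(src, dst):
    """Copy a configuration tree to a peer, skipping the .git directory."""
    # Excluding .git keeps each peer attached to its own backup repo.
    shutil.copytree(
        src,
        dst,
        ignore=shutil.ignore_patterns(".git"),
        dirs_exist_ok=True,  # requires Python 3.8 or newer
    )


with tempfile.TemporaryDirectory() as tmp:
    primary = Path(tmp, "primary")
    secondary = Path(tmp, "secondary")
    (primary / ".git").mkdir(parents=True)
    (primary / ".git" / "config").write_text("[remote]\n")  # stand-in repo metadata
    (primary / "lb.conf").write_text("pool = web\n")         # stand-in shared config
    sync_config(primary, secondary)
    synced = sorted(p.name for p in secondary.iterdir())

print(synced)
```

Only the shared configuration file reaches the peer. Settings that must differ between the two units, such as each unit's own address, would still need to be kept out of whatever is synchronized.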
We ran the site on the secondary load balancer until later in the evening. We then completely isolated the misconfigured primary from the network, restarted it, and logged in via the console to revert the configuration changes. Once it was back in a working state, we re-synced the configuration from the secondary and failed back to the primary to get the site fully operational and redundant again.