Europe Azure users hit when freak storm took out fiber link

A freak summer storm in the Netherlands is being blamed for causing network issues in Microsoft’s Azure West Europe region last week, according to a preliminary post-incident review by the company.

The weather event, named Storm Poly, has been described as the strongest summer storm in the country’s records. It hit the Netherlands last Wednesday, July 5, with winds of up to 146 kilometers (90 miles) per hour, according to reports, causing at least one death and leaving a trail of damage.

That damage included a fiber optic connection carrying traffic between Microsoft’s cloud datacenters, leading to customers experiencing packet drops, timeouts, and/or increased latency between approximately 07:22 UTC and 16:00 UTC on July 5.

According to the preliminary post-incident review, Azure’s West Europe region is outfitted with four independent fiber paths for traffic flows between datacenters. With one severed, a quarter of the network bandwidth between two campuses of West Europe datacenters was unavailable.

This might not have been too serious an issue, but the links were already running at higher utilization than the design target, the review reports. A capacity upgrade project to address this was already in progress when the incident occurred, Microsoft states.

As a result of the cut fiber, congestion on the remaining links increased to a point where packet drops started to occur. This seems to have impacted network traffic between Availability Zones within the West Europe region itself rather than traffic to and from the region, resulting in degraded performance for Azure services that depend on other local services within the region.
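To see why losing one path out of four matters so much when links are already busy, here is a rough back-of-the-envelope sketch. It assumes traffic is spread evenly across identical links and is redistributed evenly after the cut. The function name and the pre-incident utilization figures are hypothetical, chosen for illustration only, as Microsoft's review does not publish the actual numbers.

```python
def utilization_after_failure(per_link_utilization, total_links, failed_links):
    """Per-link utilization once the same traffic is spread evenly
    over the links that remain (a simplifying assumption)."""
    remaining = total_links - failed_links
    if remaining <= 0:
        raise ValueError("no links left to carry traffic")
    return per_link_utilization * total_links / remaining


# Hypothetical pre-incident utilization levels, not taken from the review.
for before in (0.60, 0.75, 0.80):
    after = utilization_after_failure(before, total_links=4, failed_links=1)
    # At or above 100 percent, the links cannot carry the offered load
    # and packets start to be dropped.
    status = "saturated" if after >= 1.0 else "within capacity"
    print(f"{before:.0%} per link before -> {after:.0%} after ({status})")
```

With four links, a single cut multiplies the load on each survivor by four thirds, so under these assumptions any starting utilization of 75 percent or more leaves the remaining links saturated.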

Microsoft states that its on-call engineers began to investigate immediately. One remedial effort focused on reducing traffic in the region and balancing it across the remaining links, while work with its dark fiber provider in the Netherlands to repair the impacted link started in parallel.

With throttling and migration of internal service traffic away from the region in place, packet drops had reduced significantly by about 14:52 UTC, Microsoft claims, such that by 15:30 UTC many internal and external services showed signs of recovery, and by 16:00 UTC packet drops had returned to pre-incident levels.

The actual physical repairs were hindered by hazardous working conditions due to the ongoing storm, but full restoration was confirmed by 20:50 UTC, according to Microsoft, which declared the incident mitigated by 22:45 UTC.

This information is from the preliminary review that Microsoft said it aims to produce within 72 hours of an incident. A final version will be published once the internal review is completed (generally within 14 days) with additional details.

In response to the incident, the Redmond giant said it brought additional capacity online within 24 hours, and is working on augmenting capacity in the region further.

Customers affected by the incident can provide feedback to Microsoft on its handling of the incident via a survey.

A month earlier, Azure was hit with another outage in Brazil when a simple typo in a routine job led to entire Azure SQL Server instances being deleted instead of old database snapshots. ®
