
Performance: Exchange Email and Calendar

Last Updated:
2011-11-10 05:00:00
Event:
2011-11-01 04:00:00
Status:
Closed
Brief Description:
Exchange
User Impact:
N/A
Workaround:
There is no workaround for this issue
Current Status:
N/A
Services Affected:
Full Description:
For the past several days, Cornell's Exchange email and calendar services have had performance issues. Re-establishing stable service levels is CIT's highest priority. Please bear with us as we continue working on the problem.
CIT TDX ID:



Timeline of Changes

Description Current Status Date Time
The immediate issues with Exchange have been resolved. Over the next several weeks, additional changes will be made to increase Exchange's ability to handle normal growth in load over time and load associated with traffic spikes. A notice to all Exchange users will be sent later today. Please report any issues with Exchange email or calendar to the CIT HelpDesk (255-8990), noting your email client and OS, and the location from which you observe the problem. 2011-11-10 05:00:00
After making the recommended changes to the Exchange network configuration, which were complete by 1 AM today, the Exchange team has seen no recurrence of the server errors that indicate this problem. Spot checks with the community have indicated, in general, much improved performance this morning. If you have an open ticket with the CIT Help Desk, please update us with your current status. If you see any renewed or continuing problems, please report those to the Help Desk with details including your client and OS, and the location from which you observe the problem. 2011-11-08 05:00:00
Our assessment of today's experience with the campus Exchange service is that the fixes applied yesterday and early this morning have addressed performance issues seen over the past several days. We have been working on what appear to be pockets of client issues remaining for a limited number of users. We will keep this alert open, however, until more time has elapsed and we can be certain there are no more infrastructure issues remaining. If you have an open ticket with the CIT Help Desk, please update us with your current status. If you see any renewed or continuing problems, please report those to the Help Desk with details including your client and OS, and the location from which you observe the problem. Unfortunately, there was a network outage in the CIT data center this afternoon that impacted Exchange access from about 1:00 to 2:00 PM. During the outage, connections were refused. Some clients required a restart before they were able to connect once the network was restored, so some users may have seen problems after 2:00 PM. 2011-11-08 05:00:00
While the patches that were applied to the Exchange cluster on Friday greatly reduced the rate of errors, it's now apparent that some level of errors still persists. The Exchange team remains engaged with Microsoft to locate the source of these problems. Symptoms include connection timeouts, refused connections, and errors in using OWA. If you receive these errors, please wait a short time and retry the operation. The patch applied on Friday makes recovery from such problems much more rapid than before. 2011-11-07 05:00:00
One of the four Exchange database servers went offline and unmounted the mailbox databases. Exchange staff are working to get the databases back online. This problem does appear to be related to the ongoing issue. Expected time to restore the service is 30 minutes. 2011-11-07 05:00:00
That database server is now online again. The problem started at about 12:30, so service was restored about a half hour after that. 2011-11-07 05:00:00
CIT staff continue to gather log data for Microsoft engineers to identify the source of the problem, which still appears to be in the cluster communications layer. Resolving the issues with Exchange remains the highest priority for both CIT and Microsoft. The main symptoms are sporadic slow or failed logins, failure to send messages, and slow operations (spinning hourglass or beach ball, depending on the client system). These have appeared a number of times throughout the morning, with a larger interruption from noon to 1 PM for users hosted on one of the four mailbox servers. The server became non-responsive and required a reboot. At this point, we have collected the data we need on client problems. If we need additional data to be reported, a request will be posted here. 2011-11-07 05:00:00
CIT and Microsoft experts are still diagnosing the cause of cluster communication failures. They are currently analyzing network traces for further information on anomalies identified in the review of Exchange data. 2011-11-07 05:00:00
CIT has made some changes to network settings on the Exchange cluster at Microsoft's recommendation. We are monitoring the performance to determine the effects of this change. 2011-11-07 05:00:00
Microsoft has recommended that NetDMA be disabled in the Exchange cluster because it is a contributing factor to Cornell's Exchange issues. From 12:00 midnight to 12:15 AM on Tuesday, Nov. 8, CIT will restart the Exchange mailbox servers to disable NetDMA. This work will be done one server at a time. No outage is expected. 2011-11-07 05:00:00
We have received some isolated reports of continued problems following the configuration change this afternoon around 4:00 PM, although we've seen a reduction in server-side errors. Microsoft has recommended an additional change to the server configuration, which we are implementing between 12:00 midnight and 12:15 AM on Tuesday. The change requires rebooting the servers, but we do not anticipate a service disruption. If you experienced problems today described in an earlier update (see list below) and continue to see them Tuesday morning, please report them to us. Known symptoms are: sporadic slow or failed logins, failure to send messages, and slow operations (spinning hourglass or beach ball, depending on the client system). 2011-11-07 05:00:00
Overall, Exchange performance is much improved. However, we are still receiving reports from a subset of users who are having trouble connecting to their accounts. We are working with the Microsoft engineer to diagnose these cases and solve them. 2011-11-04 04:00:00
The root cause of recent Exchange problems has been addressed with hot fixes and a reconfiguration of network traffic accomplished last night. Nonetheless, a subset of campus users experienced problems with the service today related to two events. First, a brief load spike at 9:00 AM this morning resulted in the temporary inability to connect to Exchange for some users. We are still investigating this event. Second, a new problem was introduced with the addition of client access server capacity. These servers were not handling connections properly, so we have eliminated them from the rotation. We have been working directly with the IT staff in the units impacted and believe that removing these servers has resolved those cases. We will continue to monitor reports until we are certain that no access issues remain. 2011-11-04 04:00:00
If people are still seeing problems with their email or calendar, as a first step, they should quit and restart their email client, and give it some time to catch up. In a few cases, it may be necessary to reboot their system. If problems persist, they should contact the CIT HelpDesk with these details: problem description, date and times the problem has occurred, and the operating system and email client being used. Having issues reported is critical. TIME LINE OF ACTIONS TAKEN: Early on, CIT staff identified and eliminated several apparent contributions to the problem, but ultimately came to an impasse. Paradoxically, adding additional resources to the cluster made the problem worse. Wednesday evening, Nov. 2, Microsoft flew in a field engineer. With his help, we first identified a network bottleneck, which reduced but did not eliminate the problem. Digging deeper, a bug was identified in Microsoft's clustering software that caused the cluster to believe that it was in failure mode, and caused the active mailboxes to flip repeatedly between the redundant Exchange systems in Rhodes and CCC. Since this behavior was related to the number of machines in the cluster, we inadvertently worsened the problem by adding capacity. Thursday night, Nov. 3, a patch was applied to the systems, and all the server-side problems were eliminated. Friday morning, Nov. 4, pockets of connectivity problems led to discovering that a few of the ten Client Access Servers were not responding to connections; they were removed from the pool. At this time we believe that we have resolved the problems. 2011-11-04 04:00:00
CIT staff, together with the Microsoft engineer who has been assisting us this week, have applied patches to the cluster service supporting the Exchange system. These patches have eliminated the network errors and subsequent database restarts that have caused the extremely poor performance this week. At this time the Exchange service appears much healthier. Some email programs may have become confused when the Exchange system became unresponsive. If problems persist, we recommend that you quit and restart your email programs, and contact the CIT Help Desk if problems continue after that. 2011-11-04 04:00:00
Working in concert with the Microsoft engineer last evening, we made configuration changes to alleviate Exchange performance issues. Measures included client access network reconfiguration, changes to the replication configuration, and deploying four additional client access servers. While we believe we have determined the root cause of these issues, we will continue to analyze performance data to confirm. 2011-11-03 04:00:00
Between now and approximately 1 PM we will be making configuration changes to the Exchange environment to improve performance. The changes themselves are not expected to impact the user community. However, until these changes are complete we may see events similar to those we've experienced over the past several days that result in access issues for users. Such an event did occur this morning at 10 AM. It affected a significant number of users whose mailboxes live on the affected server. Those users would have experienced performance issues or the momentary inability to connect to their Exchange accounts. We anticipate that very soon after we complete the configuration changes users will see the improvement in service performance. 2011-11-03 04:00:00
We are still working with the Microsoft engineer to accomplish the reconfiguration referenced in the last communication. Although we initially anticipated that work would be completed around 1 PM, we now expect it will take several more hours. We expect these changes will result in a stable service very soon after they are completed, but we will continue to take incremental steps to increase capacity to better accommodate future unplanned events. 2011-11-03 04:00:00
We are still working on reconfiguring the network path for Exchange communications to better distribute the traffic. We have engaged additional Microsoft resources over the phone to expedite resolution of issues we've encountered with this change. 2011-11-03 04:00:00
Exchange mailboxes may be temporarily unavailable due to a cluster communications problem. We expect this condition to last for less than 30 minutes. 2011-11-03 04:00:00
Technical staff working on Exchange performance issues have applied a patch to the server cluster to address a bug that was causing communication failures. This should improve stability and allow the reconfiguration work to proceed. 2011-11-03 04:00:00
CIT is continuing to work on solutions to the Exchange performance issues. Our next step is to address a communications problem between the two halves of the Exchange cluster. We are also working to add another Exchange 2010 server as soon as tonight. In our test environment, we will be assessing a newly released Microsoft patch that contains fixes for some of the problems we have been seeing. 2011-11-02 04:00:00
CIT continues to work on resolving the Exchange performance issues. Additional servers will be added to Exchange tonight (Nov. 2) to spread the load. Problems with the replication service are being investigated, including determining whether a Microsoft patch would resolve them. A Microsoft engineer will be on site tonight (Nov. 2), and CIT will be taking additional measures based on those recommendations. 2011-11-02 04:00:00
We are currently investigating this problem and will notify you with updates on this situation. 2011-11-01 04:00:00
CIT is still receiving reports that some users are unable to access their Exchange email. CIT is continuing to investigate and will provide further updates. 2011-11-01 04:00:00
Exchange Admins are actively working with Microsoft to resolve the problem swiftly. Additional information will be posted as it becomes available. 2011-11-01 04:00:00
CIT understands the importance of email and calendar for your work, and we realize we have fallen short of your expectations. We are working hard to regain those service levels. We have been working with Microsoft and others to understand what is causing these problems. So far the causes have been elusive, appearing at times to be a high CPU load causing poor response time, and at other times seeming to be an intermittent network problem. Several apparent causes have been addressed, including anti-virus updates, network adapter offload settings, power management settings, and the mailbox automounting setting. Please bear with us as we continue working on the problem. 2011-11-01 04:00:00
Exchange performance has been stabilized for the moment. Some Microsoft-recommended changes to the Active Directory Domain Controllers were implemented, as well as monitors that will capture diagnostic information if the problems return tomorrow during periods of high load. We also have a fourth Exchange database server ready to go into production, which will give us 33% more capacity to deal with load issues. A fifth server will be added in another week. These will have a gradual effect as user mailboxes migrate transparently onto them. 2011-11-01 04:00:00
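
The 33% figure in the earliest update follows from going from three database servers to four. A minimal sketch of that arithmetic is below; it assumes every server contributes equal capacity, which the notice does not state, and the function name is hypothetical and used only for illustration.

```python
def capacity_gain_percent(servers_before: int, servers_after: int) -> float:
    """Relative capacity increase, ASSUMING each server adds equal capacity."""
    if servers_before <= 0:
        raise ValueError("servers_before must be positive")
    return (servers_after - servers_before) / servers_before * 100.0


# Three servers -> four servers: the "33% more capacity" in the notice.
fourth_server_gain = capacity_gain_percent(3, 4)

# Adding a fifth server, measured against four and against the original three.
fifth_vs_four = capacity_gain_percent(4, 5)
fifth_vs_three = capacity_gain_percent(3, 5)

print(f"Fourth server: +{fourth_server_gain:.1f}%")
print(f"Fifth server vs. four: +{fifth_vs_four:.1f}%")
print(f"Fifth server vs. original three: +{fifth_vs_three:.1f}%")
```

Under the same assumption, the gain is realized only as mailboxes migrate onto the new servers, which is why the notice describes the effect as gradual.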