Robert Carr
authored
This is a cherry-pick from of a CL from sc-v2 adding a safety mechanism for lost buffers. The original CL probably should have been merged in to master, and I can't relate to my thought process adding DO NOT MERGE on the original CL. Recently we have started seeing some of these ANRs again. Well actually we have seen them continuously, but most have been due to GPU hangs. Recently we started seeing some which aren't due to GPU hangs. We found one trace, which showed the sort of situation described below, which lead me to realizing this code was never merged in to master. It shoud be relatively safe since we released it on sc-v2. Perfetto telemetry shows a cluster where a transactionComplete is arriving but an earlier releaseBufferCallback has never arrived. This leads to blocking and ANRs. We try to work around this by generating a fake release buffer callback for previous buffers when we receive a TransactionCompleteCallback. How can we know this is safe? The criteria to safely release buffers are as follows: Condition 1. SurfaceFlinger does not plan to use the buffer Condition 2. Have the acquire fence (or a later signalling fence)if dropped, have the HWC release fence if presented. Now lets break down cases where the transaction complete callback arrives before the release buffer callback. All release buffer callbacks fall in to two categories: Category 1. In band with the complete callback. In which case the client library in SurfaceComposerClient always sends release callbacks first. Category 2. Out of band, e.g. from setBuffer or merge when dropping buffers In category 2 there are two scenarios: Scenario 1: Release callback was never going to arrive (bug) Scenario 2. Release callback descheduled, e.g. a. Transaction is filled with buffer and held in VRI/WM/SysUI b. A second transaction with buffer is merged in, the release buffer callback is emitted but not scheduled (async binder) c. The merged transaction is applied d. SurfaceFlinger processes a frame and emits the transaction complete callback e. The transaction complete callback is scheduled before the release callback is ever scheduled (since the 1 way binders are from different processes). In both scenarios, we can satisfy Conditions 1 and 2, because the HWC present fence is a later signalling fence than the acquire fence which the out of band callback will pass. While we may block extra on this fence. We will be safe by condition 2. Receiving the transaction complete callback should indicate condition 1 is satisfied. In category 1 there should only be a single scenario, the release buffer callback for frame N-1 is emitted immediately before the transaction callback for frame N, and so if it hasn't come at that time (and isn't out of band/scheduled e.g. category 2) then it will never come. We can further break this down in to two scenarios: 1. stat.previousReleaseFence is set correctly. This is the expected case. In this case, this is the same fence we would receive from the release buffer callback, and so we can satisfy condition 1 and 2 and safely release. 2. stat.previousReleaseFence is not set correctly. There is no known reason for this to happen, but there is also no known reason why we would receive a complete callback without the corresponding release, and so we have to explore the option as a potential risk/behavior change from the change. Hypothetically in category 1 case 2 we could experience some tearing. In category 1 we are currently experiencing ANR, and so the tradeoff is a risk of tearing vs a certainty of ANR. To tear the following would have to happen: 1. Enter the case where we previously ANRed 2. Enter the subcase where the release fence is not set correctly 3. Be scheduled very fast on the client side, in practicality HWC present has already returned and the fence should be firing very soon 4. The previous frame not be GL comp, in which case we were already done with the buffer and don't need the fence anyway. Conditions 2 and 3 should already make the occurence of tearing much lower than the occurence of ANRs. Furthermore 100% of observed ANRs were during GL comp (rotation animation) and so the telemetry indicates the risk of introducing tearing is very low. To review, we have shown that we can break down the side effects from the change in to two buckets (category 1, and category 2). In category 2, the possible negative side effect is to use too late of a fence. However this is still safe, and should be exceedingly rare. In category 1 we have shown that there are two scenarios, and in one of these scenarios we can experience tearing. We have furthermore argued that the risk of tearing should be exceedingly low even in this scenario, while the risk of ANR in this scenario was nearly 100%. Bug: 247246160 Bug: 244218818 Test: Existing tests pass Change-Id: I757ed132ac076aa811df816e04a68f57b38e47e7
Name | Last commit | Last update |
---|