  1. Sep 28, 2022
      BlastBufferQueue: Fake release if not received by complete · 405e2f68
      Robert Carr authored
      This is a cherry-pick of a CL from sc-v2 that adds a safety mechanism
      for lost buffers. The original CL probably should have been merged into
      master, and I can no longer reconstruct my thought process in adding
      DO NOT MERGE to it. Recently we have started seeing some of these ANRs
      again. In truth we have seen them continuously, but most have been due
      to GPU hangs; recently we started seeing some which aren't. We found one
      trace showing the sort of situation described below, which led me to
      realize this code was never merged into master. It should be relatively
      safe since we already shipped it on sc-v2.
      
      Perfetto telemetry shows a cluster where a transactionComplete callback
      arrives but an earlier releaseBufferCallback never does. This leads to
      blocking and ANRs. We try to work around this by generating a fake
      release buffer callback for previous buffers when we receive a
      TransactionCompleteCallback; a rough sketch of this idea follows the
      conditions below. How can we know this is safe? The criteria to safely
      release buffers are as follows:
      	Condition 1. SurfaceFlinger does not plan to use the buffer.
      	Condition 2. We hold the acquire fence (or a later-signalling fence)
      	if the buffer was dropped, or the HWC release fence if it was
      	presented.
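      As a rough illustration of this workaround (the names below are
      hypothetical stand-ins, not the actual BLASTBufferQueue code): when the
      complete callback for frame N arrives, any older buffer still waiting on
      its release callback is released using stat.previousReleaseFence, which
      satisfies both conditions above.

          // Hypothetical sketch only; not the real BLASTBufferQueue interface.
          #include <cstdint>
          #include <deque>
          #include <memory>

          struct Fence {};                                  // stand-in for android::Fence
          struct TransactionStats {
              uint64_t frameNumber = 0;
              std::shared_ptr<Fence> previousReleaseFence;  // fence for frame N-1
          };
          struct PendingBuffer {
              uint64_t frameNumber = 0;
              bool releaseReceived = false;
          };

          class FakeReleaseSketch {
          public:
              void onTransactionComplete(const TransactionStats& stat) {
                  // Condition 1: frame N has been presented, so SurfaceFlinger no
                  // longer plans to use buffers from frames earlier than N.
                  // Condition 2: previousReleaseFence signals no earlier than the
                  // acquire fence the missing release callback would have carried.
                  while (!mPending.empty() &&
                         mPending.front().frameNumber < stat.frameNumber) {
                      if (!mPending.front().releaseReceived) {
                          releaseToProducer(mPending.front(), stat.previousReleaseFence);
                      }
                      mPending.pop_front();
                  }
              }

              void onReleaseBuffer(uint64_t frameNumber, std::shared_ptr<Fence> fence) {
                  // Normal path: the real release callback arrived first.
                  for (auto& pending : mPending) {
                      if (pending.frameNumber == frameNumber && !pending.releaseReceived) {
                          pending.releaseReceived = true;
                          releaseToProducer(pending, std::move(fence));
                          return;
                      }
                  }
              }

          private:
              void releaseToProducer(const PendingBuffer&, std::shared_ptr<Fence>) {
                  // Hand the buffer and its release fence back to the producer (elided).
              }
              std::deque<PendingBuffer> mPending;  // buffers awaiting release callbacks
          };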
      Now let's break down the cases where the transaction complete callback
      arrives before the release buffer callback. All release buffer callbacks
      fall into two categories:
      	Category 1. In band with the complete callback, in which case the
      	client library in SurfaceComposerClient always sends release
      	callbacks first.
      	Category 2. Out of band, e.g. from setBuffer or merge when dropping
      	buffers.
      
      In category 2 there are two scenarios:
      	Scenario 1: Release callback was never going to arrive (bug)
      	Scenario 2. Release callback descheduled, e.g.
      		a. Transaction is filled with buffer and held in VRI/WM/SysUI
      		b. A second transaction with buffer is merged in, the release
      		buffer callback is emitted but not scheduled (async binder)
      		c. The merged transaction is applied
      		d. SurfaceFlinger processes a frame and emits the transaction
      		complete callback
      		e. The transaction complete callback is scheduled before the
      		release callback is ever scheduled (since the one-way binders
      		come from different processes).
      In both scenarios, we can satisfy Conditions 1 and 2, because the HWC
      present fence is a later-signalling fence than the acquire fence which
      the out-of-band callback would pass. While we may block a little longer
      on this fence, we remain safe by condition 2, and receiving the
      transaction complete callback indicates that condition 1 is satisfied.
      
      In category 1 there should only be a single scenario: the release buffer
      callback for frame N-1 is emitted immediately before the transaction
      callback for frame N, so if it hasn't arrived by that time (and isn't
      out of band/scheduled, i.e. category 2) then it will never come. We can
      further break this down into two scenarios:
      	1. stat.previousReleaseFence is set correctly. This is the
      	expected case. Here it is the same fence we would receive from
      	the release buffer callback, so we can satisfy conditions 1 and 2
      	and safely release.
      	2. stat.previousReleaseFence is not set correctly. There is no
      	known reason for this to happen, but there is also no known
      	reason why we would receive a complete callback without the
      	corresponding release, so we have to treat it as a potential
      	risk/behavior change introduced by this change.
      Hypothetically, in category 1 case 2 we could experience some tearing.
      In category 1 we are currently experiencing ANRs, so the tradeoff is a
      risk of tearing vs a certainty of ANR. To tear, all of the following
      would have to happen:
      	1. Enter the case where we previously ANRed.
      	2. Enter the subcase where the release fence is not set
      	correctly.
      	3. Be scheduled very fast on the client side; in practice HWC
      	present has already returned and the fence should be firing
      	very soon.
      	4. The previous frame must not have been GL comp; if it was, we
      	were already done with the buffer and don't need the fence
      	anyway.
      Points 2 and 3 should already make the occurrence of tearing much lower
      than the occurrence of ANRs. Furthermore, 100% of observed ANRs were
      during GL comp (the rotation animation), so the telemetry indicates the
      risk of introducing tearing is very low.
      
      To review, we have shown that the side effects of the change break down
      into two buckets (category 1 and category 2). In category 2, the possible
      negative side effect is using a later fence than necessary; this is still
      safe, and should be exceedingly rare. In category 1 we have shown that
      there are two scenarios, and in one of them we could experience tearing.
      We have furthermore argued that the risk of tearing should be exceedingly
      low even in that scenario, while the risk of ANR there was nearly 100%.
      
      Bug: 247246160
      Bug: 244218818
      Test: Existing tests pass
      Change-Id: I757ed132ac076aa811df816e04a68f57b38e47e7
  2. Sep 27, 2022
      Fix use-after-free in SurfaceFlinger::doDump · 3b3e5918
      Alec Mouri authored
      SurfaceFlinger::doDump previously contained two layer traversals, one on
      the main thread and one off the main thread. During concurrent dumpsys
      commands, the layer traversals may race with each other, which results
      in shared ownership of the underlying storage of a SortedVector
      containing the z-ordered list of layers. Because the implementation of
      SortedVector's STL iterators assumes that the underlying storage may be
      edited, the storage may be copied whenever SortedVector::begin and
      SortedVector::end are called. As a result, SortedVector::begin()
      + SortedVector::size() == SortedVector::end() is not always true, which
      causes invalid iteration.
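      A self-contained toy that reproduces the broken invariant (CowVector is
      a simplified stand-in for copy-on-write storage, not the real
      android::SortedVector implementation):

          #include <cassert>
          #include <cstddef>
          #include <memory>
          #include <vector>

          // begin()/end() detach shared storage, mimicking iterator accessors
          // that assume the caller may edit the elements.
          template <typename T>
          class CowVector {
          public:
              explicit CowVector(std::vector<T> v)
                  : mStorage(std::make_shared<std::vector<T>>(std::move(v))) {}

              T* begin() { return detach()->data(); }
              T* end()   { auto s = detach(); return s->data() + s->size(); }
              size_t size() const { return mStorage->size(); }

          private:
              std::shared_ptr<std::vector<T>> detach() {
                  if (mStorage.use_count() > 1) {
                      // Shared with another traversal: copy before handing out pointers.
                      mStorage = std::make_shared<std::vector<T>>(*mStorage);
                  }
                  return mStorage;
              }
              std::shared_ptr<std::vector<T>> mStorage;
          };

          int main() {
              CowVector<int> layers({1, 2, 3});
              int* b = layers.begin();             // pointer into allocation A
              CowVector<int> racingDump = layers;  // a concurrent dump shares storage
              int* e = layers.end();               // detaches: pointer into allocation B
              (void)racingDump;
              // Iterating from b until e now walks past the end of allocation A,
              // which is the invalid iteration / use-after-free described above.
              assert(b + layers.size() == e);      // fails under this interleaving
              return 0;
          }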
      
      In general, this use-after-free can happen as long as the off-main-thread
      traversal exists in doDump(), because that traversal can run in parallel
      with any workload on the main thread that executes a layer traversal. So,
      this patch moves the traversal that dumps the list of composition layers
      onto the main thread.
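      The shape of the fix, sketched with a generic main-thread task queue
      (MainThreadExecutor and the doDump signature below are illustrative
      stand-ins, not SurfaceFlinger's actual scheduling machinery):

          #include <functional>
          #include <future>
          #include <mutex>
          #include <queue>
          #include <string>

          class MainThreadExecutor {
          public:
              // Callable from any thread: queue work, return a future for its result.
              std::future<std::string> schedule(std::function<std::string()> work) {
                  std::packaged_task<std::string()> task(std::move(work));
                  std::future<std::string> result = task.get_future();
                  std::lock_guard<std::mutex> lock(mMutex);
                  mTasks.push(std::move(task));
                  return result;
              }

              // Called only on the main thread (e.g. once per frame): run queued work.
              void drain() {
                  std::lock_guard<std::mutex> lock(mMutex);
                  while (!mTasks.empty()) {
                      mTasks.front()();
                      mTasks.pop();
                  }
              }

          private:
              std::mutex mMutex;
              std::queue<std::packaged_task<std::string()>> mTasks;
          };

          // dumpsys path, running off the main thread: no direct layer traversal here.
          std::string doDump(MainThreadExecutor& mainThread) {
              std::future<std::string> layersDump = mainThread.schedule([] {
                  // Walk the z-ordered layer list; safe because only the main
                  // thread ever touches it now.
                  return std::string("...composition layers...");
              });
              return layersDump.get();  // block until the main thread produces the dump
          }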
      
      A future patch could explore either fixing SortedVector's overly eager
      iterator invalidation or building LayerVector on top of std::set, but
      either option is an invasive data structure change.
      
      Bug: 237291506
      Test: Test script that calls dumpsys SurfaceFlinger on many threads
      Change-Id: I0748396519c924dc1b84113d44259f22d0d7ebd6
      (cherry picked from commit 6761733a)
      Merged-In: I0748396519c924dc1b84113d44259f22d0d7ebd6