multithreaded rendering

Yale Zhang
In my quest for a more fluid experience with fewer distractions, I've
attempted to multithread the rendering. I see this has been a long
standing discussion:

https://bugs.launchpad.net/inkscape/+bug/200415
https://bugs.launchpad.net/inkscape/+bug/330271

I'm using a Lenovo P40 tablet and the total frame rendering time for a
simple piece is 80 to 100ms (1920x1080, a few hundred vertices on 2
layers with no filter effects - only alpha compositing). This slow
rendering speed makes the touchscreen zooming I recently implemented
very jerky.

So, I tried multithreading SPCanvas::paintRectInternal() with OpenMP
by splitting the rectangle in two and rendering the halves in parallel.
I used mutual exclusion for some obviously thread-unsafe code like the
call to markRect in SPCanvas::paintSingleBuffer(). The rendering would
work for a few frames before freezing (the waiting threads time out and
then exit).

Then, I put other calls that I suspected were thread-unsafe in
mutually exclusive blocks until I discovered that _root->render() isn't
thread-safe either. Serializing the call that does the actual drawing
would leave nothing to run in parallel, so there was no point in going
further.
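
The splitting approach described above can be sketched roughly as follows. This is a minimal illustration, not the actual SPCanvas code: Rect, split_into_bands, paint_rect_parallel and the two callbacks are hypothetical names, and the sketch assumes the per-band drawing callback is itself thread-safe (which is exactly what turned out not to hold for _root->render()). Shared bookkeeping is serialized with an OpenMP critical section. Without OpenMP enabled the pragmas are ignored and the loop runs serially.

```cpp
#include <cassert>
#include <vector>

// Hypothetical integer rectangle covering [x0, x1) x [y0, y1).
struct Rect {
    int x0, y0, x1, y1;
    int width() const { return x1 - x0; }
    int height() const { return y1 - y0; }
};

// Split `r` into `n` horizontal bands of near-equal height.
std::vector<Rect> split_into_bands(const Rect& r, int n) {
    assert(n > 0);
    std::vector<Rect> bands;
    for (int i = 0; i < n; ++i) {
        int top = r.y0 + static_cast<int>(static_cast<long long>(r.height()) * i / n);
        int bottom = r.y0 + static_cast<int>(static_cast<long long>(r.height()) * (i + 1) / n);
        if (bottom > top) {
            bands.push_back({r.x0, top, r.x1, bottom});
        }
    }
    return bands;
}

// Render each band on its own thread. `render_band` stands in for the
// drawing call and must be thread-safe; `mark_dirty` stands in for shared
// bookkeeping and is serialized in a named critical section.
template <typename RenderFn, typename MarkFn>
void paint_rect_parallel(const Rect& r, int n, RenderFn render_band, MarkFn mark_dirty) {
    const std::vector<Rect> bands = split_into_bands(r, n);
    const int band_count = static_cast<int>(bands.size());
    #pragma omp parallel for
    for (int i = 0; i < band_count; ++i) {
        render_band(bands[i]);
        #pragma omp critical(dirty_bookkeeping)
        {
            mark_dirty(bands[i]);
        }
    }
}
```

A caller would pass the real drawing routine as render_band. Anything that routine touches without synchronization is a potential data race, which is why a critical section around only the obvious shared state is not enough.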

Excuse my naive attempt. Can anyone guess how feasible it is to
multithread the rendering? For now, I don't care if it's pixel
perfect. I just need something that's decent and doesn't crash/freeze.

I'm also wondering why the Cairo OpenGL backend isn't being used? GPU
rendering on integrated GPUs should give a nice speedup since there
should be no copying overhead.

The other place to optimize is pixman. I did some profiling (rapidly
zooming in and out with the touchscreen) and >= 25% of the time is spent
in pixman rendering. I already went ahead and ported a few of its
functions to AVX2 and got a ~1.3x speedup (a machine with more memory
bandwidth should gain more, since my laptop has only 1 memory channel
and is bottlenecked by memory bandwidth).


Samples  Module                Function
   4221  libpixman-1-0.dll     sse2_blt.part.0
   2189  libpixman-1-0.dll     sse2_combine_in_u
   1693  libpixman-1-0.dll     sse2_fill
   1494  libcairo-2.dll        cairo_tor_scan_converter_generate
   1424  libpixman-1-0.dll     sse2_composite_over_8888_8888
   1104  libpixman-1-0.dll     bits_image_fetch_separable_convolution_affine_none_a8r8g8b8
    611  libinkscape_base.dll  feed_curve_to_cairo(_cairo*Geom::Curve const&
    475  libpixman-1-0.dll     fast_composite_scaled_bilinear_sse2_8888_8888_cover_SRC
    348  libcairo-2.dll        fill_xrgb32_lerp_opaque_spans
    260  libcairo-2.dll        cairo_tor_scan_converter_add_polygon
    238  libcairo-2.dll        compute_face
    209  libstdc++-6.dll       _dynamic_cast
    179  libcairo-2.dll        outer_join
    178  libcairo-2.dll        cairo_polygon_add_edge
    169  libglib-2.0-0.dll     g_hash_table_lookup
    153  libcairo-2.dll        cairo_spline_decompose_into
    138  libglib-2.0-0.dll     g_slice_alloc
    131  libcairo-2.dll        cairo_spline_intersects
    127  libinkscape_base.dll  feed_pathvector_to_cairo(_cairo*Geom::PathVector const&)
    119  libcairo-2.dll        line_to
    116  libinkscape_base.dll  void std::vector<Geom::Pointstd::allocator<Geom::Point>
    110  libcairo-2.dll        cairo_matrix_transform_point
    106  libcairo-2.dll        cell_list_render_edge
    106  libgobject-2.0-0.dll  g_type_check_instance_is_a
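
Several of the top entries are pixman compositing routines. As a rough guide to what a routine like sse2_composite_over_8888_8888 computes per pixel, here is a scalar sketch of the Porter-Duff OVER operator on premultiplied 32-bit ARGB values. It is an illustration only, not pixman's code: the function names are made up, and the rounding helper is one common way to approximate division by 255.

```cpp
#include <cassert>
#include <cstdint>

// Rounded (a * b) / 255 for 8-bit channel values (hypothetical helper).
static inline std::uint8_t mul_div_255(std::uint8_t a, std::uint8_t b) {
    std::uint32_t t = static_cast<std::uint32_t>(a) * b + 0x80;
    return static_cast<std::uint8_t>(((t >> 8) + t) >> 8);
}

// OVER for premultiplied ARGB32: result = src + dst * (1 - src_alpha),
// applied to each of the four 8-bit channels.
std::uint32_t over_pixel(std::uint32_t src, std::uint32_t dst) {
    const std::uint8_t inv_alpha = static_cast<std::uint8_t>(255 - (src >> 24));
    std::uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const std::uint8_t s = static_cast<std::uint8_t>((src >> shift) & 0xff);
        const std::uint8_t d = static_cast<std::uint8_t>((dst >> shift) & 0xff);
        std::uint32_t c = s + mul_div_255(d, inv_alpha);
        if (c > 255) {
            c = 255;  // clamp in case the input was not properly premultiplied
        }
        out |= c << shift;
    }
    return out;
}
```

Each output pixel needs one read of the source, one read of the destination and one write, with very little arithmetic in between, which is why routines of this kind tend to be limited by memory bandwidth rather than by SIMD width.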

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Inkscape-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/inkscape-devel

Re: multithreaded rendering

Bryce Harrington
On Sat, Dec 31, 2016 at 06:21:40PM -0500, Yale Zhang wrote:

> Excuse my naive attempt. Can anyone guess how feasible it is to
> multithread the rendering? For now, I don't care if it's pixel
> perfect. I just need something that's decent and doesn't crash/freeze.

Yes, various people have looked at multi-threading before, but have not
found a feasible way to attack it.
 
> I'm also wondering why the Cairo OpenGL backend isn't being used? GPU
> rendering on integrated GPUs should give a nice speedup since there
> should be no copying overhead.

On Linux, the cairo library is typically shipped with its GL backend
disabled, so that presents a logistical roadblock that would need to be
solved.  Also, while you're right that in theory it should provide a
performance boost, that's not guaranteed.  OpenGL has been experimental in
Cairo and is not as thoroughly tested as the X and other backends, so there
may well be corner cases where performance is poorer.  There's no way to
know for certain except to hook it up and try it out.  A number of us
have had this task on our todo lists, but I don't think anyone has taken a
solid shot at it yet.

> The other place to optimize is pixman. I did some profiling (rapidly
> zooming in and out with touchscreen) and >= 25% of the time is spent
> in pixman rendering. I already went ahead and ported a few to AVX2 and
> got ~1.3x speedup (should get more since my laptop is bottlenecked by
> memory bandwidth owing to having only 1 memory channel).

Since pixman is low level and widely used, optimizations would be very
interesting.  I don't know how widespread AVX2 is, or if the 1.3x
improvement is a large enough benefit to warrant considering it for
Pixman, though.  Regardless, I'd be interested in learning more about your
work along these paths.  Perhaps you'll discover something worth
including in upstream codebases?

Thanks,
Bryce


Re: multithreaded rendering

Yale Zhang
Thanks for the encouragement. It's both encouraging and intimidating
to hear that multithreaded rendering hasn't been seriously attempted
before.

I have an ace up my sleeve: GCC/LLVM's thread sanitizer. It works like
Valgrind/AddressSanitizer but reports race conditions instead of
buffer overflows. I figured out how to use it and iteratively
eliminated all the fatal race conditions one by one. Patch is attached
if anyone wants to try.
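
For readers who have not used ThreadSanitizer: it is enabled by building with -fsanitize=thread on GCC or Clang, and it reports unsynchronized accesses such as the one shown in the comment below. The sketch is a generic illustration and is not taken from the attached patch; DirtyStats and paint_tiles are hypothetical names. It uses OpenMP to match the earlier experiment, and reports that originate inside an uninstrumented OpenMP runtime may need extra care to interpret.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for renderer state shared by all tiles.
struct DirtyStats {
    long tiles_painted = 0;
};

// Racy variant (shown only as a comment): every thread performs an
// unsynchronized read-modify-write on the same field, which is the kind
// of access a race detector flags.
//
//     #pragma omp parallel for
//     for (long i = 0; i < n; ++i) { stats.tiles_painted += 1; }

// Fixed variant: each thread accumulates into a private copy that OpenMP
// combines after the loop, and the shared field is updated once, serially.
long paint_tiles(std::size_t tile_count, DirtyStats& stats) {
    const long n = static_cast<long>(tile_count);
    long painted = 0;
    #pragma omp parallel for reduction(+ : painted)
    for (long i = 0; i < n; ++i) {
        painted += 1;  // stands in for painting tile i
    }
    stats.tiles_painted += painted;  // single-threaded after the join
    return stats.tiles_painted;
}
```

The same idea applies to the renderer: either give each thread its own copy of the state, or funnel updates to shared state through one synchronized point.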

Here are the timings I got on a complex scene with lots of filters
(unexpected_visitor.svg from the discussion on my vectorized Gaussian
blur):

1 thread: 4.2 s
8 threads (hyperthreaded): 1.1 s
1 thread (8 for filters): 2.6 s

CPU: Intel 4770 @3.4 GHz
memory:  2 channels DDR3 @ 1866 MHz
OS: Windows 10

So a ~2.4x speedup (2.6 s / 1.1 s) over the current implementation. Not
bad, but on another synthetic scene (3 heavily blurred boxes),
multithreading is actually > 2x slower than a single thread! This was
very puzzling, but I finally figured it out: it's doing almost 4x more
work. When rendering a filtered object, all objects behind it have to
be rendered immediately, and that intermediate rendering can cover a
larger area than the rendered region itself, since filters can access
neighboring pixels. For the blur filter, the expanded region was way
bigger than the rectangle each thread was rendering to! This also
explains why the current renderer without multithreading is very slow
when zooming into a heavily blurred region: the rendering is done in
blocks to improve responsiveness, but the block size (64k pixels) is
too small. Please consider increasing this. Maybe make each block 1/8
of the window height?

I hope this isn't a fundamental problem. Any thoughts on how this
might be improved?
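
To make the extra work concrete, here is a small back-of-the-envelope model. It is only a sketch under stated assumptions: the region is cut into equal horizontal bands, a filter needs `margin` extra pixels on every side of whatever is being rendered, and clipping to the object or document bounds is ignored. The function names and the example margin of 200 pixels are made up for illustration and are not measurements from the scenes above.

```cpp
#include <cassert>

// Pixels that must be rendered for one tile when a filter reads `margin`
// pixels beyond every edge of the tile (hypothetical helper).
long long expanded_area(int width, int height, int margin) {
    assert(width > 0 && height > 0 && margin >= 0);
    return static_cast<long long>(width + 2 * margin) * (height + 2 * margin);
}

// Intermediate pixels rendered when a width x height region is cut into
// `bands` horizontal strips, relative to rendering the region in one piece.
double work_ratio(int width, int height, int bands, int margin) {
    assert(bands > 0 && height % bands == 0);
    const long long split = bands * expanded_area(width, height / bands, margin);
    const long long whole = expanded_area(width, height, margin);
    return static_cast<double>(split) / static_cast<double>(whole);
}
```

For a 1920x1080 region cut into 8 bands with a 200-pixel margin, this model gives about 2.9x as many intermediate pixels as rendering in one piece, and the ratio grows as the bands get thinner or the margin gets larger.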

The other, safer approach is to multithread pixman, but that also has
lots of challenges and probably won't be as fast for most scenes:
- lots of small functions to optimize
- some functions like Cairo's scanline rendering
  (cairo_tor_scan_converter_generate()) are probably difficult or
  too small to be multithreaded
- needs more forks/joins (fine-grained parallelism). This will be bad on
  Windows, where the pthread wake-up latency is 7 times longer than on
  Linux.


"I don't know how widespread AVX2 is, or if the 1.3x improvement is a
large enough benefit to warrant considering it for [pixman]"

AVX2 has been on Intel Core processors since Haswell (summer 2013). I
measured the speedups again on my desktop, which has > 2x the memory
bandwidth of my laptop, and they're still quite weak. It must be that
functions like blits (2D memcpy), fills, composite_in, and
composite_out are all bandwidth limited, so wider SIMD isn't much
help. You might say this bottleneck contradicts the multithreading
speedups reported above, but keep in mind that 1 core alone can't use up
all the memory bandwidth.

I'll have a discussion with the pixman developers to see what they
think. On a related note, I've also submitted a patch for Windows
touchscreen support in GTK:
https://bugzilla.gnome.org/show_bug.cgi?id=776568

-Yale

Attachment: multithreaded_rendering.diff (28K)

Re: multithreaded rendering

Tavmjong Bah

Hi Yale,

This all looks quite interesting. I've CC'd our resident rendering
expert, Krzysztof, who can probably give you the best feedback.

I've wondered for some time how useful splitting the screen up into
tiles really is. Computers have become much faster since that code was
written. And you're right, it certainly is a big slowdown when you zoom
in close to an object that has a filter, as one still must calculate
quite a large area to handle filters that use a large displacement or a
large blur radius... and now you have to recalculate it multiple times.

I don't have a good feel for what a 64k block size really means vs. 1/8
of a screen. What block size would that require (or conversely, how big
an area does a 64k block size correspond to)?

Tav



Re: multithreaded rendering

Yale Zhang
Right, I know Krzysztof ported the renderer to Cairo. Also, thanks to
the ThreadSanitizer developers for such a useful tool.

The small block problem is that in SPCanvas::paintRectInternal(), if
the dirty rectangle is bigger than 64k pixels, it gets recursively
split. From what I've seen, the split always seems to be in the
vertical dimension. For a window width of 1500, the height would be at
most 43 pixels (65536 / 1500), which would be very inefficient for
large blurs.

For my multithreaded testing, I increased the threshold so that it
never splits. The threshold is set here:

        // use 256K as a compromise to not slow down gradients
        // 256K is the cached buffer and we need 4 channels
        setup.max_pixels = 65536; // 256K/4

Also, in my benchmarks, I disabled the rendering cache.
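
The band heights that result from the 64k-pixel threshold can be estimated with a simplified model like the one below. It assumes, as observed above, that every split halves the height, and it ignores any other conditions the real paintRectInternal() applies; the function names are hypothetical.

```cpp
#include <cassert>

// Simplified model: keep halving the height until the band holds at most
// `max_pixels` pixels. Assumes every split is a horizontal cut.
int band_height_after_splits(int width, int height, long long max_pixels) {
    assert(width > 0 && height > 0 && max_pixels >= width);
    while (static_cast<long long>(width) * height > max_pixels) {
        height = (height + 1) / 2;  // take the larger half after a split
    }
    return height;
}

// Tallest band of the given width that still fits within the pixel budget.
int max_band_height(int width, long long max_pixels) {
    assert(width > 0 && max_pixels > 0);
    return static_cast<int>(max_pixels / width);
}
```

With a width of 1500 and the 65536-pixel budget, no band can be taller than 43 rows, and repeatedly halving a 1000-row area in this model ends at 32 rows per band.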




On Mon, Jan 9, 2017 at 5:59 AM, Tavmjong Bah <[hidden email]> wrote:

>
> Hi Yale,
>
> This all looks quite interesting. I've CC'd our resident rendering
> expert, Krzysztof, who can probably give you the best feedback.
>
> I've wondered for some time on how useful splitting the screen up into
> tiles really is. Computers have become much faster than they were since
> that code was written. And your right, it certainly is a big slow down
> when you zoom in close to an object that has a filter as one still must
> calculate a quite large area to handle filters that use a large
> displacement or a large blur radius... and now you have to recalculate
> it multiple times.
>
> I don't have a good feel for what 64k block size really means vs. 1/8
> of a screen. What block size would that require (or conversely, how big
> of an area does a 64k block size correspond to)?
>
> Tav
>
>
> On Sun, 2017-01-08 at 19:38 -0800, Yale Zhang wrote:
>> Thanks for the encouragement. It's both encouraging and intimidating
>> to hear that multithreaded rendering hasn't been seriously attempted
>> before.
>>
>> I have an ace up my sleeve: GCC/LLVM's thread sanitizer. It works
>> like
>> Valgrind/AddressSanitizer but reports race conditions instead of
>> buffer overflows. I figured out how to use it and iteratively
>> eliminated all the fatal race conditions one by one. Patch is
>> attached
>> if anyone wants to try.
>>
>> Here're the speedups I got on a complex scene with lots of filters
>> (unexpected_visitor.svg from the discussion on my vectorized gaussian
>> blur)
>>
>> 1 thread: 4.2s
>> 8 (hyperthread): 1.1
>> 1 thread (8 for filters): 2.6
>>
>> CPU: Intel 4770 @3.4 GHz
>> memory:  2 channels DDR3 @ 1866 MHz
>> OS: Windows 10
>>
>> So a ~2.4x speedup over the current implementation. Not bad, but on
>> another synthetic scene (3 heavily blurred boxes), multithreading is
>> actually > 2x slower than single thread !  This was very puzzling,
>> but
>> I finally figured it out. It's because it's doing almost 4x more
>> work.
>> When rendering a filtered object, all objects behind that one have to
>> be rendered immediately and that intermediate rendering can have a
>> larger area than the rendered region itself since filters can access
>> neighboring pixels.  For the blur filter, the expanded region was way
>> bigger than the rectangle each thread was rendering to! It also
>> explains why the current renderer without multithreading is very slow
>> when zooming into a heavily blurred region. It's because the
>> rendering
>> is done in blocks to improve responsiveness. But the block size (64k)
>> is too small. Please consider increasing this - make it 1/8 of the
>> window height?
>>
>> I hope this isn't a fundamental problem. Any thoughts on how this
>> might be improved?
>>
>> The other, safer approach is to multithread pixman, but that also has
>> lots of challenges and probably won't be as fast for most scenes:
>> - lots of small functions to optimize
>> - some functions like Cairo's scanline rendering
>>   (cairo_tor_scan_converter_generate()) are probably too difficult or
>>   too small to be multithreaded
>> - needs more forks/joins (fine-grained parallelism). This will be bad
>>   on Windows, where the pthread wake-up latency is 7 times longer than
>>   on Linux.
>>
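
To make the fork/join point in the list above concrete, here is a rough sketch of what multithreading a single low-level operation looks like. It is not pixman code: parallel_fill and its parameters are hypothetical, and it uses std::thread only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch, not pixman code: fill a width x height buffer of
// 32-bit pixels by giving each thread a contiguous band of rows.
// Every call pays for one fork (thread start) and one join, so the
// overhead is incurred once per operation, however small the fill is.
void parallel_fill(std::vector<std::uint32_t>& pixels, int width, int height,
                   std::uint32_t color, int num_threads) {
    assert(width > 0 && height > 0 && num_threads > 0);
    assert(pixels.size() == static_cast<std::size_t>(width) * height);

    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        // Rows [begin, end) for thread t; band heights differ by at most one.
        int begin = static_cast<int>(static_cast<long long>(height) * t / num_threads);
        int end = static_cast<int>(static_cast<long long>(height) * (t + 1) / num_threads);
        workers.emplace_back([&pixels, width, color, begin, end] {
            for (int y = begin; y < end; ++y) {
                for (int x = 0; x < width; ++x) {
                    pixels[static_cast<std::size_t>(y) * width + x] = color;
                }
            }
        });
    }
    // Join: wait until every band has been written.
    for (auto& worker : workers) {
        worker.join();
    }
}
```

The bands never overlap, so no locking is needed, but a renderer that issues many small fills and composites per frame would repeat this fork/join for each of them. A persistent thread pool would reduce the cost of the fork, though the wake-up latency mentioned above would still apply.
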
>>
>> "I don't know how widespread AVX2 is, or if the 1.3x improvement is a
>> large enough benefit to warrant considering it for [pixman]"
>>
>> AVX2 has been on Intel Core processors since Haswell (summer 2013). I
>> measured the speedups again on my desktop, which has more than 2x the
>> memory bandwidth of my laptop, and they're still quite weak. It must
>> be that functions like blits (2D memcpy), fills, composite_in and
>> composite_out are all bandwidth limited, so wider SIMD isn't much
>> help. You might say this bottleneck would contradict the reported
>> speedups above, but keep in mind that 1 core alone can't fully use up
>> all the memory bandwidth.
>>
>> I'll have a discussion with the pixman developers to see what they
>> think. On a related note, I've also submitted a patch for Windows
>> touchscreen support in GTK:
>> https://bugzilla.gnome.org/show_bug.cgi?id=776568
>>
>> -Yale
>>
>> On Sun, Jan 1, 2017 at 1:42 AM, Bryce Harrington
>> <[hidden email]> wrote:
>> > On Sat, Dec 31, 2016 at 06:21:40PM -0500, Yale Zhang wrote:
>> > > In my quest for a more fluid experience with fewer distractions, I've
>> > > attempted to multithread the rendering. I see this has been a long
>> > > standing discussion:
>> > >
>> > > https://bugs.launchpad.net/inkscape/+bug/200415
>> > > https://bugs.launchpad.net/inkscape/+bug/330271
>> > >
>> > > I'm using a Lenovo P40 tablet and the total frame rendering time for a
>> > > simple piece is 80 to 100ms (1920x1080, a few hundred vertices on 2
>> > > layers with no filter effects - only alpha compositing). This slow
>> > > rendering speed makes the touchscreen zooming I recently implemented
>> > > very jerky.
>> > >
>> > > So, I tried multithreading SPCanvas::paintRectInternal() with OpenMP
>> > > by splitting the rectangle into 2 and rendering them in ||.
>> > > I used mutual exclusion for some obviously thread unsafe code like the
>> > > call to markRect in SPCanvas::paintSingleBuffer(). The rendering would
>> > > work for a few frames before it freezes (waiting threads timeout and
>> > > then exit).
>> > >
>> > > Then, I put other calls that I suspected were thread unsafe in
>> > > mutually exclusive blocks until I discovered _root->render() isn't
>> > > safe. No point in going further.
>> > >
>> > > Excuse my naive attempt. Can anyone guess how feasible it is to
>> > > multithread the rendering? For now, I don't care if it's pixel
>> > > perfect. I just need something that's decent and doesn't crash/freeze.
>> >
>> > Yes, various people have looked at multi-threading before, but not
>> > found a feasible way to attack it.
>> >
>> > > I'm also wondering why the Cairo OpenGL backend isn't being used? GPU
>> > > rendering on integrated GPUs should give a nice speedup since there
>> > > should be no copying overhead.
>> >
>> > On Linux, the cairo library is typically shipped with its GL backend
>> > disabled, so that presents sort of a logistical roadblock that'd need
>> > to be solved.  Also, while theoretically you're right that it should
>> > provide a performance boost, it's not guaranteed.  OpenGL has been
>> > experimental in Cairo and not as thoroughly tested as the X and other
>> > backends, so there may well be corner cases where performance is
>> > poorer.  But there's no way to know for certain except to hook it up
>> > and try it out.  A number of us have had this task on our todo list
>> > but I don't think anyone's taken a solid shot at it yet.
>> >
>> > > The other place to optimize is pixman. I did some profiling (rapidly
>> > > zooming in and out with touchscreen) and >= 25% of the time is spent
>> > > in pixman rendering. I already went ahead and ported a few to AVX2 and
>> > > got ~1.3x speedup (should get more since my laptop is bottlenecked by
>> > > memory bandwidth owing to having only 1 memory channel).
>> >
>> > Since pixman is low level and widely used, optimizations would be very
>> > interesting.  I don't know how widespread AVX2 is, or if the 1.3x
>> > improvement is a large enough benefit to warrant considering it for
>> > Pixman, though.  Regardless, I'd be interested in learning more of your
>> > work along these paths.  Perhaps you'll discover something worth
>> > inclusion in upstream codebases?
>> >
>> > Thanks,
>> > Bryce
>> >
>> > >
>> > > Function                                                    Module                Samples
>> > > sse2_blt.part.0                                             libpixman-1-0.dll     4221
>> > > sse2_combine_in_u                                           libpixman-1-0.dll     2189
>> > > sse2_fill                                                   libpixman-1-0.dll     1693
>> > > cairo_tor_scan_converter_generate                           libcairo-2.dll        1494
>> > > sse2_composite_over_8888_8888                               libpixman-1-0.dll     1424
>> > > bits_image_fetch_separable_convolution_affine_none_a8r8g8b8 libpixman-1-0.dll     1104
>> > > feed_curve_to_cairo(_cairo*Geom::Curve const&               libinkscape_base.dll  611
>> > > fast_composite_scaled_bilinear_sse2_8888_8888_cover_SRC     libpixman-1-0.dll     475
>> > > fill_xrgb32_lerp_opaque_spans                               libcairo-2.dll        348
>> > > cairo_tor_scan_converter_add_polygon                        libcairo-2.dll        260
>> > > compute_face                                                libcairo-2.dll        238
>> > > _dynamic_cast                                               libstdc++-6.dll       209
>> > > outer_join                                                  libcairo-2.dll        179
>> > > cairo_polygon_add_edge                                      libcairo-2.dll        178
>> > > g_hash_table_lookup                                         libglib-2.0-0.dll     169
>> > > cairo_spline_decompose_into                                 libcairo-2.dll        153
>> > > g_slice_alloc                                               libglib-2.0-0.dll     138
>> > > cairo_spline_intersects                                     libcairo-2.dll        131
>> > > feed_pathvector_to_cairo(_cairo*Geom::PathVector const&)    libinkscape_base.dll  127
>> > > line_to                                                     libcairo-2.dll        119
>> > > void std::vector<Geom::Pointstd::allocator<Geom::Point>     libinkscape_base.dll  116
>> > > cairo_matrix_transform_point                                libcairo-2.dll        110
>> > > cell_list_render_edge                                       libcairo-2.dll        106
>> > > g_type_check_instance_is_a                                  libgobject-2.0-0.dll  106
>> > >
>>

_______________________________________________
Inkscape-devel mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/inkscape-devel