I started publishing during my master's studies in 2009.
Most of my publications are on GPU algorithms, GPU work scheduling, and related topics, with a focus on rendering.
However, I also publish in the fields of visualization, information visualization, human-computer interaction, and augmented reality.
For a current list of citations, see my Google Scholar page.
Below you will find my self-maintained list of publications, with additional information and material about each.
If you have any questions about individual publications, do not hesitate to contact me.
C38
Mathias Parger, Chengcheng Tang, Thomas Neff, Christopher D. Twigg, Cem Keskin, Robert Wang, Markus Steinberger:
MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions
Abstract: Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions – without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos.
Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV'23), 2023
@inproceedings{parger2023motiondeltacnn,
title={MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions},
author={Parger, Mathias and Tang, Chengcheng and Neff, Thomas and Twigg, Christopher D and Keskin, Cem and Wang, Robert and Steinberger, Markus},
booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
pages={17292--17301},
year={2023}
}
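To make the delta idea concrete, here is a minimal numpy sketch of my own (not the paper's CUDA implementation): only pixels whose change against the previous frame exceeds a threshold are recomputed; everything else is reused from a cached output. The real systems propagate sparse deltas through all network layers, and MotionDeltaCNN additionally aligns frames in spherical buffers before differencing; `delta_step` and `box3` are hypothetical stand-ins.

```python
import numpy as np

def delta_step(prev_frame, cur_frame, cached_out, shade, threshold=0.05):
    """Recompute only pixels with significant change; reuse the rest.
    `shade` is a hypothetical per-pixel stand-in for a conv layer."""
    delta = np.abs(cur_frame - prev_frame).max(axis=-1)  # per-pixel change
    dirty = delta > threshold                            # significance mask
    out = cached_out.copy()
    for y, x in zip(*np.nonzero(dirty)):                 # sparse update
        out[y, x] = shade(cur_frame, y, x)
    return out, dirty

def box3(frame, y, x):
    """Toy 3x3 box filter standing in for a convolution at one pixel."""
    h, w = frame.shape[:2]
    return frame[max(0, y-1):min(h, y+2), max(0, x-1):min(w, x+2)].mean(axis=(0, 1))
```

With a mostly static view, `dirty` typically covers only a small fraction of pixels, which is where the reported speed-ups come from.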
Alexander Weinrauch, Wolfgang Tatzgern, Pascal Stadlbauer, Alexis Crickx, Jozef Hladky, Arno Coomans, Martin Winter, Joerg H. Mueller, Markus Steinberger:
Effect-Based Multi-Viewer Caching for Cloud-Native Rendering
Abstract: With cloud computing becoming ubiquitous, it appears as virtually everything can be offered as-a-service. However, real-time rendering in the cloud forms a notable exception, where the cloud adoption stops at running individual game instances in compute centers. In this paper, we explore whether a cloud-native rendering architecture is viable and scales to multi-client rendering scenarios. To this end, we propose world-space and on-surface caches to share rendering computations among viewers placed in the same virtual world. We discuss how caches can be utilized on an effect-basis and demonstrate that a large amount of computations can be saved as the number of viewers in a scene increases. Caches can easily be set up for various effects, including ambient occlusion, direct illumination, and diffuse global illumination. Our results underline that the image quality using cached rendering is on par with screen-space rendering and due to its simplicity and inherent coherence, cached rendering may even have advantages in single viewer setups. Analyzing the runtime and communication costs, we show that cached rendering is already viable in multi-GPU systems. Building on top of our research, cloud-native rendering may be just around the corner.
ACM Transactions on Graphics (SIGGRAPH '23), 2023
@article{10.1145/3592431,
author = {Weinrauch, Alexander and Tatzgern, Wolfgang and Stadlbauer, Pascal and Crickx, Alexis and Hladky, Jozef and Coomans, Arno and Winter, Martin and Mueller, Joerg H. and Steinberger, Markus},
title = {Effect-Based Multi-Viewer Caching for Cloud-Native Rendering},
year = {2023},
issue_date = {August 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {42},
number = {4},
issn = {0730-0301},
url = {https://doi.org/10.1145/3592431},
doi = {10.1145/3592431},
abstract = {With cloud computing becoming ubiquitous, it appears as virtually everything can be offered as-a-service. However, real-time rendering in the cloud forms a notable exception, where the cloud adoption stops at running individual game instances in compute centers. In this paper, we explore whether a cloud-native rendering architecture is viable and scales to multi-client rendering scenarios. To this end, we propose world-space and on-surface caches to share rendering computations among viewers placed in the same virtual world. We discuss how caches can be utilized on an effect-basis and demonstrate that a large amount of computations can be saved as the number of viewers in a scene increases. Caches can easily be set up for various effects, including ambient occlusion, direct illumination, and diffuse global illumination. Our results underline that the image quality using cached rendering is on par with screen-space rendering and due to its simplicity and inherent coherence, cached rendering may even have advantages in single viewer setups. Analyzing the runtime and communication costs, we show that cached rendering is already viable in multi-GPU systems. Building on top of our research, cloud-native rendering may be just around the corner.},
journal = {ACM Trans. Graph.},
month = {jul},
articleno = {87},
numpages = {16},
keywords = {cloud computing, scalability, real-time rendering, ray tracing, distributed rendering}
}
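The sharing mechanism can be pictured as a world-space cache keyed by surface location: each cache entry is computed at most once per frame, no matter how many viewers request it. A hypothetical Python sketch (structure and names are mine, not the paper's):

```python
class SurfaceCache:
    """World-space shading cache: each surface texel is shaded at most once
    per frame, no matter how many viewers look at it."""
    def __init__(self):
        self.entries = {}                     # texel id -> (frame, radiance)

    def shade(self, texel, frame, compute_radiance):
        hit = self.entries.get(texel)
        if hit is not None and hit[0] == frame:
            return hit[1]                     # another viewer already paid
        radiance = compute_radiance(texel)    # e.g. AO or diffuse GI
        self.entries[texel] = (frame, radiance)
        return radiance
```

Shading cost then scales with the number of unique visible texels rather than with the number of viewers, which matches the paper's observation that savings grow as more viewers share a scene.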
Philip Voglreiter, Bernhard Kerbl, Alexander Weinrauch, Joerg Hermann Mueller, Thomas Neff, Markus Steinberger, Dieter Schmalstieg:
Trim Regions for Online Computation of From-Region Potentially Visible Sets
Abstract: Visibility computation is a key element in computer graphics applications. More specifically, a from-region potentially visible set (PVS) is an established tool in rendering acceleration, but its high computational cost means a from-region PVS is almost always precomputed. Precomputation restricts the use of PVS to static scenes and leads to high storage cost, in particular, if we need fine-grained regions. For dynamic applications, such as streaming content over a variable-bandwidth network, online PVS computation with configurable region size is required. We address this need with trim regions, a new method for generating from-region PVS for arbitrary scenes in real time. Trim regions perform controlled erosion of object silhouettes in image space, implicitly applying the shrinking theorem known from previous work. Our algorithm is the first that applies automatic shrinking to unconstrained 3D scenes, including non-manifold meshes, and does so in real time using an efficient GPU execution model. We demonstrate that our algorithm generates a tight PVS for complex scenes and outperforms previous online methods for from-viewpoint and from-region PVS. It runs at 60 Hz for realistic game scenes consisting of millions of triangles and computes PVS with a tightness matching or surpassing existing approaches.
ACM Transactions on Graphics (SIGGRAPH '23), 2023
@article{10.1145/3592434,
author = {Voglreiter, Philip and Kerbl, Bernhard and Weinrauch, Alexander and Mueller, Joerg Hermann and Neff, Thomas and Steinberger, Markus and Schmalstieg, Dieter},
title = {Trim Regions for Online Computation of From-Region Potentially Visible Sets},
year = {2023},
issue_date = {August 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {42},
number = {4},
issn = {0730-0301},
url = {https://doi.org/10.1145/3592434},
doi = {10.1145/3592434},
abstract = {Visibility computation is a key element in computer graphics applications. More specifically, a from-region potentially visible set (PVS) is an established tool in rendering acceleration, but its high computational cost means a from-region PVS is almost always precomputed. Precomputation restricts the use of PVS to static scenes and leads to high storage cost, in particular, if we need fine-grained regions. For dynamic applications, such as streaming content over a variable-bandwidth network, online PVS computation with configurable region size is required. We address this need with trim regions, a new method for generating from-region PVS for arbitrary scenes in real time. Trim regions perform controlled erosion of object silhouettes in image space, implicitly applying the shrinking theorem known from previous work. Our algorithm is the first that applies automatic shrinking to unconstrained 3D scenes, including non-manifold meshes, and does so in real time using an efficient GPU execution model. We demonstrate that our algorithm generates a tight PVS for complex scenes and outperforms previous online methods for from-viewpoint and from-region PVS. It runs at 60 Hz for realistic game scenes consisting of millions of triangles and computes PVS with a tightness matching or surpassing existing approaches.},
journal = {ACM Trans. Graph.},
month = {jul},
articleno = {85},
numpages = {15},
keywords = {real-time, potentially visible sets}
}
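The image-space erosion underlying trim regions can be illustrated with plain binary erosion of a silhouette mask. This is a deliberately naive numpy stand-in; the paper's method performs controlled, GPU-parallel erosion that implicitly realizes the shrinking theorem:

```python
import numpy as np

def erode(mask, pixels):
    """Shrink a binary silhouette mask by `pixels` using a 4-neighborhood.
    Pixels at the image border keep missing neighbors unconstrained."""
    m = mask.astype(bool)
    for _ in range(pixels):
        s = m.copy()
        s[1:, :] &= m[:-1, :]   # require upper neighbor
        s[:-1, :] &= m[1:, :]   # require lower neighbor
        s[:, 1:] &= m[:, :-1]   # require left neighbor
        s[:, :-1] &= m[:, 1:]   # require right neighbor
        m = s
    return m
```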
Pascal Stadlbauer, Alexander Weinrauch, Wolfgang Tatzgern, Markus Steinberger:
Surface Light Cones: Sharing Direct Illumination for Efficient Multi-viewer Rendering
Abstract: Even though stochastic methods and hardware supported ray tracing are increasingly used for computing direct illumination, the efficient real-time rendering of dynamic area light sources still forms a challenge. In this paper, we propose a method for representing and caching direct illumination information using a compact multi-cone representation that is stored on the surface of objects. While shading due to direct illumination is typically heavily view-dependent, the incoming radiance for surface points is view-independent. Relying on cones to represent the projection of the dominant visible light sources allows to reuse the incoming radiance information across frames and even among multiple cameras or viewers within the same scene. Progressively refining and updating the cone structures not only allows to adapt to dynamic scenes, but also leads to reduced noise levels in the output images compared to sampling based methods. Relying on surface light cones allows to render single viewer setups 2-3x faster than random sampling, and 1.5-2x faster than reservoir-based sampling with the same quality. The main selling point for surface light cones is multi-camera rendering. For stereo rendering, our approach essentially halves the time required for determining direct light visibility. For rendering in the cloud, where multiple viewers are positioned close to another, such as in virtual meetings, gathering locations in games, or online events such as virtual concerts, our approach can reduce overall rendering times by a factor of 20x for as few as 16 viewers in a scene compared to traditional light sampling. Finally, under heavily constrained ray budgets where noise levels typically overshadow bias, surface light cones can dramatically reduce noise.
@inproceedings {10.2312:hpg.20231137,
booktitle = {High-Performance Graphics - Symposium Papers},
editor = {Bikker, Jacco and Gribble, Christiaan},
title = {{Surface Light Cones: Sharing Direct Illumination for Efficient Multi-viewer Rendering}},
author = {Stadlbauer, Pascal and Weinrauch, Alexander and Tatzgern, Wolfgang and Steinberger, Markus},
year = {2023},
publisher = {The Eurographics Association},
ISSN = {2079-8687},
ISBN = {978-3-03868-229-5},
DOI = {10.2312/hpg.20231137}
}
Robert Stojanovic, Alexander Weinrauch, Wolfgang Tatzgern, Andreas Kurz, Markus Steinberger:
Efficient Rendering of Participating Media for Multiple Viewpoints
Abstract: Achieving realism in modern games requires the integration of participating media effects, such as fog, dust, and smoke. However, due to the complex nature of scattering and partial occlusions within these media, real-time rendering of high-quality participating media remains a computational challenge. To address this challenge, traditional approaches of real-time participating media rendering involve storing temporary results in a view-aligned grid before ray marching through these cached values. In this paper, we investigate alternative hybrid world- and view-aligned caching methods that allow for the sharing of intermediate computations across cameras in a scene. This approach is particularly relevant for multi-camera setups, such as stereo rendering for VR and AR, local split-screen games, or cloud-based rendering for game streaming, where a large number of players may be in the same location. Our approach relies on a view-aligned grid for near-field computations, which enables us to capture high-frequency shadows in front of a viewer. Additionally, we use a world-space caching structure to selectively activate distant computations based on each viewer's visibility, allowing for the sharing of computations and maintaining high visual quality. The results of our evaluation demonstrate computational savings of up to 50% or more, without compromising visual quality.
@inproceedings {10.2312:hpg.20231136,
booktitle = {High-Performance Graphics - Symposium Papers},
editor = {Bikker, Jacco and Gribble, Christiaan},
title = {{Efficient Rendering of Participating Media for Multiple Viewpoints}},
author = {Stojanovic, Robert and Weinrauch, Alexander and Tatzgern, Wolfgang and Kurz, Andreas and Steinberger, Markus},
year = {2023},
publisher = {The Eurographics Association},
ISSN = {2079-8687},
ISBN = {978-3-03868-229-5},
DOI = {10.2312/hpg.20231136}
}
Thomas Neff, Brian Budge, Zhao Dong, Dieter Schmalstieg, Markus Steinberger:
PSAO: Point-Based Split Rendering for Ambient Occlusion
Abstract: Recent advances in graphics hardware have enabled ray tracing to produce high-quality ambient occlusion (AO) in real-time, which is not plagued by the artifacts typically found in real-time screen-space approaches. However, the high computational cost of ray tracing remains a significant hurdle for low-power devices like standalone VR headsets or smartphones. To address this challenge, inspired by point-based global illumination and texture-space split rendering, we propose point-based split ambient occlusion (PSAO), a novel split-rendering system that streams points sparsely from server to client. PSAO first evenly distributes points across the scene, and then subsequently only transmits points that changed more than a given threshold, using an efficient hash grid to blend neighboring points for the final compositing pass on the client. PSAO outperforms recent texture-space shading approaches in terms of quality and required network bit rate, while demonstrating performance similar to commonly used lower-quality screen-space approaches. Our point-based split rendering representation lends itself to highly compressible signals such as AO and is scalable towards quality or bandwidth requirements by adjusting the number of points in the scene.
@inproceedings {10.2312:hpg.20231131,
booktitle = {High-Performance Graphics - Symposium Papers},
editor = {Bikker, Jacco and Gribble, Christiaan},
title = {{PSAO: Point-Based Split Rendering for Ambient Occlusion}},
author = {Neff, Thomas and Budge, Brian and Dong, Zhao and Schmalstieg, Dieter and Steinberger, Markus},
year = {2023},
publisher = {The Eurographics Association},
ISSN = {2079-8687},
ISBN = {978-3-03868-229-5},
DOI = {10.2312/hpg.20231131}
}
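The client-side compositing can be pictured as a spatial hash grid of sparse AO points that are distance-weight blended per pixel. A hypothetical pure-Python sketch (class and parameter names are mine; the real system runs this in a shader):

```python
import numpy as np
from collections import defaultdict

class HashGrid:
    """Hash sparse AO points into a uniform grid; blend neighbors on query."""
    def __init__(self, cell_size):
        self.cell = cell_size
        self.grid = defaultdict(list)   # cell coord -> [(position, ao)]

    def insert(self, pos, ao):
        key = tuple((pos // self.cell).astype(int))
        self.grid[key].append((pos, ao))

    def query(self, pos):
        """Distance-weighted blend of AO points in the 3x3x3 cell neighborhood."""
        c = (pos // self.cell).astype(int)
        wsum, aosum = 0.0, 0.0
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    cell = (c[0] + dx, c[1] + dy, c[2] + dz)
                    for p, ao in self.grid.get(cell, ()):
                        w = 1.0 / (1e-6 + np.linalg.norm(pos - p))
                        wsum += w
                        aosum += w * ao
        return aosum / wsum if wsum > 0 else 1.0   # fall back to unoccluded
```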
Alexander Weinrauch, Stephan Lorbek, Wolfgang Tatzgern, Pascal Stadlbauer, Markus Steinberger:
Clouds in the Cloud: Efficient Cloud-Based Rendering of Real-Time Volumetric Clouds
Abstract: Volumetric clouds play a crucial role in creating realistic, dynamic, and immersive virtual outdoor environments. However, rendering volumetric clouds in real-time presents a significant computational challenge on end-user devices. In this paper, we investigate the viability of moving computations to remote servers in the cloud and sharing them among many viewers in the same virtual world, without compromising the perceived quality of the final renderings. We propose an efficient rendering method for volumetric clouds and cloud shadows utilizing caches placed in the cloud layers and directly on the surface of objects. Volumetric cloud properties, like density and lighting, are cached on spheres positioned to represent cloud layers at varying heights. Volumetric cloud shadows are cached directly on the surfaces of receiving objects. This allows efficient rendering in scenarios where multiple viewers observe the same cloud formations by sharing redundant calculations and storing them over multiple frames. Due to the placement and structure of our caches, viewers on the ground still perceive plausible parallax under movement on the ground. In a user study, we found that viewers hardly perceive quality reductions even when computations are shared for viewers that are hundreds of meters apart. Due to the smoothness of the appearance of clouds, caching structures can use significantly reduced resolution and as such allow for efficient rendering even in single-viewer scenarios. Our quantitative experiments demonstrate computational cost savings proportional to the number of viewers placed in the scene when relying on our caches compared to traditional rendering.
@inproceedings {10.2312:hpg.20231138,
booktitle = {High-Performance Graphics - Symposium Papers},
editor = {Bikker, Jacco and Gribble, Christiaan},
title = {{Clouds in the Cloud: Efficient Cloud-Based Rendering of Real-Time Volumetric Clouds}},
author = {Weinrauch, Alexander and Lorbek, Stephan and Tatzgern, Wolfgang and Stadlbauer, Pascal and Steinberger, Markus},
year = {2023},
publisher = {The Eurographics Association},
ISSN = {2079-8687},
ISBN = {978-3-03868-229-5},
DOI = {10.2312/hpg.20231138}
}
Thomas Neff, Joerg Hermann Mueller, Markus Steinberger, Dieter Schmalstieg:
Meshlet shading atlas
Abstract: Aspects presented herein relate to methods and devices for graphics processing including an apparatus, e.g., a GPU. The apparatus may divide at least one scene into a plurality of meshlets, each of the meshlets including a plurality of primitives, and each of the primitives including a plurality of vertices. The apparatus may also calculate a pair of texture coordinates for each of the plurality of vertices. Further, the apparatus may select a size of each of the plurality of meshlets in the at least one scene based on the pair of the texture coordinates and based on a perspective projection of each of the plurality of meshlets. The apparatus may also calculate layout information in a meshlet atlas for each of the meshlets in the at least one scene. Moreover, the apparatus may shade each of a plurality of pixels in the meshlet atlas based on the calculated layout information.
US Patent App. 17/934,159, 2023
@misc{neff2023meshlet,
title={Meshlet shading atlas},
author={Neff, Thomas and M{\"u}ller, J{\"o}rg Hermann and Steinberger, Markus and Schmalstieg, Dieter},
year={2023},
month=mar,
publisher={Google Patents},
note={US Patent App. 17/934,159}
}
Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, Markus Steinberger, Philipp Slusallek:
AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms
Abstract: Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards more and more heterogeneous systems, it is becoming more and more important to not only evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but to also be able to continuously re-evaluate queue algorithms in light of novel architectures and capabilities. To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU as well as GPU architectures. In hopes that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data is made available as open source software at https://anydsl.github.io/anyq.
IEEE International Parallel and Distributed Processing Symposium (IPDPS'23), 2023
@INPROCEEDINGS{10177434,
author={Kenzel, Michael and Lemme, Stefan and Membarth, Richard and Kurtenacker, Matthias and Devillers, Hugo and Steinberger, Markus and Slusallek, Philipp},
booktitle={2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
title={AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms},
year={2023},
volume={},
number={},
pages={736-745},
doi={10.1109/IPDPS54959.2023.00079}}
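AnyQ itself targets CPU and GPU backends through AnyDSL abstractions; as a language-neutral illustration of the kind of queue interface such benchmarks exercise, here is a toy bounded multi-producer/multi-consumer queue with non-blocking try-semantics (all names are mine; the lock stands in for the atomics a GPU implementation would use on head/tail tickets):

```python
import threading

class BoundedQueue:
    """Toy bounded MPMC queue with try-semantics over a ring buffer."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # next slot to dequeue
        self.tail = 0          # next slot to enqueue
        self.lock = threading.Lock()

    def try_enqueue(self, item):
        with self.lock:
            if self.tail - self.head == len(self.buf):
                return False   # full
            self.buf[self.tail % len(self.buf)] = item
            self.tail += 1
            return True

    def try_dequeue(self):
        with self.lock:
            if self.head == self.tail:
                return None    # empty
            item = self.buf[self.head % len(self.buf)]
            self.head += 1
            return item
```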
Alexander Weinrauch, Daniel Mlakar, Hans-Peter Seidel, Markus Steinberger, Rhaleb Zayer:
A Variational Loop Shrinking Analogy for Handle and Tunnel Detection and Reeb Graph Construction on Surfaces
Abstract: The humble loop shrinking property played a central role in the inception of modern topology but it has been eclipsed by more abstract algebraic formalisms. This is particularly true in the context of detecting relevant non-contractible loops on surfaces where elaborate homological and/or graph theoretical constructs are favored in algorithmic solutions. In this work, we devise a variational analogy to the loop shrinking property and show that it yields a simple, intuitive, yet powerful solution allowing a streamlined treatment of the problem of handle and tunnel loop detection. Our formalization tracks the evolution of a diffusion front randomly initiated on a single location on the surface. Capitalizing on a diffuse interface representation combined with a set of rules for concurrent front interactions, we develop a dynamic data structure for tracking the evolution on the surface encoded as a sparse matrix which serves for performing both diffusion numerics and loop detection and acts as the workhorse of our fully parallel implementation. The substantiated results suggest our approach outperforms state of the art and robustly copes with highly detailed geometric models. As a byproduct, our approach can be used to construct Reeb graphs by diffusion thus avoiding commonly encountered issues when using Morse functions.
Computer Graphics Forum (EG'23), 2023
@article{https://doi.org/10.1111/cgf.14763,
author = {Weinrauch, Alexander and Mlakar, Daniel and Seidel, Hans-Peter and Steinberger, Markus and Zayer, Rhaleb},
title = {A Variational Loop Shrinking Analogy for Handle and Tunnel Detection and Reeb Graph Construction on Surfaces},
journal = {Computer Graphics Forum},
volume = {42},
number = {2},
pages = {309-320},
keywords = {CCS Concepts, • Computing methodologies → Shape analysis, Massively parallel algorithms},
doi = {https://doi.org/10.1111/cgf.14763},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14763},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.14763},
abstract = {Abstract The humble loop shrinking property played a central role in the inception of modern topology but it has been eclipsed by more abstract algebraic formalisms. This is particularly true in the context of detecting relevant non-contractible loops on surfaces where elaborate homological and/or graph theoretical constructs are favored in algorithmic solutions. In this work, we devise a variational analogy to the loop shrinking property and show that it yields a simple, intuitive, yet powerful solution allowing a streamlined treatment of the problem of handle and tunnel loop detection. Our formalization tracks the evolution of a diffusion front randomly initiated on a single location on the surface. Capitalizing on a diffuse interface representation combined with a set of rules for concurrent front interactions, we develop a dynamic data structure for tracking the evolution on the surface encoded as a sparse matrix which serves for performing both diffusion numerics and loop detection and acts as the workhorse of our fully parallel implementation. The substantiated results suggest our approach outperforms state of the art and robustly copes with highly detailed geometric models. As a byproduct, our approach can be used to construct Reeb graphs by diffusion thus avoiding commonly encountered issues when using Morse functions.},
year = {2023}
}
Jozef Hladky, Michael Stengel, Nicholas Vining, Bernhard Kerbl, Hans-Peter Seidel, Markus Steinberger:
QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction
Abstract: Streaming rendered 3D content over a network to a thin client device, such as a phone or a VR/AR headset, brings high-fidelity graphics to platforms where it would not normally be possible due to thermal, power, or cost constraints. Streamed 3D content must be transmitted with a representation that is both robust to latency and potential network dropouts. Transmitting a video stream and reprojecting to correct for changing viewpoints fails in the presence of disocclusion events; streaming scene geometry and performing high-quality rendering on the client is not possible on limited-power mobile GPUs. To balance the competing goals of disocclusion robustness and minimal client workload, we introduce QuadStream, a new streaming content representation that reduces motion-to-photon latency by allowing clients to efficiently render novel views without artifacts caused by disocclusion events. Motivated by traditional macroblock approaches to video codec design, we decompose the scene seen from positions in a view cell into a series of quad proxies, or view-aligned quads from multiple views. By operating on a rasterized G-Buffer, our approach is independent of the representation used for the scene itself; the resulting QuadStream is an approximate geometric representation of the scene that can be reconstructed by a thin client to render both the current view and nearby adjacent views. Our technical contributions are an efficient parallel quad generation, merging, and packing strategy for proxy views covering potential client movement in a scene; a packing and encoding strategy that allows masked quads with depth information to be transmitted as a frame-coherent stream; and an efficient rendering approach for rendering our QuadStream representation into entirely novel views on thin clients. We show that our approach achieves superior quality compared both to video data streaming methods, and to geometry-based streaming.
ACM Transactions on Graphics (SIGGRAPH Asia'22), 2022
@article{10.1145/3550454.3555524,
author = {Hladky, Jozef and Stengel, Michael and Vining, Nicholas and Kerbl, Bernhard and Seidel, Hans-Peter and Steinberger, Markus},
title = {QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction},
year = {2022},
issue_date = {December 2022},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {41},
number = {6},
issn = {0730-0301},
url = {https://doi.org/10.1145/3550454.3555524},
doi = {10.1145/3550454.3555524},
month = {nov},
articleno = {233},
numpages = {13}
}
Andreas Kurz, Thomas Neff, Zhaoyang Lv, Michael Zollhoefer, Markus Steinberger:
AdaNeRF: Adaptive Sampling for Real-Time Rendering of Neural Radiance Fields
Abstract: Novel view synthesis has recently been revolutionized by learning neural radiance fields directly from sparse observations. However, rendering images with this new paradigm is slow due to the fact that an accurate quadrature of the volume rendering equation requires a large number of samples for each ray. Previous work has mainly focused on speeding up the network evaluations that are associated with each sample point, e.g., via caching of radiance values into explicit spatial data structures, but this comes at the expense of model compactness. In this paper, we propose a novel dual-network architecture that takes an orthogonal direction by learning how to best reduce the number of required sample points. To this end, we split our network into a sampling and shading network that are jointly trained. Our training scheme employs fixed sample positions along each ray, and incrementally introduces sparsity throughout training to achieve high quality even at low sample counts. After fine-tuning with the target number of samples, the resulting compact neural representation can be rendered in real-time. Our experiments demonstrate that our approach outperforms concurrent compact neural representations in terms of quality and frame rate and performs on par with highly efficient hybrid representations.
European Conference on Computer Vision (ECCV'22), 2022
@inproceedings{kurz2022adanerf,
title={AdaNeRF: Adaptive Sampling for Real-Time Rendering of Neural Radiance Fields},
author={Kurz, Andreas and Neff, Thomas and Lv, Zhaoyang and Zollh{\"o}fer, Michael and Steinberger, Markus},
booktitle={European Conference on Computer Vision},
pages={254--270},
year={2022},
organization={Springer}
}
Philip Voglreiter, Dieter Schmalstieg, Markus Steinberger:
Methods and apparatus for order-independent occlusion computations
Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus can determine geometry information for each of a plurality of primitives associated with a viewpoint in a scene. The apparatus can also calculate at least one of surface information and disocclusion information based on the geometry information for each of the plurality of primitives, where the surface information and the disocclusion information may be associated with a volumetric grid based on a viewing area corresponding to the viewpoint. Also, the apparatus can calculate visibility information for each of the plurality of primitives based on at least one of the surface information and the disocclusion information, where the visibility information may be associated with the volumetric grid. The apparatus can also determine whether each of the plurality of primitives is visible based on the visibility information for each of the plurality of primitives.
US Patent 11,380,047, 2022
@misc{voglreiter2022methods,
title={Methods and apparatus for order-independent occlusion computations},
author={Voglreiter, Philip and Schmalstieg, Dieter and Steinberger, Markus},
year={2022},
month=jul,
publisher={Google Patents},
note={US Patent 11,380,047}
}
Mathias Parger, Chengcheng Tang, Christopher D. Twigg, Cem Keskin, Robert Wang, Markus Steinberger:
DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
@inproceedings{parger2022deltacnn,
title={DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos},
author={Parger, Mathias and Tang, Chengcheng and Twigg, Christopher D and Keskin, Cem and Wang, Robert and Steinberger, Markus},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={12497--12506},
year={2022}
}
Thomas Neff, Joerg H. Mueller, Markus Steinberger, Dieter Schmalstieg:
Meshlets and How to Shade Them: A Study on Texture-Space Shading
Abstract: Commonly used image-space layouts of shading points, such as used in deferred shading, are strictly view-dependent, which restricts efficient caching and temporal amortization. In contrast, texture-space layouts can represent shading on all surface points and can be tailored to the needs of a particular application. However, the best grouping of shading points—which we call a shading unit—in texture space remains unclear. Choices of shading unit granularity (how many primitives or pixels per unit) and in shading unit parametrization (how to assign texture coordinates to shading points) lead to different outcomes in terms of final image quality, overshading cost, and memory consumption. Among the possible choices, shading units consisting of larger groups of scene primitives, so-called meshlets, remain unexplored as of yet. In this paper, we introduce a taxonomy for analyzing existing texture-space shading methods based on the group size and parametrization of shading units. Furthermore, we introduce a novel texture-space layout strategy that operates on large shading units: the meshlet shading atlas. We experimentally demonstrate that the meshlet shading atlas outperforms previous approaches in terms of image quality, run-time performance and temporal upsampling for a given number of fragment shader invocations. The meshlet shading atlas lends itself to work together with popular cluster-based rendering of meshes with high geometric detail.
Computer Graphics Forum (EG'22), 2022
@article{https://doi.org/10.1111/cgf.14474,
author = {Neff, T. and Mueller, J. H. and Steinberger, M. and Schmalstieg, D.},
title = {Meshlets and How to Shade Them: A Study on Texture-Space Shading},
journal = {Computer Graphics Forum},
volume = {41},
number = {2},
pages = {277-287},
keywords = {CCS Concepts, Computing methodologies Rendering, Texturing},
doi = {https://doi.org/10.1111/cgf.14474},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.14474},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.14474},
year = {2022}
}
Outline: Since its inception, the CUDA programming model has been continuously evolving. Because the CUDA toolkit aims to consistently expose cutting-edge capabilities for general purpose compute jobs to its users, the added features in each new version reflect the rapid changes that we observe in GPU architectures. Over the years, the changes in hardware, a growing scope of built-in functions and libraries, as well as an advancing C++ standard compliance have expanded the design choices when coding for CUDA, and significantly altered the directives to achieve peak performance. In this tutorial, we give a thorough introduction to the CUDA toolkit, demonstrate how a contemporary application can benefit from recently introduced features and how they can be applied to task-based GPU scheduling in particular.
To provide a profound understanding of how CUDA applications can achieve peak performance, Part 1 of this tutorial outlines the modern CUDA architecture. Following a basic introduction, we expose how language features are linked to – and constrained by – the underlying physical hardware components. Furthermore, we describe common applications for massively parallel programming, offer a detailed breakdown of potential issues and list ways to mitigate performance impacts. An exemplary analysis of PTX and SASS snippets illustrates how code patterns in CUDA are mapped to actual hardware instructions.
In Part 2, we will focus on novel features that were enabled by the arrival of CUDA 10+ toolkits and the Volta+ architectures, such as ITS, tensor cores and the graph API. In addition to basic use case demonstrations, we outline our own experiences with these capabilities and their potential performance benefits. We also discuss how long-standing best practices are affected by these changes and describe common caveats for dealing with legacy code on recent GPU models. We show how these considerations can be implemented in practice by presenting state-of-the-art research into task-based GPU scheduling, and how the dynamic adjustment of thread roles and group configurations can significantly increase performance.
Eurographics Tutorials'22, 2022
PA07
Dieter Schmalstieg, Markus Steinberger, Wolfgang Tatzgern:
Billboard layers in object-space rendering
Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus may configure a plurality of billboards associated with a viewpoint of a first frame of a plurality of frames, the plurality of billboards being configured in one or more layers at least partially around the viewpoint, the configuration of the plurality of billboards being based on one or more volumetric elements between at least one of the plurality of billboards and the viewpoint. The apparatus may also render an image associated with each of the one or more volumetric elements between at least one billboard of the plurality of billboards and the viewpoint, the rendered image including a set of pixels. The apparatus may also store data in the at least one billboard based on the rendered image associated with each of the one or more volumetric elements, the data corresponding to the set of pixels.
US Patent App. 17/400,031, 2022
@misc{schmalstieg2022billboard,
title={Billboard layers in object-space rendering},
author={Schmalstieg, Dieter and Steinberger, Markus and Tatzgern, Wolfgang},
year={2022},
month=feb,
publisher={Google Patents},
note={US Patent App. 17/400,031}
}
Dieter Schmalstieg, Pascal Stadlbauer, Markus Steinberger:
Image-space function transmission
Abstract: The present disclosure relates to methods and apparatus for graphics processing at a server and/or a client device. In some aspects, the apparatus may convert application data for at least one frame, the application data corresponding to one or more image functions or one or more data channels. The apparatus may also encode the application data for the at least one frame, the application data being associated with a data stream, the application data being encoded via a video encoding process. The apparatus may also transmit the encoded application data for the at least one frame. Additionally, the apparatus may receive application data for at least one frame, the application data being associated with a data stream. The apparatus may also decode the application data for the at least one frame; and convert the application data for the at least one frame.
US Patent App. 17/400,048, 2022
@misc{schmalstieg2022image,
title={Image-space function transmission},
author={Schmalstieg, Dieter and Stadlbauer, Pascal and Steinberger, Markus},
year={2022},
month=feb,
publisher={Google Patents},
note={US Patent App. 17/400,048}
}
Dieter Schmalstieg, Markus Steinberger, Daniel Mlakar:
Compressed geometry rendering and streaming
Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus may identify at least one mesh associated with at least one frame. The apparatus may also divide the at least one mesh into a plurality of groups of primitives, each of the plurality of groups of primitives including at least one primitive and a plurality of vertices. The apparatus may also compress the plurality of groups of primitives into a plurality of groups of compressed primitives, the plurality of groups of compressed primitives being associated with random access. Additionally, the apparatus may decompress the plurality of groups of compressed primitives, at least one first group of the plurality of groups of compressed primitives being decompressed in parallel with at least one second group of the plurality of groups of compressed primitives.
US Patent App. 17/400,065, 2022
@misc{schmalstieg2022compressed,
title={Compressed geometry rendering and streaming},
author={Schmalstieg, Dieter and Steinberger, Markus and Mlakar, Daniel},
year={2022},
month=feb,
publisher={Google Patents},
note={US Patent App. 17/400,065}
}
Benedikt Mayr, Alexander Weinrauch, Mathias Parger, Markus Steinberger:
Are van Emde Boas trees viable on the GPU?
Abstract: Van Emde Boas trees show an asymptotic query complexity surpassing the performance of traditional data structures for performing search queries in large data sets. However, their implementation and large constant overheads prohibit their widespread use. In this work, we ask whether van Emde Boas trees are viable on the GPU. We present a novel algorithm to construct a van Emde Boas tree utilizing the parallel compute power and memory bandwidth of modern GPUs. By analyzing the structure of a sorted data set, our method is able to build a van Emde Boas tree efficiently in parallel with little thread synchronization. We compare querying data using a van Emde Boas tree and binary search on the GPU and show that for very large data sets, the van Emde Boas tree outperforms a binary search by up to 1.2x while similarly increasing space requirements. Overall, we confirm that van Emde Boas trees are viable for certain scenarios on the GPU.
IEEE High Performance Extreme Computing Conference (HPEC'21), 2021
@INPROCEEDINGS{9622837,
author={Mayr, Benedikt and Weinrauch, Alexander and Parger, Mathias and Steinberger, Markus},
booktitle={2021 IEEE High Performance Extreme Computing Conference (HPEC)},
title={Are van Emde Boas trees viable on the GPU?},
year={2021},
volume={},
number={},
pages={1-7},
doi={10.1109/HPEC49654.2021.9622837}
}
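Since a full van Emde Boas implementation is a barrier to adoption (as the abstract notes), here is a compact single-threaded Python sketch of the structure's recursive high/low decomposition with insert and successor. It illustrates the O(log log u) query that the paper maps to the GPU, not the parallel construction itself; the code is my own simplification:

```python
class VEB:
    """Simplified van Emde Boas tree over the universe {0..u-1}."""
    def __init__(self, u):
        self.u = u
        self.min = None
        self.max = None
        if u > 2:
            # upper square root of the universe size
            self.sqrt = 1 << ((u - 1).bit_length() + 1) // 2
            self.summary = None     # which clusters are non-empty
            self.clusters = {}      # cluster index -> VEB(self.sqrt)

    def _high(self, x): return x // self.sqrt
    def _low(self, x):  return x % self.sqrt

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            self.min, x = x, self.min   # lazy min: not stored in clusters
        if x > self.max:
            self.max = x
        if self.u > 2 and x != self.min:
            h, l = self._high(x), self._low(x)
            if h not in self.clusters:
                self.clusters[h] = VEB(self.sqrt)
                if self.summary is None:
                    self.summary = VEB(self.sqrt)
                self.summary.insert(h)
            self.clusters[h].insert(l)

    def successor(self, x):
        if self.min is not None and x < self.min:
            return self.min
        if self.u <= 2:
            return self.max if self.max is not None and x < self.max else None
        h, l = self._high(x), self._low(x)
        c = self.clusters.get(h)
        if c is not None and c.max is not None and l < c.max:
            return h * self.sqrt + c.successor(l)   # stay inside cluster
        nh = self.summary.successor(h) if self.summary else None
        if nh is None:
            return None
        return nh * self.sqrt + self.clusters[nh].min  # next cluster's min

v = VEB(1 << 16)
for k in (3, 9, 1024, 40000):
    v.insert(k)
assert v.successor(10) == 1024
```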
Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H. Mueller, Chakravarty R. Alla Chaitanya, Anton S. Kaplanyan, Markus Steinberger:
DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks
Abstract: The recent research explosion around Neural Radiance Fields (NeRFs) shows that there is immense potential for implicitly storing scene and lighting information in neural networks, e.g., for novel view generation. However, one major limitation preventing the widespread use of NeRFs is the prohibitive computational cost of excessive network evaluations along each view ray, requiring dozens of petaFLOPS when aiming for real-time rendering on current devices. We show that the number of samples required for each view ray can be significantly reduced when local samples are placed around surfaces in the scene. To this end, we propose a depth oracle network, which predicts ray sample locations for each view ray with a single network evaluation. We show that using a classification network around logarithmically discretized and spherically warped depth values is essential to encode surface locations rather than directly estimating depth. The combination of these techniques leads to DONeRF, a dual network design with a depth oracle network as a first step and a locally sampled shading network for ray accumulation. With our design, we reduce the inference costs by up to 48x compared to NeRF. Using an off-the-shelf inference API in combination with simple compute kernels, we are the first to render raymarching-based neural representations at interactive frame rates (15 frames per second at 800x800) on a single GPU. At the same time, since we focus on the important parts of the scene around surfaces, we achieve equal or better quality compared to NeRF.
Computer Graphics Forum (EGSR'21), 2021
@article {neff2021donerf,
journal = {Computer Graphics Forum},
title = {{DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks}},
author = {Neff, Thomas and Stadlbauer, Pascal and Parger, Mathias and Kurz, Andreas and Mueller, Joerg H. and Chaitanya, Chakravarty R. Alla and Kaplanyan, Anton S. and Steinberger, Markus},
year = {2021},
publisher = {The Eurographics Association and John Wiley & Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.14340},
url = {https://doi.org/10.1111/cgf.14340},
volume = {40},
number = {4},
}
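Two ingredients from the abstract are easy to sketch in isolation: the logarithmic discretization of depth used for the oracle's classification targets, and the placement of a few samples locally around a predicted surface depth. This is an illustrative numpy simplification (function names and the `spread` heuristic are mine; the actual oracle is a trained network predicting per-ray class distributions):

```python
import numpy as np

def log_discretize(depth, near, far, bins):
    """Map depth to one of `bins` logarithmically spaced classes."""
    t = np.log(depth / near) / np.log(far / near)    # 0..1, log-spaced
    return np.clip((t * bins).astype(int), 0, bins - 1)

def local_samples(pred_depth, n, spread=0.05):
    """Place n ray samples tightly around an oracle-predicted surface depth."""
    offsets = np.linspace(-spread, spread, n)
    return pred_depth * (1.0 + offsets)
```

Concentrating a handful of samples near the predicted surface, instead of marching the whole ray, is what drives the reported reduction in network evaluations.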
Mathias Parger, Chengcheng Tang, Yuanlu Xu, Christopher D. Twigg, Lingling Tao, Yijing Li, Robert Wang, Markus Steinberger:
UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality
Abstract: Tracking body and hand motions in 3D space is essential for social and self-presence in augmented and virtual environments. Unlike the popular 3D pose estimation setting, the problem is often formulated as egocentric tracking based on embodied perception (e.g., egocentric cameras, handheld sensors). In this article, we propose a new data-driven framework for egocentric body tracking, targeting challenges of omnipresent occlusions in optimization-based methods (e.g., inverse kinematics solvers). We first collect a large-scale motion capture dataset with both body and finger motions using optical markers and inertial sensors. This dataset focuses on social scenarios and captures ground truth poses under self-occlusions and body-hand interactions. We then simulate the occlusion patterns in head-mounted camera views on the captured ground truth using a ray casting algorithm and learn a deep neural network to infer the occluded body parts. Our experiments show that our method is able to generate high-fidelity embodied poses by applying the proposed method to the task of real-time egocentric body tracking, finger motion synthesis, and 3-point inverse kinematics.
IEEE Transactions on Visualization and Computer Graphics, 2021
@article{9444887,
author={Parger, Mathias and Tang, Chengcheng and Xu, Yuanlu and Twigg, Christopher D. and Tao, Lingling and Li, Yijing and Wang, Robert and Steinberger, Markus},
journal={IEEE Transactions on Visualization and Computer Graphics},
title={UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality},
year={2022},
volume={28},
number={12},
pages={4240-4251},
doi={10.1109/TVCG.2021.3085407}
}
Daniel Mlakar, Martin Winter, Mathias Parger, Markus Steinberger:
Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures
Abstract: Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU are viable options. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa.
Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU-thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to the ground-truth single-threaded algorithm.
We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA’s single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.
IEEE International Parallel and Distributed Processing Symposium (IPDPS'21), 2021
@inproceedings{SpeculativeRCM,
title={Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures},
author={Mlakar, Daniel and Winter, Martin and Parger, Mathias and Steinberger, Markus},
booktitle={2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)},
year={2021},
organization={IEEE}
}
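For reference, the inherently serial core that the paper parallelizes is the textbook Cuthill-McKee traversal: a BFS from a low-degree start node that visits children in order of ascending degree, reversed at the end. A plain Python version (my own, for illustration only):

```python
from collections import deque

def rcm(adj):
    """Reference serial Reverse Cuthill-McKee on an adjacency list
    (adj[v] is an iterable of neighbors of node v)."""
    n = len(adj)
    degree = [len(adj[v]) for v in range(n)]
    visited = [False] * n
    order = []
    # start each component at a node of lowest degree
    for start in sorted(range(n), key=lambda v: degree[v]):
        if visited[start]:
            continue
        visited[start] = True
        q = deque([start])
        while q:
            v = q.popleft()
            order.append(v)
            # enqueue unvisited neighbors in order of ascending degree
            for w in sorted(adj[v], key=lambda w: degree[w]):
                if not visited[w]:
                    visited[w] = True
                    q.append(w)
    return order[::-1]   # the "reverse" in Reverse Cuthill-McKee
```

Every appended node depends on the order of all previously appended ones, which is exactly why the paper resorts to speculative batches and a signaling chain to parallelize it.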
Outline: Since its inception, the CUDA programming model has been continuously evolving. Because the CUDA toolkit aims to consistently expose cutting-edge capabilities for general purpose compute jobs to its users, the added features in each new version reflect the rapid changes that we observe in GPU architectures. Over the years, the changes in hardware, a growing scope of built-in functions and libraries, as well as an advancing C++ standard compliance have expanded the design choices when coding for CUDA, and significantly altered the directives to achieve peak performance. In this tutorial, we give a thorough introduction to the CUDA toolkit, demonstrate how a contemporary application can benefit from recently introduced features and how they can be applied to task-based GPU scheduling in particular.
To provide a profound understanding of how CUDA applications can achieve peak performance, Part 1 of this tutorial outlines the modern CUDA architecture. Following a basic introduction, we expose how language features are linked to – and constrained by – the underlying physical hardware components. Furthermore, we describe common applications for massively parallel programming, offer a detailed breakdown of potential issues and list ways to mitigate performance impacts. An exemplary analysis of PTX and SASS snippets illustrates how code patterns in CUDA are mapped to actual hardware instructions.
In Part 2, we will focus on novel features that were enabled by the arrival of CUDA 10+ toolkits and the Volta+ architectures, such as ITS, tensor cores and the graph API. In addition to basic use case demonstrations, we outline our own experiences with these capabilities and their potential performance benefits. We also discuss how long-standing best practices are affected by these changes and describe common caveats for dealing with legacy code on recent GPU models. We show how these considerations can be implemented in practice by presenting state-of-the-art research into task-based GPU scheduling, and how the dynamic adjustment of thread roles and group configurations can significantly increase performance.
Eurographics Tutorials'21, 2021
J34
Jozef Hladky, Hans-Peter Seidel, Markus Steinberger:
SnakeBinning: Efficient Temporally Coherent Triangle Packing for Shading Streaming
Abstract: Streaming rendering, e.g., rendering in the cloud and streaming via a mobile connection, suffers from increased latency and unreliable connections. High quality framerate upsampling can hide these issues, especially when capturing shading into an atlas and transmitting it alongside geometric information. The captured shading information must consider triangle footprints and temporal stability to ensure efficient video encoding. Previous approaches only consider either temporal stability or sample distributions, but none focuses on both. With SnakeBinning, we present an efficient triangle packing approach that adjusts sample distributions and caters for temporal coherence. Using a multi-dimensional binning approach, we enforce tight packing among triangles while creating optimal sample distributions. Our binning is built on top of hardware supported real-time rendering where bins are mapped to individual pixels in a virtual framebuffer. Fragment shader interlock and atomic operations enforce global ordering of triangles within each bin, and thus temporal coherence according to the primitive order is achieved. Resampling the bin distribution guarantees high occupancy among all bins and a dense atlas packing. Shading samples are directly captured into the atlas using a rasterization pass, adjusting samples for perspective effects and creating a tight packing. Comparison to previous atlas packing approaches shows that our approach is faster than previous work and achieves the best sample distributions while maintaining temporal coherence. In this way, SnakeBinning achieves the highest rendering quality under equal atlas memory requirements. At the same time, its temporal coherence ensures that we require equal or less bandwidth than previous state-of-the-art. As SnakeBinning outperforms previous approaches in all relevant aspects, it is the preferred choice for texture-based streaming rendering.
Computer Graphics Forum (Eurographics'21), 2021
@article {10.1111:cgf.142648,
journal = {Computer Graphics Forum},
title = {{SnakeBinning: Efficient Temporally Coherent Triangle Packing for Shading Streaming}},
author = {Hladky, Jozef and Seidel, Hans-Peter and Steinberger, Markus},
year = {2021},
publisher = {The Eurographics Association and John Wiley & Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.142648}
}
Joerg H. Mueller, Thomas Neff, Philip Voglreiter, Markus Steinberger, Dieter Schmalstieg:
Temporally Adaptive Shading Reuse for Real-Time Rendering and Virtual Reality
Abstract: Temporal coherence has the potential to enable a huge reduction of shading costs in rendering. Existing techniques focus either only on spatial shading reuse or cannot adaptively choose temporal shading frequencies. We find that temporal shading reuse is possible for extended periods of time for a majority of samples, and we show under which circumstances users perceive temporal artifacts. Our analysis implies that we can approximate shading gradients to efficiently determine when and how long shading can be reused. Whereas visibility usually stays temporally coherent from frame to frame for more than 90%, we find that even in heavily animated game scenes with advanced shading, typically more than 50% of shading is also temporally coherent. To exploit this potential, we introduce a temporally adaptive shading framework and apply it to two real-time methods. Its application saves more than 57% of the shader invocations, reducing overall rendering times in virtual reality applications without a noticeable loss in visual quality. Overall, our work shows that there is significantly more potential for shading reuse than currently exploited.
ACM Transactions on Graphics (TOG), presented at SIGGRAPH 2021, 2021
@article{10.1145/3446790,
author = {Mueller, Joerg H. and Neff, Thomas and Voglreiter, Philip and Steinberger, Markus and Schmalstieg, Dieter},
title = {Temporally Adaptive Shading Reuse for Real-Time Rendering and Virtual Reality},
year = {2021},
issue_date = {April 2021},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {40},
number = {2},
issn = {0730-0301},
url = {https://doi.org/10.1145/3446790},
doi = {10.1145/3446790},
journal = {ACM Trans. Graph.},
month = apr,
articleno = {11},
numpages = {14}
}
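One simplistic way to picture the gradient-based reuse decision: estimate a temporal shading gradient per sample from two consecutive frames, then extrapolate how many frames the cached value can live before it drifts past a tolerance. The paper's criterion is perceptually grounded and more elaborate; this numpy sketch with a hypothetical `eps` tolerance only conveys the shape of the idea:

```python
import numpy as np

def reuse_frames(prev_vals, cur_vals, eps=1e-2, max_frames=8):
    """Per-sample estimate of how many frames a shading result may be
    reused before it likely changes by more than `eps` (crude linear
    extrapolation of the temporal gradient)."""
    grad = np.abs(cur_vals - prev_vals)           # change per frame
    n = np.floor(eps / np.maximum(grad, 1e-9))    # frames until eps is hit
    return np.clip(n, 1, max_frames).astype(int)
```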
Martin Winter, Mathias Parger, Daniel Mlakar, Markus Steinberger:
Are Dynamic Memory Managers on GPUs Slow? A Survey and Benchmarks
Abstract: Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread contention, synchronization overhead and fragmentation. Various approaches have been proposed in the last ten years and we set out to evaluate them on a level playing field on modern hardware to answer the question of whether dynamic memory managers are as slow as commonly thought. In this survey paper, we provide a consistent framework to evaluate all publicly available memory managers in a large set of scenarios. We summarize each approach and thoroughly evaluate allocation performance (thread-based as well as warp-based), and look at performance scaling, fragmentation and real-world performance considering a synthetic workload as well as updating dynamic graphs. We discuss the strengths and weaknesses of each approach and provide guidelines for the respective best usage scenario. We provide a unified interface to integrate any of the tested memory managers into an application and switch between them for benchmarking purposes. Given our results, we can dispel some of the dread associated with dynamic memory managers on the GPU.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'21), 2021
@inproceedings{10.1145/3437801.3441612,
author = {Winter, Martin and Parger, Mathias and Mlakar, Daniel and Steinberger, Markus},
title = {Are Dynamic Memory Managers on GPUs Slow? A Survey and Benchmarks},
year = {2021},
isbn = {9781450382946},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3437801.3441612},
doi = {10.1145/3437801.3441612},
booktitle = {Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {219–233},
numpages = {15},
location = {Virtual Event, Republic of Korea},
series = {PPoPP '21}
}
Mark Dokter, Dieter Schmalstieg, Markus Steinberger:
Methods and apparatus for improving subpixel visibility
Abstract: The present disclosure relates to methods and devices for operation of a GPU. The device can determine a first subset of primitives associated with a set of objects within an image. The first subset of primitives can be based on a first viewpoint with respect to the set of objects. The device can also determine, for a second viewpoint with respect to the set of objects, a second subset of primitives excluding the first subset of primitives. In some aspects, the second subset of primitives can have a difference in depth with respect to the first subset of primitives that is less than a threshold depth. Additionally, the device can mark the first subset of primitives and the second subset of primitives as visible. Further, the device can generate graphical content based on the marked first subset of primitives and the marked second subset of primitives.
US Patent 10,867,431, 2020
@misc{dokter2020methods,
title={Methods and apparatus for improving subpixel visibility},
author={Dokter, Mark and Schmalstieg, Dieter and Steinberger, Markus},
year={2020},
month=dec # "~15",
publisher={Google Patents},
note={US Patent 10,867,431}
}
Jozef Hladky, Markus Steinberger, Hans-Peter Seidel:
Real-time potentially visible set for streaming rendering
Abstract: The invention relates to a method for transmitting 3D model data, the 3D model data comprising polygons, from a server to a client for rendering, the method comprising: obtaining the 3D model data by the server; and transmitting the 3D model data from the server to the client. According to the invention, the 3D model data is obtained by the server, based on a given multitude of possible views.
US Patent App. 16/859,315, 2020
@misc{hladky2020real,
title={Real-time potentially visible set for streaming rendering},
author={Hladk{\`y}, Jozef and Steinberger, Markus and Seidel, Hans-Peter},
year={2020},
month=oct # "~15",
publisher={Google Patents},
note={US Patent App. 16/859,315}
}
Abstract: Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU),
is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active
threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory
directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue.
In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top
of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable
user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink
dynamically: only currently needed queue chunks are held in memory, and freed-up queue chunks can be reused within the system. Thus, we retain
the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA
TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing,
we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations,
such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic
performance on PageRank and Static Triangle Counting.
Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model,
a drop-in replacement for comparable dynamic memory allocators.
International Conference on Supercomputing (ICS'20), 2020
@inproceedings{10.1145/3392717.3392742,
author = {Winter, Martin and Mlakar, Daniel and Parger, Mathias and Steinberger, Markus},
title = {Ouroboros: Virtualized Queues for Dynamic Memory Management on GPUs},
year = {2020},
isbn = {9781450379830},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3392717.3392742},
doi = {10.1145/3392717.3392742},
booktitle = {Proceedings of the 34th ACM International Conference on Supercomputing},
articleno = {38},
numpages = {12},
keywords = {queueing, dynamic graphs, resource management, dynamic memory allocation, GPU},
location = {Barcelona, Spain},
series = {ICS '20}
}
Abstract: Efficient rendering of multiple views can be a critical performance factor for real-time rendering
applications. Generating more than one view multiplies the amount of rendered geometry, which can cause a huge
performance impact. Minimizing that impact has been a target of previous research and GPU manufacturers, who have
started to equip devices with dedicated acceleration units. However, vendor-specific acceleration is not the only
option to increase multi-view rendering (MVR) performance. Available graphics API features, shader stages and
optimizations can be exploited for improved MVR performance, while generally offering more versatile pipeline
configurations, including the preservation of custom tessellation and geometry shaders. In this paper, we present
an exhaustive evaluation of MVR pipelines available on modern GPUs. We provide a detailed analysis of previous
techniques, hardware-accelerated MVR and propose a novel method, leading to the creation of an MVR catalogue.
Our analyses cover three distinct applications to help gain clarity on overall MVR performance characteristics.
Our interpretation of the observed results provides a guideline for selecting the most appropriate technique for various
use cases on different GPU architectures.
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV '20), 2020
@inproceedings{Unterguggenberger:2020:FMVR,
title = {Fast Multi-View Rendering for Real-Time Applications},
author = {Johannes Unterguggenberger and Bernhard Kerbl and Markus Steinberger and Dieter Schmalstieg and Michael Wimmer},
year = {2020},
month = may,
booktitle = {Eurographics Symposium on Parallel Graphics and Visualization},
event = {EGPGV 2020},
pages = {13--23},
}
Abstract: The rich and evocative patterns of natural tessellations endow them with an unmistakable
artistic appeal and structural properties which are echoed across design, production, and manufacturing.
Unfortunately, interactive control of such patterns, as modeled by Voronoi diagrams, is limited to the simple
two-dimensional case and does not extend well to freeform surfaces. We present an approach for direct modeling
and editing of such cellular structures on surface meshes. The overall modeling experience is driven by a set
of editing primitives which are efficiently implemented on graphics hardware. We feature a novel application
for 3D printing on modern support-free additive manufacturing platforms. Our method decomposes the input surface
into a cellular skeletal structure which hosts a set of overlay shells. In this way, material saving can be
channeled to the shells while structural stability is channeled to the skeleton. To accommodate the available
printer build volume, the cellular structure can be further split into moderately sized parts. Together with
shells, they can be conveniently packed to save on production time. The assembly of the printed parts is
streamlined by a part numbering scheme which respects the geometric layout of the input model.
Computer Graphics Forum / Eurographics (EG'20), 2020
@article {10.1111:cgf.13929,
journal = {Computer Graphics Forum},
title = {{Interactive Modeling of Cellular Structures on Surfaces with Application to Additive Manufacturing}},
author = {Stadlbauer, Pascal and Mlakar, Daniel and Seidel, Hans-Peter and Steinberger, Markus and Zayer, Rhaleb},
year = {2020},
publisher = {The Eurographics Association and John Wiley \& Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.13929}
}
Abstract: Subdivision surfaces have become an invaluable asset in production environments. While progress over the last years has allowed
the use of graphics hardware to meet performance demands during animation and rendering, high performance is limited
to immutable mesh connectivity scenarios. Motivated by recent progress in mesh data structures, we show how the complete
Catmull-Clark subdivision scheme can be abstracted in the language of linear algebra. While this high-level formulation
allows for a fully parallel implementation with significant performance gains, the underlying algebraic operations require
further specialization for modern parallel hardware. Integrating domain knowledge about the mesh matrix data structure, we
replace costly general linear algebra operations like matrix-matrix multiplication by specialized kernels. By further considering
innate properties of Catmull-Clark subdivision, like the quad-only structure after refinement, we achieve an additional order of
magnitude in performance and significantly reduce memory footprints. Our approach can be adapted seamlessly for different use
cases, such as regular subdivision of dynamic meshes, fast evaluation for immutable topology and feature-adaptive subdivision
for efficient rendering of animated models. In this way, patchwork solutions are avoided in favor of a streamlined solution with
consistent performance gains throughout the production pipeline. The versatility of the sparse matrix linear algebra abstraction
underlying our work is further demonstrated by extension to other schemes such as √3 and Loop subdivision.
Eurographics '20 Best Paper Award. Computer Graphics Forum / Eurographics (EG'20), 2020
@article {10.1111:cgf.13934,
journal = {Computer Graphics Forum},
title = {{Subdivision-Specialized Linear Algebra Kernels for Static and Dynamic Mesh Connectivity on the GPU}},
author = {Mlakar, Daniel and Winter, Martin and Stadlbauer, Pascal and Seidel, Hans-Peter and Steinberger, Markus and Zayer, Rhaleb},
year = {2020},
publisher = {The Eurographics Association and John Wiley \& Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.13934}
}
Abstract: With the introduction of hardware-supported ray tracing and deep learning
for denoising, computer graphics has made a considerable step toward real-time global illumination.
In this work, we present an alternative global illumination method: the stochastic substitute tree
(SST), a hierarchical structure inspired by lightcuts with light probability distributions as inner
nodes. Our approach distributes virtual point lights (VPLs) in every frame and efficiently constructs
the SST over those lights by clustering according to Morton codes. Global illumination is approximated
by sampling the SST and considers the BRDF at the hit location as well as the SST nodes’ intensities
for importance sampling directly from inner nodes of the tree. To remove the introduced Monte Carlo
noise, we use a recurrent autoencoder. In combination with temporal filtering, we deliver real-time
global illumination for complex scenes with challenging light distributions.
Symposium on Interactive 3D Graphics and Games (I3D '20), 2020
@inproceedings{10.1145/3384382.3384521,
author = {Tatzgern, Wolfgang and Mayr, Benedikt and Kerbl, Bernhard and Steinberger, Markus},
title = {Stochastic Substitute Trees for Real-Time Global Illumination},
year = {2020},
isbn = {9781450375894},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3384382.3384521},
booktitle = {Symposium on Interactive 3D Graphics and Games},
articleno = {2},
numpages = {9}
}
Abstract: Sparse general matrix-matrix multiplication on GPUs is challenging due to
the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance
for certain types of matrices, but fail to accelerate all kinds of matrices in the same manner.
Our approach combines multiple strategies with dynamic parameter selection to dynamically choose
and tune the best fitting algorithm for each row of the matrix. This choice is supported by a
lightweight, multi-level matrix analysis, which carefully balances analysis cost and expected
performance gains. Our evaluation on thousands of matrices with various characteristics shows
that we outperform all currently available solutions on 79% of all matrices with more than 15k products
and achieve the second-best performance on another 15%. For these matrices, our solution is on
average 83% faster than the second-best approach and up to 25X faster than other state-of-the-art
GPU implementations. Using our approach, applications can expect great performance independent of
the matrices they work on.
Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '20), 2020
@inproceedings{10.1145/3332466.3374521,
author = {Parger, Mathias and Winter, Martin and Mlakar, Daniel and Steinberger, Markus},
title = {SpECK: Accelerating GPU Sparse Matrix-Matrix Multiplication through Lightweight Analysis},
year = {2020},
isbn = {9781450368186},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3332466.3374521},
doi = {10.1145/3332466.3374521},
booktitle = {Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming},
pages = {362--375},
numpages = {14},
location = {San Diego, California},
series = {PPoPP '20}
}
Abstract: Potential visibility has historically been of importance whenever rendering
performance was insufficient. With the rise of virtual reality, rendering power may once again be
insufficient, e.g., for integrated graphics of head-mounted displays. To tackle the issue of efficient
potential visibility computations on modern graphics hardware, we introduce the camera offset space
(COS). In contrast to how traditional visibility computations work---where one determines which pixels
are covered by an object under all potential viewpoints---the COS describes under which camera movement
a sample location is covered by a triangle. In this way, the COS opens up a new set of possibilities
for visibility computations. By evaluating the pairwise relations of triangles in the COS, we show
how to efficiently determine occluded triangles. Constructing the COS for all pixels of a rendered
view leads to a complete potentially visible set (PVS) for complex scenes. By fusing triangles to
larger occluders, including locations between pixel centers, and considering camera rotations, we
describe an exact PVS algorithm that includes all viewing directions inside a view cell. Implementing
the COS is a combination of real-time rendering and compute steps. We provide the first GPU PVS
implementation that works without preprocessing, on-the-fly, on unconnected triangles. This opens
the door to a new approach of rendering for virtual reality head-mounted displays and server-client
settings for streaming 3D applications such as video games.
ACM Transactions on Graphics (SIGGRAPH Asia'19), 2019
@article{Hladky:2019:COS:3355089.3356530,
author = {Hladky, Jozef and Seidel, Hans-Peter and Steinberger, Markus},
title = {The Camera Offset Space: Real-time Potentially Visible Set Computations for Streaming Rendering},
journal = {ACM Trans. Graph.},
issue_date = {November 2019},
volume = {38},
number = {6},
month = nov,
year = {2019},
issn = {0730-0301},
pages = {231:1--231:14},
articleno = {231},
numpages = {14},
url = {http://doi.acm.org/10.1145/3355089.3356530},
doi = {10.1145/3355089.3356530},
acmid = {3356530},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {GPU, potentially visible set, real-time rendering, streaming rendering, virtual reality, visibility},
}
Abstract:
Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer
networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this
paper, we present an efficient algorithm that performs well on a variety of different graphs. As part of this, we look into
utilizing dynamic parallelism in order to both reduce the overhead from latency between the CPU and GPU and speed up the
algorithm itself. Lastly, we integrate the algorithm into the faimGraph framework for dynamic graphs and examine its relative
performance compared to a Compressed-Sparse-Row data structure. We show that our algorithm can be well adapted to the dynamic setting and
outperforms another competing dynamic graph framework on our test set.
High Performance Extreme Computing, 2019
@INPROCEEDINGS{8916476,
author={Dominik Tödling and Martin Winter and Markus Steinberger},
booktitle={2019 IEEE High Performance Extreme Computing Conference (HPEC)},
title={Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU},
year={2019},
pages={1--7},
doi={10.1109/HPEC.2019.8916476},
ISSN={2377-6943},
month={Sep.},
}
Abstract: Streaming high quality rendering for virtual reality applications requires minimizing perceived latency.
Shading Atlas Streaming (SAS) [Mueller et al. 2018] is a novel object-space rendering framework suitable for streaming virtual
reality content. SAS decouples server-side shading from client-side rendering, allowing the client to perform framerate upsampling
and latency compensation autonomously for short periods of time. The shading information created by the server in object space is
temporally coherent and can be efficiently compressed using standard MPEG encoding. SAS compares favorably to previous methods for
remote image-based rendering in terms of image quality and network bandwidth efficiency. SAS allows highly efficient parallel
allocation in a virtualized-texture-like memory hierarchy, solving a common efficiency problem of object-space shading. With SAS,
untethered virtual reality headsets can benefit from high quality rendering without paying in increased latency. Visitors will be
able to try SAS by roaming the exhibit area wearing a Snapdragon 845 based headset connected via consumer WiFi.
ACM SIGGRAPH 2019 Emerging Technologies, 2019
@inproceedings{Mueller:2019:SAS:3305367.3327981,
author = {Mueller, Joerg H. and Neff, Thomas and Voglreiter, Philip and Makar, Mina and Steinberger, Markus and Schmalstieg, Dieter},
title = {Shading Atlas Streaming Demonstration},
booktitle = {ACM SIGGRAPH 2019 Emerging Technologies},
series = {SIGGRAPH '19},
year = {2019},
isbn = {978-1-4503-6308-2},
location = {Los Angeles, California},
pages = {22:1--22:2},
articleno = {22},
numpages = {2},
url = {http://doi.acm.org/10.1145/3305367.3327981},
doi = {10.1145/3305367.3327981},
acmid = {3327981},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {object-space shading, shading, streaming, texture atlas, virtual reality},
}
Abstract:
Presenting high-fidelity 3D content on compact portable devices with low computational power is challenging. Smartphones,
tablets and head-mounted displays (HMDs) suffer from thermal and battery-life constraints and thus cannot match the render
quality of desktop PCs and laptops. Streaming rendering makes it possible to show high-quality content but can suffer from potentially
high latency. We propose an approach to efficiently capture shading samples in object space and pack them into a texture.
Streaming this texture to the client, we support temporal frame up-sampling with high fidelity, low latency and high mobility.
We introduce two novel sample distribution strategies and a novel triangle representation in the shading atlas space. Since such
a system requires dynamic parallelism, we propose an implementation exploiting the power of hardware-accelerated tessellation
stages. Our approach allows fast decoding and rendering of extrapolated views on a client device by using hardware-accelerated
interpolation between shading samples and a set of potentially visible geometry. A comparison to existing shading
methods shows that our sample distributions allow better client shading quality than previous atlas streaming approaches and
outperform image-based methods in all relevant aspects.
Computer Graphics Forum / Eurographics Symposium on Rendering (EGSR'19), 2019
@article {Hladky:2019:TSS:10.1111:cgf.13780,
journal = {Computer Graphics Forum},
title = {{Tessellated Shading Streaming}},
author = {Hladky, Jozef and Seidel, Hans-Peter and Steinberger, Markus},
year = {2019},
publisher = {The Eurographics Association and John Wiley \& Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/cgf.13780}
}
Abstract:
A system utilizing specified occlusion techniques to reduce the overall amount of occlusion computations required to generate
an image in a changing viewpoint environment. Viewpoint location changes about a current viewpoint can result in a different
image view with the potential exposure of previously occluded portions of a scene image that would now be visible from that
new viewpoint location. To reduce the amount of occlusion computations required to render an associated image in a changing
viewpoint environment, techniques are described that reduce occlusion computations using, for example, one or more trim region
rendering techniques. Some techniques include generating potential visible set (PVS) information based on a viewport including
an original viewpoint location and an anticipated change in viewpoint location.
US Patent App. 15/867,401, 2019
@misc{schmalstieg2019accelerated,
title={Accelerated occlusion computation},
author={Schmalstieg, Dieter and Steinberger, Markus and Voglreiter, Philip},
year={2019},
month=jul # "~11",
publisher={Google Patents},
note={US Patent App. 15/867,401}
}
Abstract:
As virtual 3D worlds continue to expand in size, a growing number of exploratory games and simulations aim to visualize entire planets
and galaxies in real-time. However, consistent rendering of the whole planetary surface at varying levels of detail poses a number of challenges.
As the viewer transitions from observing a planet from a distance to the ground level, fine-granular details manifest in its terrain geometry.
For presenting these varying amounts of detail, a suitable solution must consider how the geometry should behave to ensure seamless accommodation
for all frames of reference. Instead of storing a complete model of the planetary surface, more efficient alternatives employ structured base
meshes of limited complexity, which can then be enhanced with elevation data from a height field. Based on this concept, we propose a new method
for applying terrain data to spherical surfaces of arbitrary scale. Our approach uses a uniform grid that is mapped to a paraboloid and can be
offset to influence the distribution of height field samples on the sphere. Starting from its initial layout, we show how the grid can be altered
at runtime to avoid a variety of notorious artifacts during camera motion. We further derive suitable grid offset parameters to ensure correct
visibility of distant objects behind the horizon. By decoupling the mesh geometry from the height field data and display resolution, the
overhead of our method can be bounded and is subject only to the grid configuration. Compared to previous approaches, our solution is easy to
implement, requires no preprocessing and achieves reduced runtime for rendering scenes at a similar quality.
I3D 2019 Best Poster Award. Symposium on Interactive 3D Graphics and Games (I3D), 2019
J28
Mark Dokter, Jozef Hladky, Mathias Parger, Dieter Schmalstieg, Hans-Peter Seidel, Markus Steinberger:
Abstract:
In this paper, we introduce the CPatch, a curved primitive that can be used to construct arbitrary vector graphics. A CPatch
is a generalization of a 2D polygon: Any number of curves up to a cubic degree bound a primitive. We show that a CPatch can
be rasterized efficiently in a hierarchical manner on the GPU, locally discarding irrelevant portions of the curves. Our
rasterizer is fast and scalable, works on all patches in parallel, and does not require any approximations. We show a parallel
implementation of our rasterizer, which naturally supports all kinds of color spaces, blending and super-sampling. Additionally,
we show how vector graphics input can efficiently be converted to a CPatch representation, solving challenges like patch
self-intersections and false inside-outside classification. Results indicate that our approach is faster than the state of the art,
more flexible and could potentially be implemented in hardware.
Computer Graphics Forum / Eurographics (EG'19), 2019
@article{Dokter:2019:HRC:10.1111/cgf.13622,
author = {Dokter, Mark and Hladky, Jozef and Parger, Mathias and Schmalstieg, Dieter and Seidel, Hans-Peter and Steinberger, Markus},
title = {Hierarchical Rasterization of Curved Primitives for Vector Graphics Rendering on the GPU},
journal = {Computer Graphics Forum},
volume = {38},
number = {2},
pages = {93-103},
doi = {10.1111/cgf.13622},
url = {https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13622},
eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1111/cgf.13622},
year = {2019}
}
Abstract:
In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM)
has received considerably less attention than sparse matrix-vector multiplication (SpMV). While both are equally important,
this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM.
In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly
on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations,
our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves
memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes
the time during which the efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern
guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on
an extensive sparse matrix benchmark suggests that our approach is the fastest SpGEMM implementation for highly sparse
matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set.
Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019
@inproceedings{Winter:2019:ASM:3293883.3295701,
author = {Winter, Martin and Mlakar, Daniel and Zayer, Rhaleb and Seidel, Hans-Peter and Steinberger, Markus},
title = {Adaptive Sparse Matrix-matrix Multiplication on the GPU},
booktitle = {Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming},
series = {PPoPP '19},
year = {2019},
isbn = {978-1-4503-6225-2},
location = {Washington, District of Columbia},
pages = {68--81},
numpages = {14},
url = {http://doi.acm.org/10.1145/3293883.3295701},
doi = {10.1145/3293883.3295701},
acmid = {3295701},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {ESC, GPU, SpGEMM, adaptive, bit-stable, sparse matrix},
}
Abstract: Mimicking natural tessellation patterns is a fascinating multi-disciplinary
problem. Geometric methods aiming at reproducing such partitions on surface
meshes are commonly based on the Voronoi model and its variants, and
are often faced with challenging issues such as metric estimation, geometric,
topological complications, and most critically, parallelization. In this paper,
we introduce an alternate model which may be of value for resolving these
issues. We drop the assumption that regions need to be separated by lines.
Instead, we regard region boundaries as narrow bands and we model the
partition as a set of smooth functions layered over the surface. Given an
initial set of seeds or regions, the partition emerges as the solution of a
time-dependent set of partial differential equations describing concurrently
evolving fronts on the surface. Our solution does not require geodesic estimation,
elaborate numerical solvers, or complicated bookkeeping data
structures. The cost per time-iteration is dominated by the multiplication
and addition of two sparse matrices. Extension of our approach in a Lloyd’s
algorithm fashion can be easily achieved and the extraction of the dual
mesh can be conveniently performed in parallel through matrix algebra. As
our approach relies mainly on basic linear algebra kernels, it lends itself to
efficient implementation on modern graphics hardware.
ACM Transactions on Graphics (SIGGRAPH Asia'18), 2018
@article{Zayer:2018:LFN,
author = {Zayer, Rhaleb and Mlakar, Daniel and Steinberger, Markus and Seidel, Hans-Peter},
title = {Layered Fields for Natural Tessellations on Surfaces},
journal = {ACM Trans. Graph.},
issue_date = {November 2018},
volume = {37},
number = {6},
month = nov,
year = {2018},
articleno = {264},
numpages = {15},
doi = {10.1145/3272127.3275072},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: Streaming high quality rendering for virtual reality applications requires
minimizing perceived latency. We introduce Shading Atlas Streaming (SAS),
a novel object-space rendering framework suitable for streaming virtual reality
content. SAS decouples server-side shading from client-side rendering,
allowing the client to perform framerate upsampling and latency compensation
autonomously for short periods of time. The shading information
created by the server in object space is temporally coherent and can be
efficiently compressed using standard MPEG encoding. Our results show
that SAS compares favorably to previous methods for remote image-based
rendering in terms of image quality and network bandwidth efficiency. SAS
allows highly efficient parallel allocation in a virtualized-texture-like memory
hierarchy, solving a common efficiency problem of object-space shading.
With SAS, untethered virtual reality headsets can benefit from high quality
rendering without paying in increased latency.
ACM Transactions on Graphics (SIGGRAPH Asia'18), 2018
@article{Mueller:2018:SAS,
author = {Mueller, Joerg H. and Voglreiter, Philip and Dokter, Mark and Neff, Thomas and Makar, Mina and Steinberger, Markus and Schmalstieg, Dieter},
title = {Shading Atlas Streaming},
journal = {ACM Trans. Graph.},
issue_date = {November 2018},
volume = {37},
number = {6},
month = nov,
year = {2018},
articleno = {199},
numpages = {16},
doi = {10.1145/3272127.3275087},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: Having a virtual body can increase embodiment in virtual reality
(VR) applications. However, consumer-grade VR falls short of
delivering sufficient sensory information for full-body motion capture.
Consequently, most current VR applications do not even show
arms, although they are often in the field of view. We address this
shortcoming with a novel human upper-body inverse kinematics
algorithm specifically targeted at tracking from head and hand sensors
only. We present heuristics for elbow positioning depending
on the shoulder-to-hand distance and for avoiding unnatural
joint limits. Our results show that our method increases the
accuracy compared to general inverse kinematics applied to human
arms with the same tracking input. In a user study, participants
preferred our method over displaying disembodied hands without
arms, but also over a more expensive motion capture system. In particular,
our study shows that virtual arms animated with our inverse
kinematics system can be used for applications involving heavy
arm movement. We demonstrate that our method can not only
be used to increase embodiment, but can also support interaction
involving arms or shoulders, such as holding up a shield.
Symposium on Virtual Reality Software and Technology (VRST ’18), 2018
@inproceedings{Parger:2018:HUI,
author = {Parger, Mathias and Mueller, Joerg H. and Schmalstieg, Dieter and Steinberger, Markus},
title = {Human Upper-Body Inverse Kinematics for Increased Embodiment in Consumer-Grade Virtual Reality},
booktitle = {Symposium on Virtual Reality Software and Technology},
series = {VRST '18},
year = {2018},
pages = {10},
isbn = {978-1-4503-6086-9/18/11},
doi = {10.1145/3281505.3281529},
location = {Tokyo, Japan},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: In this paper, we present a fully-dynamic graph data
structure for the Graphics Processing Unit (GPU). It delivers
high update rates while keeping a low memory footprint using
autonomous memory management directly on the GPU. The
data structure is fully-dynamic, allowing not only for edge
but also vertex updates. Performing the memory management
on the GPU allows for fast initialization times and efficient
update procedures without additional intervention or reallocation
procedures from the host. Our optimized approach performs
initialization completely in parallel; up to 300x faster compared
to previous work. It achieves up to 200 million edge updates per
second for sorted and unsorted update batches; up to 30x faster
than previous work. Furthermore, it can perform more than
300 million adjacency queries and millions of vertex updates per
second. On account of efficient memory management techniques
like a queuing approach, currently unused memory is reused later
on by the framework, permitting the storage of tens of millions
of vertices and hundreds of millions of edges in GPU memory.
We evaluate algorithmic performance using a PageRank and a
Static Triangle Counting (STC) implementation, demonstrating
the suitability of the framework even for memory access intensive
algorithms.
High Performance Computing, Networking, Storage and Analysis (SC’18), 2018
@inproceedings{Winter:2018:faimGraph,
author = {Winter, Martin and Mlakar, Daniel and Zayer, Rhaleb and Seidel, Hans-Peter and Steinberger, Markus},
title = {faimGraph: High Performance Management of Fully-Dynamic Graphs under tight Memory Constraints on the GPU},
booktitle = {High Performance Computing, Networking, Storage and Analysis},
series = {SC '18},
year = {2018},
isbn = {978-1-5386-8384-2/18},
location = {Dallas, Texas, USA},
}
Abstract: In this paper, we present a real-time graphics pipeline implemented entirely
in software on a modern GPU. As opposed to previous work, our approach
features a fully-concurrent, multi-stage, streaming design with dynamic
load balancing, capable of operating efficiently within bounded memory. We
address issues such as primitive order, vertex reuse, and screen-space
derivatives of dependent variables, which are essential to real-world applications,
but have largely been ignored by comparable work in the past. The power of
a software approach lies in the ability to tailor the graphics pipeline to any
given application. In exploration of this potential, we design and implement
four novel pipeline modifications. Evaluation of the performance of our
approach on more than 100 real-world scenes collected from video games
shows rendering speeds within one order of magnitude of the hardware
graphics pipeline as well as significant improvements over previous work,
not only in terms of capabilities and performance, but also robustness.
ACM Transactions on Graphics (SIGGRAPH'18), 2018
@article{Kenzel:2018:CURE,
author = {Kenzel, Michael and Kerbl, Bernhard and Schmalstieg, Dieter and Steinberger, Markus},
title = {A High-Performance Software Graphics Pipeline Architecture for the GPU},
journal = {ACM Trans. Graph.},
issue_date = {November 2018},
volume = {37},
number = {4},
month = nov,
year = {2018},
articleno = {140},
numpages = {15},
doi = {10.1145/3197517.3201374},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: In this paper, we question the premise that graphics hardware uses a post-transform cache to avoid redundant
vertex shader invocations. A large body of existing work on optimizing indexed triangle sets for rendering
speed is based upon this widely-accepted assumption. We conclusively show that this assumption does not
hold up on modern graphics hardware. We design and conduct experiments that demonstrate the behavior
of current hardware of all major vendors to be inconsistent with the presence of a common post-transform
cache. Our results strongly suggest that modern hardware rather relies on a batch-based approach, most
likely for reasons of scalability. A more thorough investigation based on these initial experiments allows us
to partially uncover the actual strategies implemented on graphics processors today. We reevaluate existing
mesh optimization algorithms in light of these new findings and present a new mesh optimization algorithm
designed from the ground up to target architectures that rely on batch-based vertex reuse. In an extensive
evaluation, we measure and compare the real-world performance of various optimization algorithms on
modern hardware. Our results show that some established algorithms still perform well. However, if the
batching strategy of the target architecture is known, our approach can significantly outperform these previous
state-of-the-art methods.
Proceedings of the ACM on Computer Graphics and Interaction Techniques (HPG'18), 2018
@article{Kerbl:2018:SR,
author = {Kerbl, Bernhard and Kenzel, Michael and Ivanchenko, Elena and Schmalstieg, Dieter and Steinberger, Markus},
title = {Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU},
journal = {Proc. ACM Comput. Graph. Interact. Tech.},
issue_date = {August 2018},
volume = {1},
number = {2},
month = aug,
year = {2018},
articleno = {29},
numpages = {16},
doi = {10.1145/3233302},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: Due to its flexibility, compute mode is becoming more and more attractive as a way to implement many of the
algorithms that form part of a state-of-the-art rendering pipeline. A key problem commonly encountered in graphics
applications is streaming vertex and geometry processing. In a typical triangle mesh, the same vertex is on
average referenced six times. To avoid redundant computation during rendering, a post-transform cache is
traditionally employed to reuse vertex processing results. However, such a vertex cache can generally not be
implemented efficiently in software and does not scale well as parallelism increases. We explore alternative
strategies for reusing per-vertex results on-the-fly during massively-parallel software geometry processing.
Given an input stream divided into batches, we analyze the effectiveness of sorting, hashing, and intra-thread-group
communication for identifying and exploiting local reuse potential. We design and present four vertex
reuse strategies tailored to modern GPU architectures. We demonstrate that, in a variety of applications, these
strategies not only achieve effective reuse of vertex processing results, but can boost performance by up to
2–3x compared to a naïve approach. Curiously, our experiments also show that our batch-based approaches
exhibit behavior similar to the OpenGL implementation on current graphics hardware.
Proceedings of the ACM on Computer Graphics and Interaction Techniques (HPG'18), 2018
@article{Kenzel:2018:OVR,
author = {Kenzel, Michael and Kerbl, Bernhard and Tatzgern, Wolfgang and Ivanchenko, Elena and Schmalstieg, Dieter and Steinberger, Markus},
title = {On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing},
journal = {Proc. ACM Comput. Graph. Interact. Tech.},
issue_date = {August 2018},
volume = {1},
number = {2},
month = aug,
year = {2018},
articleno = {28},
numpages = {17},
doi = {10.1145/3233303},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: During the last decade we have witnessed a severe change in computing, as processor clock-rates stopped increasing. Thus, arguably the only way to increase processing power is switching to a parallel computing architecture, like the graphics processing unit (GPU). While a GPU offers tremendous processing power, harnessing this power is often difficult. In our research we tackle this issue, providing various components to allow a wider class of algorithms to execute efficiently on the GPU. These efforts include new processing models for dynamic algorithms with various degrees of parallelism, a versatile task scheduler based on highly efficient work queues that also support dynamic priority scheduling, and efficient dynamic memory management. Our scheduling strategies advance the state-of-the-art algorithms in the field of rendering, visualization, and geometric modeling. In the field of rendering, we provide algorithms that can significantly speed-up image generation, assigning more processing power to the most important image regions. In the field of geometric modeling we provide the first GPU-based grammar evaluation system that can generate and render cities in real-time which otherwise take hours to generate and could not fit into GPU memory. Finally, we show that mesh processing algorithms can be computed significantly faster on the GPU when parallelizing them with advanced scheduling strategies.
IEEE Computer Graphics and Applications, 2018
@article{steinberger2018dynamic,
title={On Dynamic Scheduling for the GPU and its Applications in Computer Graphics and Beyond},
author={Steinberger, Markus},
journal={IEEE computer graphics and applications},
volume={38},
number={3},
pages={119--130},
year={2018},
publisher={IEEE}
}
Abstract: Harnessing the power of massively parallel devices like the graphics
processing unit (GPU) is difficult for algorithms that show dynamic
or inhomogeneous workloads. To achieve high performance, such
advanced algorithms require scalable, concurrent queues to collect
and distribute work. We show that previous queuing approaches are
unfit for this task, as they either (1) do not work well in a massively
parallel environment, or (2) obstruct the use of individual threads
on top of single-instruction-multiple-data (SIMD) cores, or (3) block
during access, thus prohibiting multi-queue setups. With these
issues in mind, we present the Broker Queue, a highly efficient, fully
linearizable FIFO queue for fine-granular parallel work distribution
on the GPU. We evaluate its performance and usability on modern
GPU models against a wide range of existing algorithms. The Broker
Queue is up to three orders of magnitude faster than non-blocking
queues and can even outperform significantly simpler techniques
that lack desired properties for fine-granular work distribution.
International Conference on Supercomputing (ICS'18), 2018
@inproceedings{Kerbl:2018:BQA:3205289.3205291,
title={The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU},
author={Kerbl, Bernhard and Kenzel, Michael and Mueller, Joerg H and Schmalstieg, Dieter and Steinberger, Markus},
booktitle = {Proceedings of the International Conference on Supercomputing},
series = {ICS '18},
year = {2018},
location = {Beijing, China},
numpages = {10},
url = {http://doi.org/10.1145/3205289.3205291},
doi = {10.1145/3205289.3205291},
keywords = {GPU, queuing, concurrent, parallel, scheduling},
}
Abstract: Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We present a new concurrent work queue, the Broker Queue, a highly efficient, linearizable queue for fine-granular work distribution on the GPU. We evaluate its usability and benefits in contrast to existing queuing algorithms. Our queue is up to one order of magnitude faster than non-blocking queues, and outperforms simpler queue designs that are unfit for fine-granular work distribution.
ACM SIGPLAN Notices (PPoPP'18), 2018
@article{Kerbl:2018:SQW:3200691.3178526,
author = {Kerbl, Bernhard and M\"{u}ller, J\"{o}rg and Kenzel, Michael and Schmalstieg, Dieter and Steinberger, Markus},
title = {A Scalable Queue for Work Distribution on GPUs},
journal = {SIGPLAN Not.},
issue_date = {January 2018},
volume = {53},
number = {1},
month = feb,
year = {2018},
issn = {0362-1340},
pages = {401--402},
numpages = {2},
url = {http://doi.acm.org/10.1145/3200691.3178526},
doi = {10.1145/3200691.3178526},
acmid = {3178526},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {GPU, concurrent, parallel, queuing, scheduling},
}
Abstract: The numerical treatment of variational problems gives rise to large sparse matrices which are
typically assembled by coalescing elementary contributions. As the explicit matrix form is required by direct solvers,
the assembly step can be a potential bottleneck, especially in implicit and time-dependent settings where considerable
updates are needed. On standard HPC platforms, this process can be vectorized by taking advantage of additional mesh
querying data structures. However, on graphics hardware, vectorization is inhibited by limited memory resources. To
address this issue, we propose a lean unstructured mesh representation which allows casting the assembly problem as a
sparse matrix-matrix multiplication. In this way, the underlying graph connectivity of the assembled matrix is captured
through basic matrix algebra operations. Furthermore, we outline ideas for cutting down on data movement by lowering
the memory footprint of the proposed representation and we analyze the effect of its memory layout on the assembly
performance. As the underlying machinery behind our approach consists mainly of key linear algebra primitives, we
foresee potential portability opportunities for our approach to various architectures.
High Performance Extreme Computing, 2017
@inproceedings{zayer2017sparse,
author={Zayer, Rhaleb and Steinberger, Markus and Seidel, Hans-Peter},
booktitle={2017 IEEE High Performance Extreme Computing Conference (HPEC)},
title={Sparse matrix assembly on the GPU through multiplication patterns},
year={2017},
pages={1-8},
month={Sept},
}
Abstract: In this paper, we present a new, dynamic graph data structure, built to deliver
high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU.
By transferring the memory management to the GPU, efficient updating of the graph structure and fast
initialization times are enabled as no additional memory allocation calls or reallocation procedures are
necessary since they are handled directly on the device.
In comparison to previous work, this optimized approach allows for significantly lower initialization
times (up to 300x faster) and much higher update rates for significant changes to the graph structure
and equal rates for small changes. The framework provides different update implementations tailored
specifically to different graph properties, enabling over 100 million updates per second and
keeping tens of millions of vertices and hundreds of millions of edges in memory without transferring
data back and forth between device and host.
HPEC'17 Best Student Paper. High Performance Extreme Computing, 2017
@inproceedings{winter2017autonomous,
title={Autonomous, Independent Management of Dynamic Graphs on GPUs},
author={Winter, Martin and Zayer, Rhaleb and Steinberger, Markus},
booktitle={High Performance Extreme Computing Conference (HPEC), 2017 IEEE},
pages={1--8},
year={2017},
organization={IEEE},
month={Sept}
}
Abstract: To effectively utilize an ever increasing number of processors during parallel rendering, hardware and software
designers rely on sophisticated load balancing strategies. While dynamic load balancing is a powerful solution, it requires complex
work distribution and synchronization mechanisms. Graphics hardware manufacturers have opted to employ static load balancing strategies
instead. Specifically, triangle data is distributed to processors based on its overlap with screen-space tiles arranged in a fixed pattern.
While the current strategy of using simple patterns for a small number of fast rasterizers achieves formidable performance, it is
questionable how this approach will scale as the number of processors increases further. To address this issue, we analyze real-world
rendering workloads, derive requirements for effective patterns, and present ten different pattern design strategies based on these
requirements. In addition to a theoretical evaluation of these design strategies, we compare the performance of select patterns in a
parallel sort-middle software rendering pipeline on an extensive set of triangle data captured from eight recent video games. As a result,
we are able to identify a set of patterns that scale well and exhibit significantly improved performance over naïve approaches.
High Performance Graphics (HPG '17), 2017
@inproceedings{Kerbl:2017:ESB:3105762.3105777,
author = {Kerbl, Bernhard and Kenzel, Michael and Schmalstieg, Dieter and Steinberger, Markus},
title = {Effective Static Bin Patterns for Sort-middle Rendering},
booktitle = {Proceedings of High Performance Graphics},
series = {HPG '17},
year = {2017},
isbn = {978-1-4503-5101-0},
location = {Los Angeles, California},
pages = {14:1--14:10},
articleno = {14},
numpages = {10},
url = {http://doi.acm.org/10.1145/3105762.3105777},
doi = {10.1145/3105762.3105777},
acmid = {3105777},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {GPU, parallel rendering, pattern, sort-middle, static load balancing},
}
Abstract: The rising popularity of the graphics processing unit (GPU) across various numerical computing applications
triggered a breakneck race to optimize key numerical kernels and in particular, the sparse matrix-vector product (SpMV). Despite
great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing,
exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access patterns. In this
paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the
matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits a
homogeneous behavior reflected in efficient memory access patterns and steady per-thread workload. From a local perspective, it avoids
heterogeneous execution paths by adapting its behavior to the work load at hand, it uses an efficient encoding to keep temporary data
requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices
comparing to vendor provided, and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches
directly operating on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix even
when preprocessing time is discarded. Additionally, the same strategies lead to significant performance increase when adapted for
transpose SpMV.
International Conference on Supercomputing (ICS'17), 2017
@inproceedings{Steinberger:2017:GHL:3079079.3079086,
author = {Steinberger, Markus and Zayer, Rhaleb and Seidel, Hans-Peter},
title = {Globally Homogeneous, Locally Adaptive Sparse Matrix-vector Multiplication on the GPU},
booktitle = {Proceedings of the International Conference on Supercomputing},
series = {ICS '17},
year = {2017},
isbn = {978-1-4503-5020-4},
location = {Chicago, Illinois},
pages = {13:1--13:11},
articleno = {13},
numpages = {11},
url = {http://doi.acm.org/10.1145/3079079.3079086},
doi = {10.1145/3079079.3079086},
acmid = {3079086},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {GPU, SpMV, linear algebra, sparse matrix},
}
Abstract: We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing
unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses reduced bit length
for addressing the entries. This allows a smaller memory footprint than standard formats. Executing algorithms on a hierarchical
structure on the GPU usually entails significant synchronization and management overhead or slowdowns due to diverging execution
paths and memory access patterns. We address these issues by means of a dynamic scheduling strategy specifically designed for
executing algorithms on top of a hierarchical matrix on the GPU. The evaluation of our implementation of basic linear algebra
routines suggests that our hierarchical format is competitive with highly optimized standard libraries and significantly outperforms
them in the case of transpose matrix operations. The results point towards the viability of hierarchical matrix formats on massively
parallel devices such as the GPU.
International Conference on Supercomputing (ICS'17), 2017
@inproceedings{Derler:2017:DSE:3079079.3079085,
author = {Derler, Andreas and Zayer, Rhaleb and Seidel, Hans-Peter and Steinberger, Markus},
title = {Dynamic Scheduling for Efficient Hierarchical Sparse Matrix Operations on the GPU},
booktitle = {Proceedings of the International Conference on Supercomputing},
series = {ICS '17},
year = {2017},
isbn = {978-1-4503-5020-4},
location = {Chicago, Illinois},
pages = {7:1--7:10},
articleno = {7},
numpages = {10},
url = {http://doi.acm.org/10.1145/3079079.3079085},
doi = {10.1145/3079079.3079085},
acmid = {3079085},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {GPU, hierarchical, linear algebra, sparse matrix},
}
Abstract: In this paper, we show that genetic algorithms (GA) can be used to control the output of procedural modeling algorithms. We
propose an efficient way to encode the choices that have to be made during a procedural generation as a hierarchical genome
representation. In combination with mutation and reproduction operations specifically designed for controlled procedural
modeling, our GA can evolve a population of individual models close to any high-level goal. Possible scenarios include a volume
that should be filled by a procedurally grown tree or a painted silhouette that should be followed by the skyline of a procedurally
generated city. These goals are easy to set up for an artist compared to the tens of thousands of variables that describe the
generated model and are chosen by the GA. Previous approaches for controlled procedural modeling either use Reversible Jump
Markov Chain Monte Carlo (RJMCMC) or Stochastically-Ordered Sequential Monte Carlo (SOSMC) as the workhorse for the
optimization. While RJMCMC converges slowly, requiring multiple hours for the optimization of larger models, it produces high
quality models. SOSMC shows faster convergence under tight time constraints for many models, but can get stuck due to choices
made in the early stages of optimization. Our GA shows faster convergence than SOSMC and generates better models than
RJMCMC in the long run.
Computer Graphics Forum / Eurographics (EG'17), 2017
@inproceedings{haubenwallner2017shapegenetics,
title={ShapeGenetics: Using Genetic Algorithms for Procedural Modeling},
author={Haubenwallner, Karl and Seidel, Hans-Peter and Steinberger, Markus},
booktitle={Computer Graphics Forum},
volume={36},
number={2},
year={2017},
organization={Wiley-Blackwell}
}
Abstract: A key advantage of working with structured grids (e.g., images) is the ability to directly tap into the powerful machinery of linear
algebra. This is much less so for unstructured grids, where intermediate bookkeeping data structures stand in the way. On modern
high performance computing hardware, the conventional wisdom behind these intermediate structures is further challenged by
costly memory access, and more importantly by prohibitive memory resources on environments such as graphics hardware.
In this paper, we bypass this problem by introducing a sparse matrix representation for unstructured grids which not only
reduces the memory storage requirements but also cuts down on the bulk of data movement from global storage to the compute
units. In order to take full advantage of the proposed representation, we augment ordinary matrix multiplication by means of
action maps, local maps which encode the desired interaction between grid vertices. In this way, geometric computations and
topological modifications translate into concise linear algebra operations. In our algorithmic formulation, we capitalize on the
nature of sparse matrix-vector multiplication which allows avoiding explicit transpose computation and storage. Furthermore,
we develop an efficient vectorization to the demanding assembly process of standard graph and finite element matrices.
Computer Graphics Forum / Eurographics (EG'17), 2017
@inproceedings{zayer2017gpu,
title={A GPU-adapted Structure for Unstructured Grids},
author={Zayer, Rhaleb and Steinberger, Markus and Seidel, Hans-Peter},
booktitle={Computer Graphics Forum},
volume={36},
number={2},
year={2017},
organization={Wiley-Blackwell}
}
Abstract: While the modern graphics processing unit (GPU) offers massive parallel compute power, the ability to influence the scheduling of these immense resources is severely limited. Therefore, the GPU is widely considered to be only suitable as an externally controlled co-processor for homogeneous workloads which greatly restricts the potential applications of GPU computing. To address this issue, we present a new method to achieve fine-grained priority scheduling on the GPU: hierarchical bucket queuing. By carefully distributing the workload among multiple queues and efficiently deciding which queue to draw work from next, we enable a variety of scheduling strategies. These strategies include fair-scheduling, earliest-deadline-first scheduling and user-defined dynamic priority scheduling. In a comparison with a sorting-based approach, we reveal the advantages of hierarchical bucket queuing over previous work. Finally, we demonstrate the benefits of using priority scheduling in real-world applications by example of path tracing and foveated micropolygon rendering.
Computer Graphics Forum, 2016
@article {Kerbl:2016:HBQ,
author = {Kerbl, Bernhard and Kenzel, Michael and Schmalstieg, Dieter and Seidel, Hans-Peter and Steinberger, Markus},
title = {Hierarchical Bucket Queuing for Fine-Grained Priority Scheduling on the GPU},
journal = {Computer Graphics Forum},
issn = {1467-8659},
url = {http://dx.doi.org/10.1111/cgf.13075},
doi = {10.1111/cgf.13075},
year = {2016},
}
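As a rough illustration of the core idea, drawing work from the highest-priority non-empty bucket, here is a drain-only CUDA sketch. It omits enqueuing, the hierarchy, and the paper's actual queue design; the bucket count, capacities, and integer work items are assumptions made for the example.

#include <cstdio>

// Drain-only sketch of bucket-based priority dequeuing: buckets are
// pre-filled on the host, and each thread atomically claims the next
// item from the highest-priority non-empty bucket.
const int BUCKETS = 3, CAP = 8;

struct Buckets
{
    int items[BUCKETS][CAP];
    int count[BUCKETS];   // items filled per bucket (host-initialized)
    int head[BUCKETS];    // next item to hand out (device-side atomic)
};

__device__ bool dequeue(Buckets* q, int* item)
{
    for (int b = BUCKETS - 1; b >= 0; --b)  // highest priority first
    {
        int h = atomicAdd(&q->head[b], 1);  // claim slot h exclusively
        if (h < q->count[b]) { *item = q->items[b][h]; return true; }
    }
    return false;                           // everything drained
}

__global__ void worker(Buckets* q)
{
    int item;
    if (dequeue(q, &item))
        printf("thread %d got item %d\n", threadIdx.x, item);
}

int main()
{
    Buckets h = {};
    h.count[2] = 2; h.items[2][0] = 20; h.items[2][1] = 21; // high priority
    h.count[0] = 2; h.items[0][0] = 0;  h.items[0][1] = 1;  // low priority
    Buckets* d; cudaMalloc(&d, sizeof(Buckets));
    cudaMemcpy(d, &h, sizeof(Buckets), cudaMemcpyHostToDevice);
    worker<<<1, 4>>>(d);   // 4 threads: the high-priority items are claimed first
    cudaDeviceSynchronize();
    return 0;
}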
Abstract: In this paper, we present the concept of operator graph scheduling for high performance procedural generation on the graphics processing
unit (GPU). The operator graph forms an intermediate representation that describes all possible operations and objects that can arise
during a specific procedural generation. While previous methods have focused on parallelizing a specific procedural approach, the
operator graph is applicable to all procedural generation methods that can be described by a graph, such as L-systems, shape grammars,
or stack based generation methods. Using the operator graph, we show that all partitions of the graph correspond to possible ways
of scheduling a procedural generation on the GPU, including the scheduling strategies of previous work. As the space of possible
partitions is very large, we describe three search heuristics, aiding an optimizer in finding the fastest valid schedule for any given
operator graph. The best partitions found by our optimizer increase performance by 8 to 30x over the previous state of the art in GPU
shape grammar and L-system generation.
ACM Transactions on Graphics (SIGGRAPH Asia'16), 2016
@article{Boechat:2016:RSP:2980179.2980227,
author = {Pedro Boechat and Mark Dokter and Michael Kenzel and Hans-Peter Seidel and Dieter Schmalstieg and Markus Steinberger},
title = {Representing and Scheduling Procedural Generation using Operator Graphs},
journal = {ACM Trans. Graph.},
year = {2016},
numpages = {12},
url = {http://dx.doi.org/10.1145/2980179.2980227},
doi = {10.1145/2980179.2980227},
acmid = {2980227},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: Sparse matrix vector multiplication (SpMV) is the
workhorse for a wide range of linear algebra computations. In a
serial setting, naive implementations for direct multiplication and
transposed multiplication achieve very competitive performance.
In parallel settings, especially on graphics hardware, it is widely
believed that naive implementations cannot reach the performance
of highly tuned parallel implementations and complex
data formats. Most often, the cost for data conversion to these
specialized formats as well as the cost for transpose operations are
neglected, as they do not arise in all algorithms. In this paper, we
revisit the naive implementation of SpMV for the GPU. Relying
on recent advances in GPU hardware, such as fast hardware
supported atomic operations and better cache performance, we
show that a naive implementation can reach the performance
of state-of-the-art SpMV implementations. In case the cost of
format conversion and transposition cannot be amortized over
many SpMV operations, a naive implementation can even outperform
state-of-the-art implementations significantly. Experimental
results over a variety of data sets suggest that adapting
the naive serial implementation to the GPU is not as inferior as
it used to be on previous hardware generations. The integration
of some naive strategies can potentially speed up state-of-the-art
GPU SpMV implementations, especially in the transpose case.
HPEC '16 Best Paper Nominee High Performance Extreme Computing, 2016
@inproceedings{steinberger2016naive,
title={How naive is naive SpMV on the GPU?},
author={Steinberger, Markus and Derler, Andreas and Zayer, Rhaleb and Seidel, Hans-Peter},
booktitle={High Performance Extreme Computing Conference (HPEC), 2016 IEEE},
pages={1--8},
year={2016},
organization={IEEE}
}
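The transpose case discussed in the abstract is easy to picture: instead of materializing A^T, each row scatters its contributions with hardware atomics. The following kernel is a hedged sketch of such a naive transposed SpMV, not the paper's tuned code; y must be zero-initialized and sized to the number of columns.

// Naive transposed SpMV sketch: y = A^T x computed directly from A in
// CSR form. One thread per row; each nonzero A(r,c) contributes
// vals * x[r] to y[c] via a hardware-supported atomic add.
__global__ void csrSpMVTranspose(int rows, const int* rowPtr,
                                 const int* colIdx, const float* vals,
                                 const float* x, float* y)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;
    float xr = x[r];
    for (int i = rowPtr[r]; i < rowPtr[r + 1]; ++i)
        atomicAdd(&y[colIdx[i]], vals[i] * xr);
}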
Philip Voglreiter, Michael Hofmann, Christoph Ebner, Roberto Blanco Sequeiros, Horst Rupert Portugaller, Jurgen Fütterer, Michael Moche, Markus Steinberger, Dieter Schmalstieg:
Abstract: We present a visualization application supporting interventional radiologists during analysis of simulated minimally invasive
cancer treatment. The current clinical practice employs only rudimentary, manual measurement tools. Our system provides
visual support throughout three evaluation stages, starting with determining prospective treatment success of the simulation
parameterization. In case of insufficiencies, Stage 2 includes a simulation scalar field for determining a new configuration of the
simulation. For complex cases, where Stage 2 does not lead to a decisive strategy, Stage 3 reinforces analysis of interdependencies
of scalar fields via bivariate visualization. Our system is designed to be immediately applicable in medical practice. We analyze
the design space of potentially useful visualization techniques and appraise their effectiveness in the context of our design goals.
Furthermore, we present a user study, which reveals the disadvantages of manual analysis in the measurement stage of evaluation
and highlights the demand for computer support through our system.
Visual Computing for Biology and Medicine, 2016
@inproceedings {Voglreiter:VES:20161284,
booktitle = {Eurographics Workshop on Visual Computing for Biology and Medicine},
editor = {Stefan Bruckner and Bernhard Preim and Anna Vilanova and Helwig Hauser and Anja Hennemuth and Arvid Lundervold},
title = {{Visualization-Guided Evaluation of Simulated Minimally Invasive Cancer Treatment}},
author = {Voglreiter, Philip and Hofmann, Michael and Ebner, Christoph and Sequeiros, Roberto Blanco and Portugaller, Horst Rupert and Fütterer, Jurgen and Moche, Michael and Steinberger, Markus and Schmalstieg, Dieter},
year = {2016},
publisher = {The Eurographics Association},
ISSN = {2070-5786},
ISBN = {978-3-03868-010-9},
DOI = {10.2312/vcbm.20161284}
}
Abstract: Collaborative filtering collects similar patches, jointly filters them and scatters the output back to
input patches; each pixel gets a contribution from each patch that overlaps with it, allowing signal reconstruction from
highly corrupted data. Exploiting self-similarity, however, requires finding matching image patches, which is an expensive
operation. We propose a GPU-friendly approximate-nearest-neighbor (ANN) algorithm that produces high-quality results for
any type of collaborative filter. We evaluate our ANN search against state-of-the-art ANN algorithms in several application
domains. Our method is orders of magnitude faster, yet provides similar or higher quality results than the previous work.
Computer Graphics Forum, 2016
@article{Tsai:2016:FAH:12715,
author = {Tsai, Yun-Ta and Steinberger, Markus and Pająk, Dawid and Pulli, Kari},
year = {2016},
title = {Fast ANN for High-Quality Collaborative Filtering},
journal = {Computer Graphics Forum},
publisher = {Wiley Online Library},
issn = {1467-8659},
doi = {10.1111/cgf.12715},
url = {https://dx.doi.org/10.1111/cgf.12715},
volume = {35},
month = {2},
pages = {138--151},
number = {1}
}
Bernhard Kainz, Markus Steinberger, Wolfgang Wein, Maria Kuklisova-Murgasova, Christina Malamateniou, Kevin Keraudren, Thomas Torsney-Weir, Mary Rutherford, Paul Aljabar, Joseph V Hajnal, Daniel Rueckert:
Abstract: Capturing an enclosing volume of moving subjects and organs using fast individual image slice acquisition
has shown promise in dealing with motion artefacts. Motion between slice acquisitions results in spatial inconsistencies that
can be resolved by slice-to-volume reconstruction (SVR) methods to provide high quality 3D image data. Existing algorithms are,
however, typically very slow, specialised to specific applications and rely on approximations, which impedes their potential clinical
use. In this paper, we present a fast multi-GPU accelerated framework for slice-to-volume reconstruction. It is based on optimised
2D/3D registration, super-resolution with automatic outlier rejection and an additional (optional) intensity bias correction. We
introduce a novel and fully automatic procedure for selecting the image stack with least motion to serve as an initial registration
target. We evaluate the proposed method using artificial motion corrupted phantom data as well as clinical data, including tracked
freehand ultrasound of the liver and fetal Magnetic Resonance Imaging. We achieve speed-up factors greater than 30 compared to a
single CPU system and greater than 10 compared to currently available state-of-the-art multi-core CPU methods. We ensure high
reconstruction accuracy by exact computation of the point-spread function for every input data point, which has not previously
been possible due to computational limitations. Our framework and its implementation are scalable for available computational
infrastructures, and tests show a speed-up factor of 1.70 for each additional GPU. This paves the way for the online application
of image based reconstruction methods during clinical examinations. The source code for the proposed approach is publicly available.
IEEE Transactions on Medical Imaging, 2015
@article{Kainz:FVR:7064742,
author={B. Kainz and M. Steinberger and W. Wein and M. Kuklisova-Murgasova and C. Malamateniou and K. Keraudren and T. Torsney-Weir and M. Rutherford and P. Aljabar and J. V. Hajnal and D. Rueckert},
journal={IEEE Transactions on Medical Imaging},
title={Fast Volume Reconstruction From Motion Corrupted Stacks of 2D Slices},
year={2015},
volume={34},
number={9},
pages={1901-1913},
doi={10.1109/TMI.2015.2415453},
ISSN={0278-0062},
month={Sept}
}
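To illustrate the PSF-exact splatting mentioned above, here is a deliberately simplified CUDA sketch: each slice pixel scatters its intensity into the volume, weighted by an isotropic Gaussian PSF, with separate value and weight accumulators that are divided in a later pass. The axis-aligned slice pose and the Gaussian are assumptions made for the example; the actual framework applies full 2D/3D registration transforms and the scanner's PSF.

#include <math.h>

// Sketch of PSF-weighted splatting for slice-to-volume reconstruction:
// every pixel of a 2D slice scatters its intensity into a 3D volume,
// weighted by a point-spread function evaluated per voxel. Accumulate
// numerator (weighted intensity) and denominator (weight) separately.
__global__ void splatSlice(const float* slice, int w, int h, float sliceZ,
                           float* volNum, float* volDen, int vx, int vy, int vz,
                           float sigma, int radius)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= w || py >= h) return;
    float v = slice[py * w + px];
    for (int dz = -radius; dz <= radius; ++dz)
    for (int dy = -radius; dy <= radius; ++dy)
    for (int dx = -radius; dx <= radius; ++dx)
    {
        int X = px + dx, Y = py + dy, Z = int(sliceZ) + dz;
        if (X < 0 || X >= vx || Y < 0 || Y >= vy || Z < 0 || Z >= vz) continue;
        float d2 = dx * dx + dy * dy + (Z - sliceZ) * (Z - sliceZ);
        float wgt = expf(-d2 / (2.0f * sigma * sigma)); // PSF weight for this voxel
        int idx = (Z * vy + Y) * vx + X;
        atomicAdd(&volNum[idx], wgt * v);  // weighted intensity
        atomicAdd(&volDen[idx], wgt);      // normalization weight
    }
}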
Abstract: We present an approach for the automatic generation, interactive exploration
and real-time modification of disassembly procedures for complex, multipartite CAD data sets.
In order to lift the performance barriers prohibiting interactive disassembly planning, we run
a detailed analysis on the input model to identify recurring part constellations and efficiently
determine blocked part motions in parallel on the GPU. Building on the extracted information, we
present an interface for computing and editing extensive disassembly sequences in real-time while
considering user-defined constraints and avoiding unstable configurations. To evaluate the performance
of our C++/CUDA implementation, we use a variety of openly available CAD data sets, ranging from
simple to highly complex. In contrast to previous approaches, our work enables interactive disassembly
planning for objects which consist of several thousand parts and require cascaded translations during part removal.
Computer Graphics Forum / Eurographics (EG'15), 2015
@article {Kerbl:2015:IDP:CGF12560,
author = {Kerbl, Bernhard and Kalkofen, Denis and Steinberger, Markus and Schmalstieg, Dieter},
title = {Interactive Disassembly Planning for Complex Objects},
journal = {Computer Graphics Forum},
volume = {34},
number = {2},
issn = {1467-8659},
url = {http://dx.doi.org/10.1111/cgf.12560},
doi = {10.1111/cgf.12560},
pages = {287--297},
year = {2015},
}
Abstract: In this paper, we present a series of scheduling approaches
targeted for massively parallel architectures, which in combination allow a wider
range of algorithms to be executed on modern graphics processors. At first, we
describe a new processing model which enables the efficient execution of dynamic,
irregular workloads. Then, we present the currently fastest queuing algorithm for graphics
processors, the most efficient dynamic memory allocator for massively parallel architectures,
and the only autonomous scheduler for graphics processing units that can dynamically support
different granularities of parallelism. Finally, we show how these scheduling approaches help to advance
the state of the art in rendering, visualization, and procedural modeling.
it - Information Technology, 2015
@article{steinberger2015overview,
title={An overview of dynamic resource scheduling on graphics processors},
author={Steinberger, Markus},
journal={it-Information Technology},
volume={57},
number={2},
pages={138--143},
year={2015}
}
Abstract: In this paper we investigate the possibility of real-time Reyes rendering with advanced effects
such as displacement mapping and multidimensional rasterization on current graphics hardware. We describe a first GPU
Reyes implementation running within an autonomously executing persistent Megakernel. To support high-quality rendering,
we integrate displacement mapping into the renderer, which has only marginal impact on performance. To investigate
rasterization for Reyes, we start with an approach similar to nearest neighbor filling, before presenting a precise
sampling algorithm. This algorithm can be enhanced to support motion blur and depth of field using three-dimensional
sampling. To evaluate the performance-quality trade-off of these effects, we compare three approaches: coherent sampling
across time and on the lens, essentially overlaying images; randomized sampling along all dimensions; and repetitive
randomization, in which the randomization pattern is repeated among subgroups of pixels. We evaluate all approaches, showing
that high quality images can be generated with interactive to real-time refresh rates for advanced Reyes features.
Spring Conference on Computer Graphics, 2015
@inproceedings{Sattlecker:2015:RRG:2788539.2788543,
author = {Sattlecker, Martin and Steinberger, Markus},
title = {Reyes Rendering on the GPU},
booktitle = {Proceedings of the 31st Spring Conference on Computer Graphics},
series = {SCCG '15},
year = {2015},
isbn = {978-1-4503-3693-2},
location = {Smolenice, Slovakia},
pages = {31--38},
numpages = {8},
url = {http://doi.acm.org/10.1145/2788539.2788543},
doi = {10.1145/2788539.2788543},
acmid = {2788543},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: A computer implemented method of performing an approximate-nearest-neighbor search is disclosed.
The method comprises dividing an image into a plurality of tiles. Further, for each of the plurality of tiles,
perform the following in parallel on a processor: (a) dividing image patches into a plurality of clusters,
wherein each cluster comprises similar image patches, and wherein the dividing continues recursively until
a size of a cluster is below a threshold value; (b) performing a nearest-neighbor query within each of the
plurality of clusters; and (c) performing collaborative filtering in parallel for each image patch, wherein
the collaborative filtering aggregates and processes nearest neighbor image patches from a same cluster containing
a respective image patch to form an output image.
US Patent, 2015
@misc{pajak2015efficient,
title={Efficient approximate-nearest-neighbor (ann) search for high-quality collaborative filtering},
author={Pajak, D.S. and Tsai, Y.T. and Steinberger, M.},
url={https://www.google.com/patents/US20150206285},
year={2015},
month=jul # "~23",
note={US Patent App. 14/632,782}
}
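Step (b) of the claim, the nearest-neighbor query within a cluster, can be pictured with the following CUDA sketch. It assumes the patches have already been clustered and stored contiguously per cluster; the patch size and flat array layout are illustrative, and the recursive clustering of step (a) is omitted.

#include <float.h>

// Per-cluster nearest-neighbor query sketch: each thread scans only its
// own cluster for the closest patch (squared L2 distance). Assumes
// patches are sorted by cluster so each cluster is a contiguous range.
const int PATCH = 8 * 8;

__global__ void nnWithinCluster(const float* patches,    // [numPatches][PATCH]
                                const int* clusterOf,    // cluster id per patch
                                const int* clusterStart, // first patch index per cluster
                                const int* clusterSize,
                                int numPatches, int* nnIdx)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= numPatches) return;
    int c = clusterOf[p], best = -1;
    float bestD = FLT_MAX;
    for (int j = clusterStart[c]; j < clusterStart[c] + clusterSize[c]; ++j)
    {
        if (j == p) continue;                  // skip the query patch itself
        float d = 0.0f;
        for (int k = 0; k < PATCH; ++k)
        {
            float diff = patches[p * PATCH + k] - patches[j * PATCH + k];
            d += diff * diff;
        }
        if (d < bestD) { bestD = d; best = j; }
    }
    nnIdx[p] = best;
}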
ACM Transactions on Graphics (SIGGRAPH Asia'14), 2014
@article{Steinberger:2014:WTS:2661229.2661250,
author = {Steinberger, Markus and Kenzel, Michael and Boechat, Pedro and Kerbl, Bernhard and Dokter, Mark and Schmalstieg, Dieter},
title = {Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU},
journal = {ACM Trans. Graph.},
issue_date = {November 2014},
volume = {33},
number = {6},
month = nov,
year = {2014},
issn = {0730-0301},
pages = {228:1--228:11},
articleno = {228},
numpages = {11},
url = {http://doi.acm.org/10.1145/2661229.2661250},
doi = {10.1145/2661229.2661250},
acmid = {2661250},
publisher = {ACM},
address = {New York, NY, USA},
}
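Note: the abstract shown for this entry on earlier versions of this page duplicated the FlexISP abstract below; it has been removed here, see the ACM link for the Whippletree abstract.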
Felix Heide, Markus Steinberger, Yun-Ta Tsai, Nasa Rouf, Dawid Pajak, Dikpal Reddy, Orazio Gallo, Jing Liu, Wolfgang Heidrich, Karen Egiazarian, Jan Kautz, Kari Pulli:
Abstract: Conventional pipelines for capturing, displaying, and storing
images are usually defined as a series of cascaded modules, each responsible for
addressing a particular problem. While this divide-and-conquer approach offers many
benefits, it also introduces a cumulative error, as each step in the pipeline only
considers the output of the previous step, not the original sensor data. We propose
an end-to-end system that is aware of the camera and image model, enforces natural-image
priors, while jointly accounting for common image processing steps like demosaicking,
denoising, deconvolution, and so forth, all directly in a given output representation
(e.g., YUV, DCT). Our system is flexible and we demonstrate it on regular Bayer images as
well as images from custom sensors. In all cases, we achieve large improvements in image
quality and signal reconstruction compared to state-of-the-art techniques. Finally, we show
that our approach is capable of very efficiently handling high-resolution images, making
even mobile implementations feasible.
ACM Transactions on Graphics (SIGGRAPH Asia'14), 2014
@article{Heide:2014:FFC:2661229.2661260,
author = {Heide, Felix and Steinberger, Markus and Tsai, Yun-Ta and Rouf, Mushfiqur and Paj\k{a}k, Dawid and Reddy, Dikpal and Gallo, Orazio and Liu, Jing and Heidrich, Wolfgang and Egiazarian, Karen and Kautz, Jan and Pulli, Kari},
title = {FlexISP: A Flexible Camera Image Processing Framework},
journal = {ACM Trans. Graph.},
issue_date = {November 2014},
volume = {33},
number = {6},
month = nov,
year = {2014},
issn = {0730-0301},
pages = {231:1--231:13},
articleno = {231},
numpages = {13},
url = {http://doi.acm.org/10.1145/2661229.2661260},
doi = {10.1145/2661229.2661260},
acmid = {2661260},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: Collaborative filtering collects similar patches, jointly filters them, and scatters the output back to input patches;
each pixel gets a contribution from each patch that overlaps with it, allowing signal reconstruction from highly
corrupted data. Exploiting self-similarity, however, requires finding matching image patches, which is an expensive
operation. We propose a GPU-friendly approximate-nearest-neighbor algorithm that produces high-quality results
for any type of collaborative filter. We evaluate our ANN search against state-of-the-art ANN algorithms in several
application domains. Our method is orders of magnitude faster, yet provides similar or higher-quality results than
the previous work.
HPG'14 Best Paper Award High Performance Graphics (HPG '14), 2014
@inproceedings {Tsai:FAH:2014,
booktitle = {Eurographics/ ACM SIGGRAPH Symposium on High Performance Graphics},
editor = {Ingo Wald and Jonathan Ragan-Kelley},
title = {{Fast ANN for High-Quality Collaborative Filtering}},
author = {Tsai, Yun-Ta and Steinberger, Markus and Pajak, Dawid and Pulli, Kari},
year = {2014},
publisher = {The Eurographics Association},
ISSN = {2079-8679},
ISBN = {978-3-905674-60-6},
DOI = {10.2312/hpg.20141094}
}
Abstract: We propose a technique to build the irradiance cache for isotropic scattering simultaneously
with Monte Carlo progressive direct volume rendering on a single GPU, which allows us to achieve up to four times
increased convergence rate for complex scenes with arbitrary sources of light. We use three procedures that run
concurrently on a single GPU. The first is the main rendering procedure. The second procedure computes new cache
entries, and the third one corrects the errors that may arise after creation of new cache entries. We propose two
distinct approaches to allow massive parallelism of cache entry creation. In addition, we show a novel extrapolation
approach which outputs high quality irradiance approximations and a suitable prioritization scheme to increase the
convergence rate by dedicating more computational power to more complex rendering areas.
Computer Graphics Forum / EuroVis, 2014
@article {Khlebnikov:PIC:2014,
author = {Khlebnikov, Rostislav and Voglreiter, Philip and Steinberger, Markus and Kainz, Bernhard and Schmalstieg, Dieter},
title = {Parallel Irradiance Caching for Interactive Monte-Carlo Direct Volume Rendering},
journal = {Computer Graphics Forum},
volume = {33},
number = {3},
issn = {1467-8659},
url = {http://dx.doi.org/10.1111/cgf.12362},
doi = {10.1111/cgf.12362},
pages = {61--70},
year = {2014},
}
Abstract: Content on computer screens is often inaccessible to users because it is hidden, e.g., occluded by other windows,
outside the viewport, or overlooked. In search tasks, the efficient retrieval of sought content is important. Current software, however,
only provides limited support to visualize hidden occurrences and rarely supports search synchronization crossing application boundaries.
To remedy this situation, we introduce two novel visualization methods to guide users to hidden content. Our first method generates
awareness for occluded or out-of-viewport content using see-through visualization. For content that is either outside the screen's
viewport or for data sources not opened at all, our second method shows off-screen indicators and an on-demand smart preview. To reduce
the chances of overlooking content, we use visual links, i.e., visible edges, to connect the visible content or the visible representations
of the hidden content. We show the validity of our methods in a user study, which demonstrates that our technique enables a faster
localization of hidden content compared to traditional search functionality and thereby assists users in information retrieval tasks.
CHI '14 Honorable Mention Award SIGCHI Conference on Human Factors in Computing Systems (CHI'14), 2014
@inproceedings{Geymayer:2014:SMI:2556288.2557032,
author = {Geymayer, Thomas and Steinberger, Markus and Lex, Alexander and Streit, Marc and Schmalstieg, Dieter},
title = {Show Me the Invisible: Visualizing Hidden Content},
booktitle = {Proceedings of the SIGCHI Conference on Human Factors in Computing Systems},
series = {CHI '14},
year = {2014},
isbn = {978-1-4503-2473-1},
location = {Toronto, Ontario, Canada},
pages = {3705--3714},
numpages = {10},
url = {http://doi.acm.org/10.1145/2556288.2557032},
doi = {10.1145/2556288.2557032},
acmid = {2557032},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {hidden content, occluded content, off-screen content, visual linking},
}
Abstract: In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in
real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry
produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation
and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning
and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time
directly on the GPU. We also present a robust and efficient way to dynamically update a scene's derivation tree and geometry,
enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous
work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main
memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed.
Computer Graphics Forum / Eurographics (EG'14), 2014
@article{Steinberger:2014:OGR:2771441.2771452,
author = {Steinberger, Markus and Kenzel, Michael and Kainz, Bernhard and Wonka, Peter and Schmalstieg, Dieter},
title = {On-the-fly Generation and Rendering of Infinite Cities on the GPU},
journal = {Comput. Graph. Forum},
issue_date = {May 2014},
volume = {33},
number = {2},
month = may,
year = {2014},
issn = {0167-7055},
pages = {105--114},
numpages = {10},
url = {http://dx.doi.org/10.1111/cgf.12315},
doi = {10.1111/cgf.12315},
acmid = {2771452},
publisher = {The Eurographics Association \& John Wiley \& Sons, Ltd.},
address = {Chichester, UK},
}
Abstract: In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars
on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow,
the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling
including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar,
reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule
scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by
dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude
faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state
of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity.
Honorable Mention Award (among top 3 papers) Computer Graphics Forum / Eurographics (EG'14), 2014
@article{Steinberger:2014:PGA:2771441.2771449,
author = {Steinberger, Markus and Kenzel, Michael and Kainz, Bernhard and M\"{u}ller, J\"{o}rg and Wonka, Peter and Schmalstieg, Dieter},
title = {Parallel Generation of Architecture on the GPU},
journal = {Comput. Graph. Forum},
issue_date = {May 2014},
volume = {33},
number = {2},
month = may,
year = {2014},
issn = {0167-7055},
pages = {73--82},
numpages = {10},
url = {http://dx.doi.org/10.1111/cgf.12312},
doi = {10.1111/cgf.12312},
acmid = {2771449},
publisher = {The Eurographics Association \& John Wiley \& Sons, Ltd.},
address = {Chichester, UK},
}
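A minimal way to picture parallel rule derivation (far simpler than the scheduling in the two papers above) is a level-synchronous expansion: each thread derives one shape and appends its successors atomically, while the host ping-pongs the buffers until no non-terminal shapes remain. The Shape struct and the single split rule below are invented for the example.

// Level-synchronous sketch of parallel shape grammar expansion.
struct Shape { float x, size; int depth; };

__global__ void deriveLevel(const Shape* in, int inCount,
                            Shape* out, int* outCount, int maxDepth)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= inCount) return;
    Shape s = in[i];
    if (s.depth >= maxDepth) return;   // terminal: emit geometry instead
    // Toy rule: split the shape into two halves (cf. split rules in shape grammars).
    int slot = atomicAdd(outCount, 2); // append both successors atomically
    out[slot]     = { s.x,                 s.size * 0.5f, s.depth + 1 };
    out[slot + 1] = { s.x + s.size * 0.5f, s.size * 0.5f, s.depth + 1 };
}

Between launches, the host would swap the in/out buffers and reset outCount to zero; the papers above avoid exactly these CPU round trips with in-kernel scheduling.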
Abstract: Analysis of multivariate data is of great importance in many scientific disciplines. However, visualization of 3D
spatially-fixed multivariate volumetric data is a very challenging task. In this paper we present a method that allows simultaneous
real-time visualization of multivariate data. We redistribute the opacity within a voxel to improve the readability of the color defined
by a regular transfer function, and to maintain the see-through capabilities of volume rendering. We use predictable procedural
noise--random-phase Gabor noise--to generate a high-frequency redistribution pattern and construct an opacity mapping function,
which allows us to partition the available space among the displayed data attributes. This mapping function is appropriately filtered
to avoid aliasing, while maintaining transparent regions. We show the usefulness of our approach on various data sets and with different
example applications. Furthermore, we evaluate our method by comparing it to other visualization techniques in a controlled user study.
Overall, the results of our study indicate that users are much more accurate in determining exact data values with our novel 3D volume
visualization method. Significantly lower error rates for reading data values and high subjective ranking of our method imply that it
has a high chance of being adopted for the purpose of visualization of multivariate 3D data.
Transactions on Visualization and Computer Graphics (Vis'13), 2013
@ARTICLE{Khlebnikov:2013:NBV,
author={R. Khlebnikov and B. Kainz and M. Steinberger and D. Schmalstieg},
journal={IEEE Transactions on Visualization and Computer Graphics},
title={Noise-Based Volume Rendering for the Visualization of Multivariate Volumetric Data},
year={2013},
volume={19},
number={12},
pages={2926-2935},
doi={10.1109/TVCG.2013.180},
ISSN={1077-2626},
month={Dec},
}
Abstract: Modern GPUs are powerful enough to enable interactive display of
high-quality volume data, even though many volume rendering methods do not present a natural fit for current GPU hardware.
However, a vast amount of computational power still remains unused due to inefficient use of the available hardware. In
this work, we demonstrate how advanced scheduling methods can be employed to implement volume rendering algorithms in a way that
better utilizes the GPU, using three different state-of-the-art volume rendering techniques as examples.
IEEE SciVis Poster (Best Poster Honorable Mention), 2013
@misc{Voglreiter:2013:VRA,
title={Volume Rendering with advanced GPU scheduling strategies},
author={Voglreiter, Philip and Steinberger, Markus and Khlebnikov, Rostislav and Kainz, Bernhard and Schmalstieg, Dieter},
year={2013},
publisher={IEEE SciVis Poster, Best Poster (Honorable Mention), Atlanta, GA, USA}
}
Abstract: In Augmented Reality (AR), ghosted views allow a viewer to explore hidden structure within the real-world environment.
A body of previous work has explored which features are suitable to support the structural interplay between occluding and occluded elements.
However, the dynamics of AR environments pose serious challenges to the presentation of ghosted views. While a model of the real world may
help determine distinctive structural features, changes in appearance or illumination are detrimental to the composition of occluding and occluded
structure. In this paper, we present an approach that considers the information value of the scene before and after generating the ghosted
view. To this end, a contrast adjustment of the preserved occluding features is calculated, which adaptively varies their visual saliency within the
ghosted view visualization. This allows us to not only preserve important features, but to also support their prominence after
revealing occluded structure, thus achieving a positive effect on the perception of ghosted views.
International Symposium on Mixed and Augmented Reality (ISMAR'13), 2013
@INPROCEEDINGS{Kalkofen:2013:AGV,
author={D. Kalkofen and E. Veas and S. Zollmann and M. Steinberger and D. Schmalstieg},
booktitle={Mixed and Augmented Reality (ISMAR), 2013 IEEE International Symposium on},
title={Adaptive ghosted views for Augmented Reality},
year={2013},
pages={1-9},
keywords={augmented reality;data visualisation;AR environment dynamics;adaptive ghosted view visualization;augmented reality;occluding features contrast adjustment;occluding-occluded elements structural interplay;structural features;visual saliency;Augmented reality;Color;Image color analysis;Rendering (computer graphics);Solid modeling;Three-dimensional displays;Visualization},
doi={10.1109/ISMAR.2013.6671758},
month={Oct},
}
Bernhard Kainz, Stefan Hauswiesner, Gerhard Reitmayr, Markus Steinberger, Raphael Grasset, Lukas Gruber, Eduardo Veas, Denis Kalkofen, Hartmut Seichter, Dieter Schmalstieg:
Abstract: Real-time three-dimensional acquisition of real-world scenes has many important applications in computer graphics,
computer vision and human-computer interaction. Inexpensive depth sensors such as the Microsoft Kinect facilitate the development
of such applications. However, this technology is still relatively recent, and no detailed studies on its scalability to dense and
view-independent acquisition have been reported. This paper addresses the question of what can be done with a larger number of Kinects
used simultaneously. We describe an interference-reducing physical setup, a calibration procedure and an extension to the KinectFusion
algorithm, which allows us to produce high quality volumetric reconstructions from multiple Kinects whilst overcoming systematic errors in
the depth measurements. We also report on enhancing image based visual hull rendering by depth measurements, and compare the results to
KinectFusion. Our system provides practical insight into achievable spatial and radial range and into bandwidth requirements for depth
data acquisition. Finally, we present a number of practical applications of our system.
Symposium On Virtual Reality Software And Technology (VRST'12), 2012
@inproceedings{Kainz:2012:ORD:2407336.2407342,
author = {Kainz, Bernhard and Hauswiesner, Stefan and Reitmayr, Gerhard and Steinberger, Markus and Grasset, Raphael and Gruber, Lukas and Veas, Eduardo and Kalkofen, Denis and Seichter, Hartmut and Schmalstieg, Dieter},
title = {OmniKinect: Real-time Dense Volumetric Data Acquisition and Applications},
booktitle = {Proceedings of the 18th ACM Symposium on Virtual Reality Software and Technology},
series = {VRST '12},
year = {2012},
isbn = {978-1-4503-1469-5},
location = {Toronto, Ontario, Canada},
pages = {25--32},
numpages = {8},
url = {http://doi.acm.org/10.1145/2407336.2407342},
doi = {10.1145/2407336.2407342},
acmid = {2407342},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {4D reconstruction, depth sensors, microsoft kinect},
}
Abstract: In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores
operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive
and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general
purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently
on massively parallel architectures. Softshell has the following distinct advantages: (1) work can be dynamically issued directly on the
device, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports
arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel
work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were
previously only known from operating-system designs and reserved for CPU programming. As a proof of our claims, we present a publicly
available implementation of the Softshell processing model realized on top of CUDA. The benchmarks of this implementation demonstrate
that our processing model is easy to use and also performs substantially better than the state-of-the-art kernel-based processing model
for problems that have been difficult to parallelize in the past.
ACM Transactions on Graphics (SIGGRAPH Asia'12), 2012
@article{Steinberger:2012:SDS:2366145.2366180,
author = {Steinberger, Markus and Kainz, Bernhard and Kerbl, Bernhard and Hauswiesner, Stefan and Kenzel, Michael and Schmalstieg, Dieter},
title = {Softshell: Dynamic Scheduling on {GPUs}},
journal = {ACM Trans. Graph.},
issue_date = {November 2012},
volume = {31},
number = {6},
month = nov,
year = {2012},
issn = {0730-0301},
pages = {161:1--161:11},
articleno = {161},
numpages = {11},
url = {http://doi.acm.org/10.1145/2366145.2366180},
doi = {10.1145/2366145.2366180},
acmid = {2366180},
publisher = {ACM},
address = {New York, NY, USA}
}
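The key Softshell claim, issuing work directly on the device, can be sketched with a persistent kernel that keeps pulling items from a global queue, where processing a first-stage item publishes a second-stage item without a CPU round trip. This is a deliberately tiny sketch with fixed slot assignment and per-slot ready flags, not the paper's three-tier scheduler; the spin-wait assumes independent thread scheduling (Volta or newer).

#include <cstdio>

const int N = 64, CAP = 2 * N;   // stage-1 items plus one child each

__global__ void persistentWorker(int* items, int* ready, int* head)
{
    while (true)
    {
        int h = atomicAdd(head, 1);               // claim the next queue slot
        if (h >= CAP) return;                     // all work claimed: retire
        while (atomicAdd(&ready[h], 0) == 0) {}   // wait until the slot is filled
        int item = items[h];
        if (h < N)                                // stage 1: spawn a child item
        {
            items[N + h] = item * 10;             // child's payload
            __threadfence();                      // make the payload visible first
            atomicExch(&ready[N + h], 1);         // then publish the slot
        }
        else
            printf("consumed child item %d\n", item);
    }
}

int main()
{
    int *items, *ready, *head;
    cudaMalloc(&items, CAP * sizeof(int));
    cudaMalloc(&ready, CAP * sizeof(int));
    cudaMalloc(&head, sizeof(int));
    cudaMemset(ready, 0, CAP * sizeof(int));
    cudaMemset(head, 0, sizeof(int));
    int h[N]; for (int i = 0; i < N; ++i) h[i] = i;
    cudaMemcpy(items, h, sizeof(h), cudaMemcpyHostToDevice);
    int ones[N]; for (int i = 0; i < N; ++i) ones[i] = 1;
    cudaMemcpy(ready, ones, sizeof(ones), cudaMemcpyHostToDevice); // stage-1 slots ready
    persistentWorker<<<2, 64>>>(items, ready, head);
    cudaDeviceSynchronize();
    return 0;
}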
Abstract: In this paper we propose a particle-based volume rendering approach for unstructured,
three-dimensional, tetrahedral polygon meshes. We stochastically generate millions of particles per second
and project them on the screen in real-time. In contrast to previous rendering techniques of tetrahedral
volume meshes, our method does not require prior depth sorting of the geometry. Instead, the rendered image
is generated by choosing the particles closest to the camera. Furthermore, we use spatial superimposing: each
pixel is constructed from multiple subpixels. This approach not only increases projection accuracy, but
also allows combining subpixels into one superpixel, which creates the well-known translucency effect
of volume rendering. We show that our method is fast enough for the visualization of unstructured three-dimensional
grids with hard real-time constraints and that it scales well for a high number of particles.
Mesh Processing in Medical Image Analysis (MICCAI'12), 2012
@Inbook{Voglreiter:2012:VRP,
author={Voglreiter, Philip and Steinberger, Markus and Schmalstieg, Dieter and Kainz, Bernhard},
editor={Levine, Joshua A. and Paulsen, Rasmus R. and Zhang, Yongjie},
title={Volumetric Real-Time Particle-Based Representation of Large Unstructured Tetrahedral Polygon Meshes},
bookTitle={Mesh Processing in Medical Image Analysis 2012: MICCAI 2012 International Workshop, MeshMed 2012, Nice, France, October 1, 2012. Proceedings},
year={2012},
publisher={Springer Berlin Heidelberg},
address={Berlin, Heidelberg},
pages={159--168},
isbn={978-3-642-33463-4},
doi={10.1007/978-3-642-33463-4_16},
url={http://dx.doi.org/10.1007/978-3-642-33463-4_16}
}
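The "closest particle wins" resolution without depth sorting maps nicely onto a single 64-bit atomic: pack the depth into the high bits and the particle's payload into the low bits, then atomicMin keeps the nearest particle per (sub)pixel. The helper below is a sketch under those assumptions (non-negative depths, 8-bit payload, pixel buffer initialized to all ones); projection and shading are omitted, and 64-bit atomicMin requires compute capability 3.5 or newer.

// Keep the particle closest to the camera via one 64-bit atomicMin.
// Non-negative IEEE floats compare like their bit patterns, so placing
// the depth bits in the high word makes atomicMin select the smallest depth.
__device__ void storeNearest(unsigned long long* pixel,  // init to ~0ull per frame
                             float depth, unsigned char value)
{
    unsigned long long key =
        ((unsigned long long)__float_as_uint(depth) << 32) | value;
    atomicMin(pixel, key);
}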
Abstract: Simultaneous visualization of multiple continuous data attributes in
a single visualization is a task that is important for many application areas. Unsurprisingly,
many methods have been proposed to solve this task. However, the behavior of such methods
during the exploration stage, when the user tries to understand the data with panning and
zooming, has not been given much attention. In this paper, we propose a method that uses
procedural texture synthesis to create zoom-independent visualizations of three scalar data
attributes. The method is based on random-phase Gabor noise, whose frequency is adapted for
the visualization of the first data attribute. We ensure that the resulting texture frequency
lies in the range that is perceived well by the human visual system at any zoom level. To
enhance the perception of this attribute, we also apply a specially constructed transfer
function that is based on statistical properties of the noise. Additionally, the transfer
function is constructed in a way that it does not introduce any aliasing to the texture. We
map the second attribute to the texture orientation. The third attribute is color coded and
combined with the texture by modifying the value component of the HSV color model. The necessary
contrast needed for texture and color perception was determined in a user study. In addition, we
conducted a second user study that shows significant advantages of our method over current methods
with similar goals. We believe that our method is an important step towards creating methods that
not only succeed in visualizing multiple data attributes, but also adapt to the behavior of the
user during the data exploration stage.
Computer Graphics Forum (EuroVis12), 2012
@article {Khlebnikov:2012:PTS,
journal = {Computer Graphics Forum},
title = {{Procedural Texture Synthesis for Zoom-Independent Visualization of Multivariate Data}},
author = {Khlebnikov, Rostislav and Kainz, Bernhard and Steinberger, Markus and Streit, Marc and Schmalstieg, Dieter},
year = {2012},
publisher = {The Eurographics Association and Blackwell Publishing Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/j.1467-8659.2012.03127.x}
}
Abstract: In this paper, we present the design and implementation of a dynamic window
management technique that changes the perception of windows as fixed-sized rectangles.
The primary goal of self-organizing windows is to automatically display the most relevant
information for a user's current activity, which removes the burden of organizing and arranging
windows from the user. We analyze the image-based representation of each window and identify
coherent pieces of information. The windows are then automatically moved, scaled and composed in
a content-aware manner to fit the most relevant information into the limited area of the screen.
During the design process, we consider findings from previous experiments and show how users can
benefit from our system. We also describe how the immense processing power of current graphics
processing units can be exploited to build an interactive system that finds an optimal solution
within the complex design space of all possible window transformations in real time.
Computer Graphics Forum / Eurographics (EG'12), 2012
@article{Steinberger:2012:ISW,
journal = {Computer Graphics Forum},
title = {{Interactive Self-Organizing Windows}},
author = {Steinberger, Markus and Waldner, Manuela and Schmalstieg, Dieter},
year = {2012},
publisher = {The Eurographics Association and John Wiley and Sons Ltd.},
ISSN = {1467-8659},
DOI = {10.1111/j.1467-8659.2012.03041.x}
}
Abstract: In this paper, we analyze the special requirements of a dynamic
memory allocator that is designed for massively parallel architectures such as Graphics
Processing Units (GPUs). We show that traditional strategies, which work well on CPUs,
are not well suited for the use on GPUs and present the thorough design of ScatterAlloc,
which can efficiently deal with hundreds of requests in parallel.
Our allocator greatly reduces collisions and congestion by scattering memory requests
based on hashing. We analyze ScatterAlloc in terms of allocation speed, data access
time and fragmentation, and compare it to current state-of-the-art allocators, including
the one provided with the NVIDIA CUDA toolkit. Our results show that ScatterAlloc
clearly outperforms these other approaches, yielding speed-ups between 10 and 100.
Innovative Parallel Computing (InPar'12), 2012
@INPROCEEDINGS{Steinberger:2012:SMP:6339604,
author={Markus Steinberger and Michael Kenzel and Bernhard Kainz and Dieter Schmalstieg},
booktitle={Innovative Parallel Computing (InPar), 2012},
title={ScatterAlloc: Massively parallel dynamic memory allocation for the {GPU}},
year={2012},
pages={1-10},
doi={10.1109/InPar.2012.6339604},
month={May}
}
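The scattering idea itself fits in a few lines: rather than serializing all requests on one free-list head, each request hashes to a different starting page and probes from there, so concurrent allocations mostly touch different memory locations. The sketch below reduces page bookkeeping to a single "used" flag per page and is not ScatterAlloc itself, which manages chunked pages with richer metadata.

// Hash-scattered page allocation sketch: requests start probing at a
// page derived from the multiprocessor and warp id, turning congestion
// on a single counter into scattered, mostly uncontended atomics.
const int PAGES = 4096;

__device__ int allocPage(int* used /* one int per page, 0 = free */)
{
    unsigned smid; asm("mov.u32 %0, %%smid;" : "=r"(smid));
    unsigned hash = (smid * 2654435761u + (threadIdx.x >> 5)) % PAGES;
    for (int probe = 0; probe < PAGES; ++probe)
    {
        int page = (hash + probe) % PAGES;
        if (atomicCAS(&used[page], 0, 1) == 0)  // claimed a free page
            return page;
    }
    return -1;                                  // allocator is full
}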
Abstract: Many virtual mirror and telepresence applications require novel viewpoint synthesis with little latency to user
motion. Image-based visual hull (IBVH) rendering is capable of rendering arbitrary views from segmented images without an explicit
intermediate data representation, such as a mesh or a voxel grid. By computing depth images directly from the silhouette images,
it usually outperforms indirect methods. GPU-hardware accelerated implementations exist, but due to the lack of an intermediate
representation no multi-GPU parallel strategies and implementations are currently available. This paper suggests three ways to
parallelize the IBVH-pipeline and maps them to the sorting classification that is often applied to conventional parallel rendering
systems. In addition to sort-first parallelization, we suggest a novel sort-last formulation that regards cameras as scene objects.
We enhance this method’s performance by a block-based encoding of the rendering results. For interactive systems with hard real-time
constraints, we combine the algorithm with a multi-frame rate (MFR) system. We suggest a combination of forward and backward image
warping to improve the visual quality of the MFR rendering. We observed the runtime behavior of the suggested methods and assessed
how their performance scales with respect to input and output resolutions and the number of GPUs. By using additional GPUs, we reduced
rendering times by up to 60%. Multi-frame rate viewing can even be ten times faster.
Eurographics Symposium on Parallel Graphics and Visualization (EGPGV'12), 2012
@inproceedings {Hauswiesner:2012:MIV,
title = {{Multi-GPU Image-based Visual Hull Rendering}},
author = {Hauswiesner, Stefan and Khlebnikov, Rostislav and Steinberger, Markus and Straka, Matthias and Reitmayr, Gerhard},
booktitle = {Eurographics Symposium on Parallel Graphics and Visualization},
editor = {Hank Childs and Torsten Kuhlen and Fabio Marton},
year = {2012},
publisher = {The Eurographics Association},
ISSN = {1727-348X},
ISBN = {978-3-905674-35-4},
DOI = {10.2312/EGPGV/EGPGV12/119-128},
}
Abstract: This paper presents a new method to control scene sampling in complex ray-based rendering
environments. It proposes to constrain image sampling density with a combination of object features,
which are known to be well perceived by the human visual system, and image space saliency, which
captures effects that are not based on the object’s geometry. The presented method uses Non-Photorealistic
Rendering techniques for the object space feature evaluation and combines the image
space saliency calculations with image warping to infer quality hints from previously generated frames.
In order to map different feature types to sampling densities, we also present an evaluation of the
object space and image space features’ impact on the resulting image quality. In addition, we present an
efficient, adaptively aligned fractal pattern that is used to reconstruct the image from sparse sampling
data. Furthermore, this paper presents an algorithm which uses our method in order to guarantee a
desired minimal frame rate. Our scheduling algorithm maximizes the utilization of each given time
slice by rendering features in the order of visual importance values until a time constraint is reached.
We demonstrate how our method can be used to boost or stabilize the rendering time in complex
ray-based image generation consisting of geometric as well as volumetric data.
Computers & Graphics, 2012
@article{Steinberger:2012:RPS,
title = {Ray prioritization using stylization and visual saliency},
author = {Markus Steinberger and Bernhard Kainz and Stefan Hauswiesner and Rostislav Khlebnikov and Denis Kalkofen and Dieter Schmalstieg},
journal = {Computers \& Graphics},
volume = {36},
number = {6},
pages = {673--684},
year = {2012},
note = {2011 Joint Symposium on Computational Aesthetics (CAe), Non-Photorealistic Animation and Rendering (NPAR), and Sketch-Based Interfaces and Modeling (SBIM)},
issn = {0097-8493},
doi = {10.1016/j.cag.2012.03.037},
url = {http://www.sciencedirect.com/science/article/pii/S0097849312000854},
}
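The scheduling loop described at the end of the abstract, rendering in order of importance until the time slice is spent, can be sketched on the host side as follows. renderChunk stands in for the actual ray kernels, and the chunk size and clock source are choices made for the example.

#include <algorithm>
#include <chrono>
#include <vector>

struct Sample { int pixel; float importance; };

// Time-budgeted rendering sketch: sort sample requests by visual
// importance and process them in chunks until the frame budget is
// spent; the remaining pixels are filled by reconstruction.
void renderFrame(std::vector<Sample>& samples, double budgetMs,
                 void (*renderChunk)(const Sample*, int))
{
    std::sort(samples.begin(), samples.end(),
              [](const Sample& a, const Sample& b)
              { return a.importance > b.importance; });
    auto start = std::chrono::steady_clock::now();
    const size_t chunk = 4096;
    for (size_t i = 0; i < samples.size(); i += chunk)
    {
        renderChunk(samples.data() + i,
                    (int)std::min(samples.size() - i, chunk));
        double elapsed = std::chrono::duration<double, std::milli>(
            std::chrono::steady_clock::now() - start).count();
        if (elapsed > budgetMs) break;   // budget spent: stop and reconstruct the rest
    }
}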
Abstract: Current projectors can easily be combined to create an everywhere display,
using all suitable surfaces in offices or meeting rooms for the presentation of information. However,
the resulting irregular display is not well supported by traditional desktop window managers, which are
optimized for rectangular screens. In this paper, we present novel display-adaptive window management
techniques, which provide semi-automatic placement for desktop elements (such as windows or icons)
for users of large, irregularly shaped displays. We report results from an exploratory study, which reveals
interesting emerging strategies of users in the manipulation of windows on large irregular displays and shows that
the new techniques increase subjective satisfaction with the window management interface.
Interactive Tabletops and Surfaces (ITS'11), 2011
@inproceedings{Waldner:2011:DWM:2076354.2076394,
author = {Waldner, Manuela and Grasset, Raphael and Steinberger, Markus and Schmalstieg, Dieter},
title = {Display-adaptive Window Management for Irregular Surfaces},
booktitle = {Proceedings of the ACM International Conference on Interactive Tabletops and Surfaces},
series = {ITS '11},
year = {2011},
isbn = {978-1-4503-0871-7},
location = {Kobe, Japan},
pages = {222--231},
numpages = {10},
url = {http://doi.acm.org/10.1145/2076354.2076394},
doi = {10.1145/2076354.2076394},
acmid = {2076394},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {irregular displays, window management},
}
Abstract: Evaluating, comparing, and interpreting related pieces of information are tasks that are commonly performed during visual
data analysis and in many kinds of information-intensive work. Synchronized visual highlighting of related elements is a well-known
technique used to assist this task. An alternative approach, which is more invasive but also more expressive, is visual linking, in which
line connections are rendered between related elements. In this work, we present context-preserving visual links as a new method for
generating visual links. The method specifically aims to fulfill the following two goals: first, visual links should minimize the occlusion
of important information; second, links should visually stand out from surrounding information by minimizing visual interference.
We employ an image-based analysis of visual saliency to determine the important regions in the original representation. A consequence
of the image-based approach is that our technique is application-independent and can be employed in a large number of
visual data analysis scenarios in which the underlying content cannot or should not be altered.
We conducted a controlled experiment that indicates that users can find linked elements in complex visualizations more quickly and
with greater subjective satisfaction than in complex visualizations in which plain highlighting is used. Context-preserving visual links
were perceived as visually more attractive than traditional visual links that do not account for the context information.
InfoVis '11 Best Paper Award IEEE Transactions on Visualization and Computer Graphics (InfoVis '11), 2011
@ARTICLE{Steinberger:2011:CVL:TVCG.2011.183,
author={Steinberger, M. and Waldner, M. and Streit, M. and Lex, A. and Schmalstieg, D.},
journal={IEEE Transactions on Visualization and Computer Graphics},
title={Context-Preserving Visual Links},
year={2011},
volume={17},
number={12},
pages={2249-2258},
keywords={data analysis;data visualisation;complex visualizations;context preserving visual links;image based analysis;information intensive work;information occlusion;synchronized visual highlighting;visual data analysis;visual interference;visual saliency;Data visualization;Histograms;Image color analysis;Visual links;connectedness;highlighting;image-based;routing;saliency.},
doi={10.1109/TVCG.2011.183},
ISSN={1077-2626},
month={Dec},
}
Abstract: This paper presents a new method to control graceful scene degradation in complex ray-based rendering
environments. It proposes to constrain the image sampling density with object features, which are known to support the
comprehension of the three-dimensional shape. The presented method uses Non-Photorealistic Rendering (NPR) techniques to
extract features such as silhouettes, suggestive contours, suggestive highlights, ridges and valleys. To map different
feature types to sampling densities, we also present an evaluation of the features' impact on the resulting image quality.
To reconstruct the image from sparse sampling data, we use linear interpolation on an adaptively aligned fractal pattern.
With this technique, we are able to present an algorithm that guarantees a desired minimal frame rate without much loss
of image quality. Our scheduling algorithm maximizes the use of each given time slice by rendering features in order of
their corresponding importance values until a time constraint is reached. We demonstrate how our method can be used to
boost and guarantee the rendering time in complex ray-based environments consisting of geometric as well as volumetric data.
NPAR '11 Best Paper Award in Rendering Non-photorealistic Animation and Rendering (NPAR '11), 2011
@inproceedings{Kainz:2011:SRP:2024676.2024685,
author = {Kainz, Bernhard and Steinberger, Markus and Hauswiesner, Stefan and Khlebnikov, Rostislav and Schmalstieg, Dieter},
title = {Stylization-based Ray Prioritization for Guaranteed Frame Rates},
booktitle = {Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Non-Photorealistic Animation and Rendering},
series = {NPAR '11},
year = {2011},
isbn = {978-1-4503-0907-3},
location = {Vancouver, British Columbia, Canada},
pages = {43--54},
numpages = {12},
url = {http://doi.acm.org/10.1145/2024676.2024685},
doi = {10.1145/2024676.2024685},
acmid = {2024685},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {guaranteed frame rates, line features, ray based image generation},
}
Abstract: In this paper we present importance-driven compositing window management, which considers windows not only as
basic rectangular shapes but also integrates the importance of the windows' content using a bottom-up visual attention model.
Based on this information, importance-driven compositing optimizes the spatial window layout for maximum visibility and interactivity
of occluded content in combination with see-through windows. We employ this technique for emerging window manager functions to minimize
information overlap caused by popping up windows or floating toolbars and to improve the access to occluded window content.
An initial user study indicates that our technique provides a more effective and satisfactory access to occluded information than
the well-adopted Alt+Tab window switching technique and see-through windows without optimized spatial layout.
CHI '11 Honorable Mention Award Human Factors in Computing Systems (CHI '11), 2011
@inproceedings{Waldner:2011:ICW:1978942.1979085,
author = {Waldner, Manuela and Steinberger, Markus and Grasset, Raphael and Schmalstieg, Dieter},
title = {Importance-driven Compositing Window Management},
booktitle = {Proceedings of the SIGCHI Conference on Human Factors in Computing Systems},
series = {CHI '11},
year = {2011},
isbn = {978-1-4503-0228-9},
location = {Vancouver, BC, Canada},
pages = {959--968},
numpages = {10},
url = {http://doi.acm.org/10.1145/1978942.1979085},
doi = {10.1145/1978942.1979085},
acmid = {1979085},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {compositing window management, space management, transparency, visual saliency},
}
Abstract: A common challenge in interactive image generation is maintaining high interactivity of the applications that
use computationally demanding rendering algorithms. This is usually achieved by sacrificing some of the image quality in order to
decrease the rendering time. Most such algorithms achieve interactive frame rates while trying to preserve as much image quality
as possible by applying the reduction steps non-uniformly. However, high-end rendering systems, such as those presented by Parker et al. in 2010,
aim to generate highly realistic images of very complex scenes. In such systems ordinary sampling approaches often give visually unacceptable results.
To allow optimizing the ratio between the scene's sampling rate and the resulting perceptual quality, we present a new sampling strategy
which uses information about object features that are known to support the comprehension of 3D shape. We control the sampling
density by exploiting line extraction techniques commonly used in Non-Photorealistic Rendering.
Symposium on Interactive 3D Graphics and Games 2011 (I3D), 2011
@inproceedings{Kainz:2011:UPF:1944745.1944795,
author = {Kainz, B. and Steinberger, M. and Hauswiesner, S. and Khlebnikov, R. and Kalkofen, D. and Schmalstieg, D.},
title = {Using Perceptual Features to Prioritize Ray-based Image Generation},
booktitle = {Symposium on Interactive 3D Graphics and Games},
series = {I3D '11},
year = {2011},
isbn = {978-1-4503-0565-5},
location = {San Francisco, California},
pages = {215--215},
numpages = {1},
url = {http://doi.acm.org/10.1145/1944745.1944795},
doi = {10.1145/1944745.1944795},
acmid = {1944795},
publisher = {ACM},
address = {New York, NY, USA},
}
Abstract: We present an interactive rendering method for isosurfaces in a voxel grid. The underlying trivariate function
is represented as a spline wavelet hierarchy, which allows for adaptive (view-dependent) selection of the desired
level-of-detail by superimposing appropriately weighted basis functions. Different root finding techniques are compared
with respect to their precision and efficiency. Both wavelet reconstruction and root finding are implemented
in CUDA to utilize the high computational performance of Nvidia's hardware and to obtain high quality results.
We tested our methods with datasets of up to 512³ voxels and demonstrate interactive frame rates for a viewport
size of up to 1024x768 pixels.
Eurographics/IEEE VGTC Symposium on Volume Graphics, 2010
@inproceedings {VG:VG10:013-020,
booktitle = {IEEE/ EG Symposium on Volume Graphics},
editor = {Ruediger Westermann and Gordon Kindlmann},
title = {{Wavelet-based Multiresolution Isosurface Rendering}},
author = {Steinberger, Markus and Grabner, Markus},
year = {2010},
publisher = {The Eurographics Association},
ISSN = {1727-8376},
ISBN = {978-3-905674-23-1},
DOI = {10.2312/VG/VG10/013-020}
}
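The root-finding step the paper compares can be illustrated with plain bisection along the ray: given an interval where the reconstructed function crosses the isovalue, halve it until the hit is localized. Here f stands in for the spline-wavelet reconstruction of the trivariate function (supplied as a __device__ function), and the fixed iteration count is an arbitrary choice.

// Bisection sketch for ray-isosurface intersection: [t0,t1] must
// bracket a sign change of f(t) - iso; each step halves the interval.
__device__ float bisectIso(float t0, float t1, float iso,
                           float (*f)(float), int steps = 24)
{
    float f0 = f(t0) - iso;
    for (int i = 0; i < steps; ++i)
    {
        float tm = 0.5f * (t0 + t1);
        float fm = f(tm) - iso;
        if (f0 * fm <= 0.0f) t1 = tm;   // root lies in the left half
        else { t0 = tm; f0 = fm; }      // root lies in the right half
    }
    return 0.5f * (t0 + t1);
}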
Abstract: In this paper we propose a new technique for isosurface
rendering of volume data. Medical data visualization, for example,
relies on exact and interactive isosurface renderings. We
show how to construct a multiresolution view of the data
using bi-orthogonal spline wavelets and how to perform
fast rendering using raycasting implemented on the GPU.
This approach benefits from the properties of both the
wavelet transform and the reconstruction using three-dimensional
splines. The smoothness of surfaces is zoom-level
independent, and data can be compressed to speed up
rendering while still being able to show full detail quickly.
Ray evaluation is implemented in model space to enable
perspective rendering. Because the isosurface
is not extracted from the data beforehand, the isosurface
level as well as the current resolution level can be changed
without any further computation.
CESCG '09 3rd Best Paper Award Central European Seminar on Computer Graphics (CESCG '09), 2009
@inproceedings{steinbergermultiresolution09,
title={Multiresolution Isosurface Rendering},
author={Steinberger, Markus},
booktitle = {Proceedings of Central European Seminar on Computer Graphics (CESCG'09)},
year = {2009}
}