Publications

I started publishing during my master's studies in 2009. Most of my publications are on GPU algorithms, GPU work scheduling, and related topics, with a focus on rendering. However, I also publish in the fields of visualization, information visualization, human-computer interaction, and augmented reality. For a current list of citations, see my Google Scholar page. Below you will find my self-maintained list of publications with some additional information and material. If you have any questions about individual publications, do not hesitate to contact me.

  • C38
    MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions
    Mathias Parger, Chengcheng Tang, Thomas Neff, Christopher D. Twigg, Cem Keskin, Robert Wang, Markus Steinberger:
    MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions
    Abstract: Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions – without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos.
    Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023
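    The spherical buffer idea can be illustrated with a minimal one-dimensional sketch (a hypothetical `SphericalBuffer` class for illustration only; the paper operates on two-dimensional feature maps in GPU kernels). The key point: with a wrap-around mapping `coord % size`, shifting the visible window under camera motion never relocates cached entries, and only newly unveiled cells are reset.

```python
class SphericalBuffer:
    """1-D sketch of a wrap-around ("spherical") buffer: the storage slot
    of a world coordinate is simply coord % size, so panning the window
    never moves existing cached entries -- only newly unveiled cells are
    reset."""

    def __init__(self, size):
        self.size = size
        self.start = 0              # first world coordinate in the window
        self.data = [None] * size   # cached per-cell results

    def shift(self, delta):
        """Pan the window right by `delta` cells; reset only the slots of
        the newly unveiled coordinates (they held stale far-left cells)."""
        for coord in range(self.start + self.size,
                           self.start + self.size + delta):
            self.data[coord % self.size] = None
        self.start += delta

    def __getitem__(self, coord):
        assert self.start <= coord < self.start + self.size
        return self.data[coord % self.size]

    def __setitem__(self, coord, value):
        assert self.start <= coord < self.start + self.size
        self.data[coord % self.size] = value
```

    After `shift(1)`, a value cached at world coordinate 3 stays in place, while coordinate 4 enters the window uninitialized.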
  • J41
    Effect-based Multi-viewer Caching for Cloud-native Rendering
    Alexander Weinrauch, Wolfgang Tatzgern, Pascal Stadlbauer, Alexis Crickx, Jozef Hladky, Arno Coomans, Martin Winter, Joerg H. Mueller, Markus Steinberger:
    Effect-based Multi-viewer Caching for Cloud-native Rendering
    Abstract: With cloud computing becoming ubiquitous, it appears as if virtually everything can be offered as-a-service. However, real-time rendering in the cloud forms a notable exception, where cloud adoption stops at running individual game instances in compute centers. In this paper, we explore whether a cloud-native rendering architecture is viable and scales to multi-client rendering scenarios. To this end, we propose world-space and on-surface caches to share rendering computations among viewers placed in the same virtual world. We discuss how caches can be utilized on an effect basis and demonstrate that a large amount of computation can be saved as the number of viewers in a scene increases. Caches can easily be set up for various effects, including ambient occlusion, direct illumination, and diffuse global illumination. Our results underline that the image quality of cached rendering is on par with screen-space rendering, and due to its simplicity and inherent coherence, cached rendering may even have advantages in single-viewer setups. Analyzing the runtime and communication costs, we show that cached rendering is already viable in multi-GPU systems. Building on top of our research, cloud-native rendering may be just around the corner.
    ACM Transactions on Graphics (SIGGRAPH '23), 2023
  • J40
    Trim Regions for Online Computation of From-Region Potentially Visible Sets
    Philip Voglreiter, Bernhard Kerbl, Alexander Weinrauch, Joerg H. Mueller, Thomas Neff, Markus Steinberger, Dieter Schmalstieg:
    Trim Regions for Online Computation of From-Region Potentially Visible Sets
    Abstract: Visibility computation is a key element in computer graphics applications. More specifically, a from-region potentially visible set (PVS) is an established tool in rendering acceleration, but its high computational cost means a from-region PVS is almost always precomputed. Precomputation restricts the use of PVS to static scenes and leads to high storage cost, in particular, if we need fine-grained regions. For dynamic applications, such as streaming content over a variable-bandwidth network, online PVS computation with configurable region size is required. We address this need with trim regions, a new method for generating from-region PVS for arbitrary scenes in real time. Trim regions perform controlled erosion of object silhouettes in image space, implicitly applying the shrinking theorem known from previous work. Our algorithm is the first that applies automatic shrinking to unconstrained 3D scenes, including non-manifold meshes, and does so in real time using an efficient GPU execution model. We demonstrate that our algorithm generates a tight PVS for complex scenes and outperforms previous online methods for from-viewpoint and from-region PVS. It runs at 60 Hz for realistic game scenes consisting of millions of triangles and computes PVS with a tightness matching or surpassing existing approaches.
    ACM Transactions on Graphics (SIGGRAPH '23), 2023
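    The image-space silhouette erosion at the core of trim regions can be sketched with plain binary morphology (a simplified CPU stand-in using NumPy; the actual method performs controlled, per-region erosion amounts in real time on the GPU):

```python
import numpy as np

def erode(mask, r):
    """Erode a binary occluder mask by r pixels (4-neighbourhood).
    Shrinking occluder silhouettes like this is what makes visibility
    computed from one viewpoint conservative for a whole view region
    (the "shrinking theorem" the abstract refers to)."""
    out = mask.copy()
    for _ in range(r):
        shrunk = out.copy()
        shrunk[1:, :] &= out[:-1, :]   # a pixel survives only if all
        shrunk[:-1, :] &= out[1:, :]   # four neighbours are also set
        shrunk[:, 1:] &= out[:, :-1]
        shrunk[:, :-1] &= out[:, 1:]
        # outside the image counts as empty, so border pixels are removed
        shrunk[0, :] = shrunk[-1, :] = False
        shrunk[:, 0] = shrunk[:, -1] = False
        out = shrunk
    return out
```

    Eroding a 5x5 solid square by one pixel leaves its 3x3 interior; by two pixels, only the center survives.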
  • C37
    Surface Light Cones: Sharing Direct Illumination for Efficient Multi-viewer Rendering
    Pascal Stadlbauer, Alexander Weinrauch, Wolfgang Tatzgern, Markus Steinberger:
    Surface Light Cones: Sharing Direct Illumination for Efficient Multi-viewer Rendering
    Abstract: Even though stochastic methods and hardware-supported ray tracing are increasingly used for computing direct illumination, the efficient real-time rendering of dynamic area light sources still forms a challenge. In this paper, we propose a method for representing and caching direct illumination information using a compact multi-cone representation that is stored on the surface of objects. While shading due to direct illumination is typically heavily view-dependent, the incoming radiance for surface points is view-independent. Relying on cones to represent the projection of the dominant visible light sources allows reusing the incoming radiance information across frames and even among multiple cameras or viewers within the same scene. Progressively refining and updating the cone structures not only allows adapting to dynamic scenes, but also leads to reduced noise levels in the output images compared to sampling-based methods. Relying on surface light cones allows rendering single-viewer setups 2-3x faster than random sampling, and 1.5-2x faster than reservoir-based sampling at the same quality. The main selling point for surface light cones is multi-camera rendering. For stereo rendering, our approach essentially halves the time required for determining direct light visibility. For rendering in the cloud, where multiple viewers are positioned close to one another, such as in virtual meetings, gathering locations in games, or online events such as virtual concerts, our approach can reduce overall rendering times by a factor of 20x for as few as 16 viewers in a scene compared to traditional light sampling. Finally, under heavily constrained ray budgets, where noise levels typically overshadow bias, surface light cones can dramatically reduce noise.
    High-Performance Graphics - Symposium Papers (HPG'23), 2023
  • C36
    Efficient Rendering of Participating Media for Multiple Viewpoints
    Robert Stojanovic, Alexander Weinrauch, Wolfgang Tatzgern, Andreas Kurz, Markus Steinberger:
    Efficient Rendering of Participating Media for Multiple Viewpoints
    Abstract: Achieving realism in modern games requires the integration of participating media effects, such as fog, dust, and smoke. However, due to the complex nature of scattering and partial occlusions within these media, real-time rendering of high-quality participating media remains a computational challenge. To address this challenge, traditional approaches of real-time participating media rendering involve storing temporary results in a view-aligned grid before ray marching through these cached values. In this paper, we investigate alternative hybrid world- and view-aligned caching methods that allow for the sharing of intermediate computations across cameras in a scene. This approach is particularly relevant for multi-camera setups, such as stereo rendering for VR and AR, local split-screen games, or cloud-based rendering for game streaming, where a large number of players may be in the same location. Our approach relies on a view-aligned grid for near-field computations, which enables us to capture high-frequency shadows in front of a viewer. Additionally, we use a world-space caching structure to selectively activate distant computations based on each viewer's visibility, allowing for the sharing of computations and maintaining high visual quality. The results of our evaluation demonstrate computational savings of up to 50% or more, without compromising visual quality.
    High-Performance Graphics - Symposium Papers (HPG'23), 2023
  • C35
    PSAO: Point-Based Split Rendering for Ambient Occlusion
    Thomas Neff, Brian Budge, Zhao Dong, Dieter Schmalstieg, Markus Steinberger:
    PSAO: Point-Based Split Rendering for Ambient Occlusion
    Abstract: Recent advances in graphics hardware have enabled ray tracing to produce high-quality ambient occlusion (AO) in real-time, which is not plagued by the artifacts typically found in real-time screen-space approaches. However, the high computational cost of ray tracing remains a significant hurdle for low-power devices like standalone VR headsets or smartphones. To address this challenge, inspired by point-based global illumination and texture-space split rendering, we propose point-based split ambient occlusion (PSAO), a novel split-rendering system that streams points sparsely from server to client. PSAO first evenly distributes points across the scene, and subsequently transmits only points that changed more than a given threshold, using an efficient hash grid to blend neighboring points for the final compositing pass on the client. PSAO outperforms recent texture-space shading approaches in terms of quality and required network bit rate, while demonstrating performance similar to commonly used lower-quality screen-space approaches. Our point-based split rendering representation lends itself to highly compressible signals such as AO and is scalable towards quality or bandwidth requirements by adjusting the number of points in the scene.
    High-Performance Graphics - Symposium Papers (HPG'23), 2023
  • C34
    Clouds in the Cloud: Efficient Cloud-Based Rendering of Real-Time Volumetric Clouds
    Alexander Weinrauch, Stephan Lorbek, Wolfgang Tatzgern, Pascal Stadlbauer, Markus Steinberger:
    Clouds in the Cloud: Efficient Cloud-Based Rendering of Real-Time Volumetric Clouds
    Abstract: Volumetric clouds play a crucial role in creating realistic, dynamic, and immersive virtual outdoor environments. However, rendering volumetric clouds in real-time presents a significant computational challenge on end-user devices. In this paper, we investigate the viability of moving computations to remote servers in the cloud and sharing them among many viewers in the same virtual world, without compromising the perceived quality of the final renderings. We propose an efficient rendering method for volumetric clouds and cloud shadows utilizing caches placed in the cloud layers and directly on the surface of objects. Volumetric cloud properties, like density and lighting, are cached on spheres positioned to represent cloud layers at varying heights. Volumetric cloud shadows are cached directly on the surfaces of receiving objects. This allows efficient rendering in scenarios where multiple viewers observe the same cloud formations by sharing redundant calculations and storing them over multiple frames. Due to the placement and structure of our caches, viewers on the ground still perceive plausible parallax when moving. In a user study, we found that viewers hardly perceive quality reductions even when computations are shared for viewers that are hundreds of meters apart. Due to the smoothness of the appearance of clouds, caching structures can use significantly reduced resolution and as such allow for efficient rendering even in single-viewer scenarios. Our quantitative experiments demonstrate computational cost savings proportional to the number of viewers placed in the scene when relying on our caches compared to traditional rendering.
    High-Performance Graphics - Symposium Papers (HPG'23), 2023
  • PA09
    Meshlet shading atlas
    Thomas Neff, Joerg H. Mueller, Markus Steinberger, Dieter Schmalstieg:
    Meshlet shading atlas
    Abstract: Aspects presented herein relate to methods and devices for graphics processing including an apparatus, e.g., a GPU. The apparatus may divide at least one scene into a plurality of meshlets, each of the meshlets including a plurality of primitives, and each of the primitives including a plurality of vertices. The apparatus may also calculate a pair of texture coordinates for each of the plurality of vertices. Further, the apparatus may select a size of each of the plurality of meshlets in the at least one scene based on the pair of the texture coordinates and based on a perspective projection of each of the plurality of meshlets. The apparatus may also calculate layout information in a meshlet atlas for each of the meshlets in the at least one scene. Moreover, the apparatus may shade each of a plurality of pixels in the meshlet atlas based on the calculated layout information.
    US Patent App. 17/934,159, 2023
  • C33
    AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms
    Michael Kenzel, Stefan Lemme, Richard Membarth, Matthias Kurtenacker, Hugo Devillers, Markus Steinberger, Philipp Slusallek:
    AnyQ: An Evaluation Framework for Massively-Parallel Queue Algorithms
    Abstract: Concurrent queue algorithms have been subject to extensive research. However, the target hardware and evaluation methodology on which the published results for any two given concurrent queue algorithms are based often share only minimal overlap. A meaningful comparison is, thus, exceedingly difficult. With the continuing trend towards more and more heterogeneous systems, it is becoming more and more important to not only evaluate and compare novel and existing queue algorithms across a wider range of target architectures, but to also be able to continuously re-evaluate queue algorithms in light of novel architectures and capabilities. To address this need, we present AnyQ, an evaluation framework for concurrent queue algorithms. We design a set of programming abstractions that enable the mapping of concurrent queue algorithms and benchmarks to a wide variety of target architectures. We demonstrate the effectiveness of these abstractions by showing that a queue algorithm expressed in a portable, high-level manner can achieve performance comparable to hand-crafted implementations. We design a system for testing and benchmarking queue algorithms. Using the developed framework, we investigate concurrent queue algorithm performance across a range of both CPU as well as GPU architectures. In hopes that it may serve the community as a starting point for building a common repository of concurrent queue algorithms as well as a base for future research, all code and data are made available as open source software at https://anydsl.github.io/anyq.
    IEEE International Parallel and Distributed Processing Symposium (IPDPS'23), 2023
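    As a flavor of the kind of algorithm AnyQ evaluates, here is a minimal bounded ring-buffer queue (a locking CPU sketch in Python for illustration only; the queue algorithms the framework actually targets are typically lock-free and portable across CPUs and GPUs):

```python
import threading

class BoundedQueue:
    """Minimal blocking ring-buffer queue: a fixed-capacity array with
    head/tail indices guarded by a lock and two condition variables."""

    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0          # next slot to dequeue from
        self.tail = 0          # next slot to enqueue into
        self.count = 0
        self.lock = threading.Lock()
        self.not_full = threading.Condition(self.lock)
        self.not_empty = threading.Condition(self.lock)

    def enqueue(self, item):
        with self.not_full:
            while self.count == len(self.buf):   # wait for free space
                self.not_full.wait()
            self.buf[self.tail] = item
            self.tail = (self.tail + 1) % len(self.buf)
            self.count += 1
            self.not_empty.notify()

    def dequeue(self):
        with self.not_empty:
            while self.count == 0:               # wait for an item
                self.not_empty.wait()
            item = self.buf[self.head]
            self.head = (self.head + 1) % len(self.buf)
            self.count -= 1
            self.not_full.notify()
            return item
```

    Benchmarks in this style pit many producer and consumer threads against such a queue and measure throughput; AnyQ does this portably across architectures.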
  • J39
    A Variational Loop Shrinking Analogy for Handle and Tunnel Detection and Reeb Graph Construction on Surfaces
    Alexander Weinrauch, Daniel Mlakar, Hans-Peter Seidel, Markus Steinberger, Rhaleb Zayer:
    A Variational Loop Shrinking Analogy for Handle and Tunnel Detection and Reeb Graph Construction on Surfaces
    Abstract: The humble loop shrinking property played a central role in the inception of modern topology but it has been eclipsed by more abstract algebraic formalisms. This is particularly true in the context of detecting relevant non-contractible loops on surfaces where elaborate homological and/or graph theoretical constructs are favored in algorithmic solutions. In this work, we devise a variational analogy to the loop shrinking property and show that it yields a simple, intuitive, yet powerful solution allowing a streamlined treatment of the problem of handle and tunnel loop detection. Our formalization tracks the evolution of a diffusion front randomly initiated on a single location on the surface. Capitalizing on a diffuse interface representation combined with a set of rules for concurrent front interactions, we develop a dynamic data structure for tracking the evolution on the surface encoded as a sparse matrix which serves for performing both diffusion numerics and loop detection and acts as the workhorse of our fully parallel implementation. The substantiated results suggest that our approach outperforms the state of the art and robustly copes with highly detailed geometric models. As a byproduct, our approach can be used to construct Reeb graphs by diffusion, thus avoiding commonly encountered issues when using Morse functions.
    Computer Graphics Forum (EG'23), 2023
  • J38
    QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction
    Jozef Hladky, Michael Stengel, Nicholas Vining, Bernhard Kerbl, Hans-Peter Seidel, Markus Steinberger:
    QuadStream: A Quad-Based Scene Streaming Architecture for Novel Viewpoint Reconstruction
    Abstract: Streaming rendered 3D content over a network to a thin client device, such as a phone or a VR/AR headset, brings high-fidelity graphics to platforms where it would not normally be possible due to thermal, power, or cost constraints. Streamed 3D content must be transmitted with a representation that is robust to both latency and potential network dropouts. Transmitting a video stream and reprojecting to correct for changing viewpoints fails in the presence of disocclusion events; streaming scene geometry and performing high-quality rendering on the client is not possible on limited-power mobile GPUs. To balance the competing goals of disocclusion robustness and minimal client workload, we introduce QuadStream, a new streaming content representation that reduces motion-to-photon latency by allowing clients to efficiently render novel views without artifacts caused by disocclusion events. Motivated by traditional macroblock approaches to video codec design, we decompose the scene seen from positions in a view cell into a series of quad proxies, or view-aligned quads from multiple views. By operating on a rasterized G-Buffer, our approach is independent of the representation used for the scene itself; the resulting QuadStream is an approximate geometric representation of the scene that can be reconstructed by a thin client to render both the current view and nearby adjacent views. Our technical contributions are an efficient parallel quad generation, merging, and packing strategy for proxy views covering potential client movement in a scene; a packing and encoding strategy that allows masked quads with depth information to be transmitted as a frame-coherent stream; and an efficient rendering approach for rendering our QuadStream representation into entirely novel views on thin clients. We show that our approach achieves superior quality compared both to video data streaming methods, and to geometry-based streaming.
    ACM Transactions on Graphics (SIGGRAPH Asia'22), 2022
  • C32
    AdaNeRF: Adaptive Sampling for Real-Time Rendering of Neural Radiance Fields
    Andreas Kurz, Thomas Neff, Zhaoyang Lv, Michael Zollhöfer, Markus Steinberger:
    AdaNeRF: Adaptive Sampling for Real-Time Rendering of Neural Radiance Fields
    Abstract: Novel view synthesis has recently been revolutionized by learning neural radiance fields directly from sparse observations. However, rendering images with this new paradigm is slow due to the fact that an accurate quadrature of the volume rendering equation requires a large number of samples for each ray. Previous work has mainly focused on speeding up the network evaluations that are associated with each sample point, e.g., via caching of radiance values into explicit spatial data structures, but this comes at the expense of model compactness. In this paper, we propose a novel dual-network architecture that takes an orthogonal direction by learning how to best reduce the number of required sample points. To this end, we split our network into a sampling and shading network that are jointly trained. Our training scheme employs fixed sample positions along each ray, and incrementally introduces sparsity throughout training to achieve high quality even at low sample counts. After fine-tuning with the target number of samples, the resulting compact neural representation can be rendered in real-time. Our experiments demonstrate that our approach outperforms concurrent compact neural representations in terms of quality and frame rate and performs on par with highly efficient hybrid representations.
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
  • PA08
    Methods and apparatus for order-independent occlusion computations
    Philip Voglreiter, Dieter Schmalstieg, Markus Steinberger:
    Methods and apparatus for order-independent occlusion computations
    Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus can determine geometry information for each of a plurality of primitives associated with a viewpoint in a scene. The apparatus can also calculate at least one of surface information and disocclusion information based on the geometry information for each of the plurality of primitives, where the surface information and the disocclusion information may be associated with a volumetric grid based on a viewing area corresponding to the viewpoint. Also, the apparatus can calculate visibility information for each of the plurality of primitives based on at least one of the surface information and the disocclusion information, where the visibility information may be associated with the volumetric grid. The apparatus can also determine whether each of the plurality of primitives is visible based on the visibility information for each of the plurality of primitives.
    US Patent 11,380,047, 2022
  • C31
    DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
    Mathias Parger, Chengcheng Tang, Christopher D. Twigg, Cem Keskin, Robert Wang, Markus Steinberger:
    DeltaCNN: End-to-End CNN Inference of Sparse Frame Differences in Videos
    Abstract: Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Since consecutive frames are typically highly similar, large parts of this computation are redundant. DeltaCNN is a sparse CNN framework that reduces the cost by processing only those pixels with significant updates over the previous frame. Rather than gating only the network input, sparse frame differences are propagated end-to-end through all layers of the network on the GPU, enabling practical speedups with only marginal differences in accuracy.
    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
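    The frame-difference gating at the heart of DeltaCNN can be sketched in a few lines (a hypothetical `delta_gate` helper; the actual system propagates the sparse deltas through every layer in custom GPU kernels):

```python
import numpy as np

def delta_gate(prev_frame, frame, threshold):
    """Keep only pixels whose change since the previous frame exceeds a
    threshold; everything else is zeroed out so cached activations can be
    reused. Returns the sparse delta and the update mask."""
    delta = frame - prev_frame
    mask = np.abs(delta) > threshold        # pixels with significant updates
    sparse_delta = np.where(mask, delta, 0.0)
    return sparse_delta, mask
```

    For mostly static video, the mask is sparse, so downstream convolutions only need to touch a small fraction of pixels.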
  • J37
    Meshlets and How to Shade Them: A Study on Texture-Space Shading
    Thomas Neff, Joerg H. Mueller, Markus Steinberger, Dieter Schmalstieg:
    Meshlets and How to Shade Them: A Study on Texture-Space Shading
    Abstract: Commonly used image-space layouts of shading points, such as used in deferred shading, are strictly view-dependent, which restricts efficient caching and temporal amortization. In contrast, texture-space layouts can represent shading on all surface points and can be tailored to the needs of a particular application. However, the best grouping of shading points—which we call a shading unit—in texture space remains unclear. Choices of shading unit granularity (how many primitives or pixels per unit) and in shading unit parametrization (how to assign texture coordinates to shading points) lead to different outcomes in terms of final image quality, overshading cost, and memory consumption. Among the possible choices, shading units consisting of larger groups of scene primitives, so-called meshlets, remain unexplored as of yet. In this paper, we introduce a taxonomy for analyzing existing texture-space shading methods based on the group size and parametrization of shading units. Furthermore, we introduce a novel texture-space layout strategy that operates on large shading units: the meshlet shading atlas. We experimentally demonstrate that the meshlet shading atlas outperforms previous approaches in terms of image quality, run-time performance and temporal upsampling for a given number of fragment shader invocations. The meshlet shading atlas lends itself to work together with popular cluster-based rendering of meshes with high geometric detail.
    Computer Graphics Forum (EG'22), 2022
  • O05
    CUDA and Application to Task-Based Programming
    Bernhard Kerbl, Michael Kenzel, Martin Winter, Markus Steinberger:
    CUDA and Application to Task-Based Programming
    Outline: Since its inception, the CUDA programming model has been continuously evolving. Because the CUDA toolkit aims to consistently expose cutting-edge capabilities for general purpose compute jobs to its users, the added features in each new version reflect the rapid changes that we observe in GPU architectures. Over the years, the changes in hardware, a growing scope of built-in functions and libraries, as well as an advancing C++ standard compliance have expanded the design choices when coding for CUDA, and significantly altered the directives to achieve peak performance. In this tutorial, we give a thorough introduction to the CUDA toolkit, demonstrate how a contemporary application can benefit from recently introduced features and how they can be applied to task-based GPU scheduling in particular.
    To provide a profound understanding of how CUDA applications can achieve peak performance, Part 1 of this tutorial outlines the modern CUDA architecture. Following a basic introduction, we expose how language features are linked to – and constrained by – the underlying physical hardware components. Furthermore, we describe common applications for massively parallel programming, offer a detailed breakdown of potential issues and list ways to mitigate performance impacts. An exemplary analysis of PTX and SASS snippets illustrates how code patterns in CUDA are mapped to actual hardware instructions.
    In Part 2, we will focus on novel features that were enabled by the arrival of CUDA 10+ toolkits and the Volta+ architectures, such as ITS, tensor cores and the graph API. In addition to basic use case demonstrations, we outline our own experiences with these capabilities and their potential performance benefits. We also discuss how long-standing best practices are affected by these changes and describe common caveats for dealing with legacy code on recent GPU models. We show how these considerations can be implemented in practice by presenting state-of-the-art research into task-based GPU scheduling, and how the dynamic adjustment of thread roles and group configurations can significantly increase performance.
    Eurographics Tutorials'22, 2022
  • PA07
    Billboard layers in object-space rendering
    Dieter Schmalstieg, Markus Steinberger, Wolfgang Tatzgern:
    Billboard layers in object-space rendering
    Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus may configure a plurality of billboards associated with a viewpoint of a first frame of a plurality of frames, the plurality of billboards being configured in one or more layers at least partially around the viewpoint, the configuration of the plurality of billboards being based on one or more volumetric elements between at least one of the plurality of billboards and the viewpoint. The apparatus may also render an image associated with each of the one or more volumetric elements between at least one billboard of the plurality of billboards and the viewpoint, the rendered image including a set of pixels. The apparatus may also store data in the at least one billboard based on the rendered image associated with each of the one or more volumetric elements, the data corresponding to the set of pixels.
    US Patent App. 17/400,031, 2022
  • PA06
    Image-space function transmission
    Dieter Schmalstieg, Pascal Stadlbauer, Markus Steinberger:
    Image-space function transmission
    Abstract: The present disclosure relates to methods and apparatus for graphics processing at a server and/or a client device. In some aspects, the apparatus may convert application data for at least one frame, the application data corresponding to one or more image functions or one or more data channels. The apparatus may also encode the application data for the at least one frame, the application data being associated with a data stream, the application data being encoded via a video encoding process. The apparatus may also transmit the encoded application data for the at least one frame. Additionally, the apparatus may receive application data for at least one frame, the application data being associated with a data stream. The apparatus may also decode the application data for the at least one frame; and convert the application data for the at least one frame.
    US Patent App. 17/400,048, 2022
  • PA05
    Compressed geometry rendering and streaming
    Dieter Schmalstieg, Markus Steinberger, Daniel Mlakar:
    Compressed geometry rendering and streaming
    Abstract: The present disclosure relates to methods and apparatus for graphics processing. The apparatus may identify at least one mesh associated with at least one frame. The apparatus may also divide the at least one mesh into a plurality of groups of primitives, each of the plurality of groups of primitives including at least one primitive and a plurality of vertices. The apparatus may also compress the plurality of groups of primitives into a plurality of groups of compressed primitives, the plurality of groups of compressed primitives being associated with random access. Additionally, the apparatus may decompress the plurality of groups of compressed primitives, at least one first group of the plurality of groups of compressed primitives being decompressed in parallel with at least one second group of the plurality of groups of compressed primitives.
    US Patent App. 17/400,065, 2022
  • C30
    Are van Emde Boas trees viable on the GPU?
    Benedikt Mayr, Alexander Weinrauch, Mathias Parger, Markus Steinberger:
    Are van Emde Boas trees viable on the GPU?
    Abstract: Van Emde Boas trees show an asymptotic query complexity surpassing the performance of traditional data structures for performing search queries in large data sets. However, their implementation complexity and large constant overheads prohibit their widespread use. In this work, we ask whether van Emde Boas trees are viable on the GPU. We present a novel algorithm to construct a van Emde Boas tree utilizing the parallel compute power and memory bandwidth of modern GPUs. By analyzing the structure of a sorted data set, our method is able to build a van Emde Boas tree efficiently in parallel with little thread synchronization. We compare querying data using a van Emde Boas tree and binary search on the GPU and show that for very large data sets, the van Emde Boas tree outperforms a binary search by up to 1.2x while similarly increasing space requirements. Overall, we confirm that van Emde Boas trees are viable for certain scenarios on the GPU.
    IEEE High Performance Extreme Computing, 2021
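    For reference, the data structure under discussion can be sketched in a few dozen lines of serial Python. This is a textbook van Emde Boas tree supporting insert and successor, not the paper's parallel GPU construction:

```python
class VEB:
    """Minimal van Emde Boas tree over the universe {0, ..., u-1}, u = 2^k.
    insert and successor run in O(log log u): each step recurses into a
    sub-tree whose universe size is roughly the square root of the parent's."""

    def __init__(self, u):
        self.u = u
        self.min = None  # the minimum is stored lazily, not in any cluster
        self.max = None
        if u > 2:
            k = u.bit_length() - 1          # u = 2^k
            self.lower = 1 << (k // 2)      # universe size of each cluster
            self.upper = 1 << (k - k // 2)  # number of clusters
            self.summary = VEB(self.upper)
            self.clusters = [VEB(self.lower) for _ in range(self.upper)]

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x < self.min:
            x, self.min = self.min, x       # new minimum stays lazy
        if self.u > 2:
            h, l = divmod(x, self.lower)
            if self.clusters[h].min is None:
                self.summary.insert(h)      # first element in this cluster
            self.clusters[h].insert(l)
        if x > self.max:
            self.max = x

    def successor(self, x):
        """Smallest stored value strictly greater than x, or None."""
        if self.u == 2:
            return 1 if x == 0 and self.max == 1 else None
        if self.min is not None and x < self.min:
            return self.min
        h, l = divmod(x, self.lower)
        if self.clusters[h].max is not None and l < self.clusters[h].max:
            return h * self.lower + self.clusters[h].successor(l)
        nh = self.summary.successor(h)      # jump to the next occupied cluster
        if nh is None:
            return None
        return nh * self.lower + self.clusters[nh].min
```

    The recursive cluster/summary layout is exactly what makes the structure hard on GPUs: queries chase pointers through many small sub-trees, which is the memory-access pattern the paper's evaluation probes.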
  • J36
    DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks
    Thomas Neff, Pascal Stadlbauer, Mathias Parger, Andreas Kurz, Joerg H Mueller, Chakravarty R. A. Chaitanya, Anton Kaplanyan, Markus Steinberger:
    DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks
    Abstract: The recent research explosion around Neural Radiance Fields (NeRFs) shows that there is immense potential for implicitly storing scene and lighting information in neural networks, e.g., for novel view generation. However, one major limitation preventing the widespread use of NeRFs is the prohibitive computational cost of excessive network evaluations along each view ray, requiring dozens of petaFLOPS when aiming for real-time rendering on current devices. We show that the number of samples required for each view ray can be significantly reduced when local samples are placed around surfaces in the scene. To this end, we propose a depth oracle network, which predicts ray sample locations for each view ray with a single network evaluation. We show that using a classification network around logarithmically discretized and spherically warped depth values is essential to encode surface locations rather than directly estimating depth. The combination of these techniques leads to DONeRF, a dual network design with a depth oracle network as a first step and a locally sampled shading network for ray accumulation. With our design, we reduce the inference costs by up to 48x compared to NeRF. Using an off-the-shelf inference API in combination with simple compute kernels, we are the first to render raymarching-based neural representations at interactive frame rates (15 frames per second at 800x800) on a single GPU. At the same time, since we focus on the important parts of the scene around surfaces, we achieve equal or better quality compared to NeRF.
    Computer Graphics Forum (EGSR'21), 2021
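    The logarithmic depth discretization that the oracle classifies over can be illustrated in a few lines. Function names and parameter values here are our own, and the paper's additional spherical warping is omitted; this only shows why log-spaced bins place more resolution near the camera:

```python
import math

def log_depth_bin(d, near, far, num_bins):
    """Map a depth d in [near, far] to one of num_bins logarithmically
    spaced classification bins (bins are denser close to the camera)."""
    t = math.log(d / near) / math.log(far / near)  # normalized log depth in [0, 1]
    return min(int(t * num_bins), num_bins - 1)

def bin_center_depth(i, near, far, num_bins):
    """Inverse mapping: representative depth at the center of bin i."""
    t = (i + 0.5) / num_bins
    return near * (far / near) ** t
```

    A classification network predicting a distribution over such bins gives candidate sample depths along the ray with a single evaluation, which is the role the depth oracle plays in the dual-network design.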
  • J35
    UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality
    Mathias Parger, Chengcheng Tang, Yuanlu Xu, Christopher D. Twigg, Lingling Tao, Yijing Li, Rob Wang, Markus Steinberger:
    UNOC: Understanding Occlusion for Embodied Presence in Virtual Reality
    Abstract: Tracking body and hand motions in 3D space is essential for social and self-presence in augmented and virtual environments. Unlike the popular 3D pose estimation setting, the problem is often formulated as egocentric tracking based on embodied perception (e.g., egocentric cameras, handheld sensors). In this article, we propose a new data-driven framework for egocentric body tracking, targeting challenges of omnipresent occlusions in optimization-based methods (e.g., inverse kinematics solvers). We first collect a large-scale motion capture dataset with both body and finger motions using optical markers and inertial sensors. This dataset focuses on social scenarios and captures ground truth poses under self-occlusions and body-hand interactions. We then simulate the occlusion patterns in head-mounted camera views on the captured ground truth using a ray casting algorithm and learn a deep neural network to infer the occluded body parts. Our experiments show that our method is able to generate high-fidelity embodied poses by applying the proposed method to the task of real-time egocentric body tracking, finger motion synthesis, and 3-point inverse kinematics.
    IEEE Transactions on Visualization and Computer Graphics, 2021
  • C29
    Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures
    Daniel Mlakar, Martin Winter, Mathias Parger, Markus Steinberger:
    Speculative Parallel Reverse Cuthill-McKee Reordering on Multi- and Many-core Architectures
    Abstract: Bandwidth reduction of sparse matrices is used to reduce fill-in of linear solvers and to increase performance of other sparse matrix operations, e.g., sparse matrix vector multiplication in iterative solvers. To compute a bandwidth reducing permutation, Reverse Cuthill-McKee (RCM) reordering is often applied, which is challenging to parallelize, as its core is inherently serial. As many-core architectures, like the GPU, offer subpar single-threading performance and are typically only connected to high-performance CPU cores via a slow memory bus, neither computing RCM on the GPU nor moving the data to the CPU are viable options. Nevertheless, reordering matrices, potentially multiple times in-between operations, might be essential for high throughput. Still, to the best of our knowledge, we are the first to propose an RCM implementation that can execute on multicore CPUs and many-core GPUs alike, moving the computation to the data rather than vice versa. Our algorithm parallelizes RCM into mostly independent batches of nodes. For every batch, a single CPU-thread/a GPU thread-block speculatively discovers child nodes and sorts them according to the RCM algorithm. Before writing their permutation, we re-evaluate the discovery and build new batches. To increase parallelism and reduce dependencies, we create a signaling chain along successive batches and introduce early signaling conditions. In combination with a parallel work queue, new batches are started in order and the resulting RCM permutation is identical to the ground-truth single-threaded algorithm. We propose the first RCM implementation that runs on the GPU. It achieves several orders of magnitude speed-up over NVIDIA’s single-threaded cuSolver RCM implementation and is significantly faster than previous parallel CPU approaches. Our results are especially significant for many-core architectures, as it is now possible to include RCM reordering into sequences of sparse matrix operations without major performance loss.
    IEEE International Parallel and Distributed Processing Symposium (IPDPS'21), 2021
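    The ground-truth serial RCM that the parallel version must reproduce bit-exactly can be sketched as follows (a textbook reference with our own naming, not the paper's batched, speculative implementation):

```python
from collections import deque

def rcm_ordering(adj):
    """Reference single-threaded Reverse Cuthill-McKee ordering.
    adj: adjacency lists of an undirected graph, {node: [neighbors]}.
    Returns a node permutation that tends to reduce matrix bandwidth."""
    degree = {v: len(adj[v]) for v in adj}
    visited = set()
    order = []
    # Handle every connected component, starting from a minimum-degree node.
    for start in sorted(adj, key=lambda v: degree[v]):
        if start in visited:
            continue
        visited.add(start)
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            # Enqueue unvisited neighbors in order of increasing degree --
            # this sorting step is what the paper performs speculatively
            # per batch before re-evaluating the discovery.
            for w in sorted(adj[v], key=lambda w: degree[w]):
                if w not in visited:
                    visited.add(w)
                    queue.append(w)
    order.reverse()  # the "reverse" in Reverse Cuthill-McKee
    return order
```

    The inherently serial part is visible here: each node's children may only be ordered once its predecessor's are, which is the dependency the signaling chain in the paper relaxes.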
  • O04
    CUDA and Application to Task-Based Programming
    Michael Kenzel, Bernhard Kerbl, Martin Winter, Markus Steinberger:
    CUDA and Application to Task-Based Programming
    Outline: Since its inception, the CUDA programming model has been continuously evolving. Because the CUDA toolkit aims to consistently expose cutting-edge capabilities for general purpose compute jobs to its users, the added features in each new version reflect the rapid changes that we observe in GPU architectures. Over the years, the changes in hardware, a growing scope of built-in functions and libraries, as well as an advancing C++ standard compliance have expanded the design choices when coding for CUDA, and significantly altered the directives to achieve peak performance. In this tutorial, we give a thorough introduction to the CUDA toolkit, demonstrate how a contemporary application can benefit from recently introduced features and how they can be applied to task-based GPU scheduling in particular.
    To provide a profound understanding of how CUDA applications can achieve peak performance, Part 1 of this tutorial outlines the modern CUDA architecture. Following a basic introduction, we expose how language features are linked to – and constrained by – the underlying physical hardware components. Furthermore, we describe common applications for massively parallel programming, offer a detailed breakdown of potential issues and list ways to mitigate performance impacts. An exemplary analysis of PTX and SASS snippets illustrates how code patterns in CUDA are mapped to actual hardware instructions.
    In Part 2, we will focus on novel features that were enabled by the arrival of CUDA 10+ toolkits and the Volta+ architectures, such as ITS, tensor cores and the graph API. In addition to basic use case demonstrations, we outline our own experiences with these capabilities and their potential performance benefits. We also discuss how long-standing best practices are affected by these changes and describe common caveats for dealing with legacy code on recent GPU models. We show how these considerations can be implemented in practice by presenting state-of-the-art research into task-based GPU scheduling, and how the dynamic adjustment of thread roles and group configurations can significantly increase performance.
    Eurographics Tutorials'21, 2021
  • J34
    SnakeBinning: Efficient Temporally Coherent Triangle Packing for Shading Streaming
    Jozef Hladky, Hans-Peter Seidel, Markus Steinberger:
    SnakeBinning: Efficient Temporally Coherent Triangle Packing for Shading Streaming
    Abstract: Streaming rendering, e.g., rendering in the cloud and streaming via a mobile connection, suffers from increased latency and unreliable connections. High quality framerate upsampling can hide these issues, especially when capturing shading into an atlas and transmitting it alongside geometric information. The captured shading information must consider triangle footprints and temporal stability to ensure efficient video encoding. Previous approaches only consider either temporal stability or sample distributions, but none focuses on both. With SnakeBinning, we present an efficient triangle packing approach that adjusts sample distributions and caters for temporal coherence. Using a multi-dimensional binning approach, we enforce tight packing among triangles while creating optimal sample distributions. Our binning is built on top of hardware supported real-time rendering where bins are mapped to individual pixels in a virtual framebuffer. Fragment shader interlock and atomic operations enforce global ordering of triangles within each bin, and thus temporal coherence according to the primitive order is achieved. Resampling the bin distribution guarantees high occupancy among all bins and a dense atlas packing. Shading samples are directly captured into the atlas using a rasterization pass, adjusting samples for perspective effects and creating a tight packing. Comparison to previous atlas packing approaches shows that our approach is faster than previous work and achieves the best sample distributions while maintaining temporal coherence. In this way, SnakeBinning achieves the highest rendering quality under equal atlas memory requirements. At the same time, its temporal coherence ensures that we require equal or less bandwidth than previous state-of-the-art. As SnakeBinning outperforms previous approaches in all relevant aspects, it is the preferred choice for texture-based streaming rendering.
    Computer Graphics Forum (Eurographics'21), 2021
  • J33
    Temporally Adaptive Shading Reuse for Real-Time Rendering and Virtual Reality
    Joerg H. Mueller, Thomas Neff, Philip Voglreiter, Markus Steinberger, Dieter Schmalstieg:
    Temporally Adaptive Shading Reuse for Real-Time Rendering and Virtual Reality
    Abstract: Temporal coherence has the potential to enable a huge reduction of shading costs in rendering. Existing techniques focus either only on spatial shading reuse or cannot adaptively choose temporal shading frequencies. We find that temporal shading reuse is possible for extended periods of time for a majority of samples, and we show under which circumstances users perceive temporal artifacts. Our analysis implies that we can approximate shading gradients to efficiently determine when and how long shading can be reused. Whereas visibility usually stays temporally coherent from frame to frame for more than 90% of samples, we find that even in heavily animated game scenes with advanced shading, typically more than 50% of shading is also temporally coherent. To exploit this potential, we introduce a temporally adaptive shading framework and apply it to two real-time methods. Its application saves more than 57% of the shader invocations, reducing overall rendering times in virtual reality applications without a noticeable loss in visual quality. Overall, our work shows that there is significantly more potential for shading reuse than currently exploited.
    ACM Transactions on Graphics (TOG)
    Presented at SIGGRAPH 2021, 2021

  • C28
    Are dynamic memory managers on GPUs slow?: a survey and benchmarks
    Martin Winter, Mathias Parger, Daniel Mlakar, Markus Steinberger:
    Are dynamic memory managers on GPUs slow?: a survey and benchmarks
    Abstract: Dynamic memory management on GPUs is generally understood to be a challenging topic. On current GPUs, hundreds of thousands of threads might concurrently allocate new memory or free previously allocated memory. This leads to problems with thread contention, synchronization overhead and fragmentation. Various approaches have been proposed in the last ten years and we set out to evaluate them on a level playing field on modern hardware to answer the question of whether dynamic memory managers are as slow as commonly thought. In this survey paper, we provide a consistent framework to evaluate all publicly available memory managers in a large set of scenarios. We summarize each approach and thoroughly evaluate allocation performance (thread-based as well as warp-based), and look at performance scaling, fragmentation and real-world performance considering a synthetic workload as well as updating dynamic graphs. We discuss the strengths and weaknesses of each approach and provide guidelines for the respective best usage scenario. We provide a unified interface to integrate any of the tested memory managers into an application and switch between them for benchmarking purposes. Given our results, we can dispel some of the dread associated with dynamic memory managers on the GPU.
    ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP'21), 2021
  • PA04
    Methods and apparatus for improving subpixel visibility
    Mark Dokter, Dieter Schmalstieg, Markus Steinberger:
    Methods and apparatus for improving subpixel visibility
    Abstract: The present disclosure relates to methods and devices for operation of a GPU. The device can determine a first subset of primitives associated with a set of objects within an image. The first subset of primitives can be based on a first viewpoint with respect to the set of objects. The device can also determine, for a second viewpoint with respect to the set of objects, a second subset of primitives excluding the first subset of primitives. In some aspects, the second subset of primitives can have a difference in depth with respect to the first subset of primitives that is less than a threshold depth. Additionally, the device can mark the first subset of primitives and the second subset of primitives as visible. Further, the device can generate graphical content based on the marked first subset of primitives and the marked second subset of primitives.
    US Patent app., 2020
  • PA03
    Real-time potentially visible set for streaming rendering
    Jozef Hladky, Markus Steinberger, Hans-Peter Seidel:
    Real-time potentially visible set for streaming rendering
    Abstract: The invention relates to a method for transmitting 3D model data, the 3D model data comprising polygons, from a server to a client for rendering, the method comprising: obtaining the 3D model data by the server; and transmitting the 3D model data from the server to the client. According to the invention, the 3D model data is obtained by the server, based on a given multitude of possible views.
    US Patent app., 2020
  • C27
    Ouroboros: virtualized queues for dynamic memory management on GPUs
    Martin Winter, Daniel Mlakar, Mathias Parger, Markus Steinberger:
    Ouroboros: virtualized queues for dynamic memory management on GPUs
    Abstract: Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue.
    In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically, only currently needed queue chunks are held in memory and freed up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar or improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting.
    Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.
    International Conference on Supercomputing (ICS'20), 2020
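    The queue building block being virtualized can be illustrated with a toy serial ring buffer of free page indices. This is our own simplification for illustration only; the paper's queues are GPU-resident, lock-free and built out of the very chunks they manage:

```python
class PageQueue:
    """Toy fixed-capacity ring buffer of free page indices -- the basic
    structure that allocators reuse memory through. Serial, for clarity."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0   # next dequeue position
        self.tail = 0   # next enqueue position
        self.count = 0

    def enqueue(self, page):
        """Return a freed page index to the pool."""
        assert self.count < len(self.slots), "queue full"
        self.slots[self.tail] = page
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1

    def dequeue(self):
        """Hand out a reusable page index for a new allocation."""
        assert self.count > 0, "queue empty"
        page = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return page
```

    The scaling problem the paper addresses is visible here: a static queue of this kind must be sized up front and one is needed per allocation size, which is what the virtualized, chunk-backed queues avoid.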
  • C26
    Fast Multi-View Rendering for Real-Time Applications
    Johannes Unterguggenberger, Bernhard Kerbl, Markus Steinberger, Dieter Schmalstieg, Michael Wimmer:
    Fast Multi-View Rendering for Real-Time Applications
    Abstract: Efficient rendering of multiple views can be a critical performance factor for real-time rendering applications. Generating more than one view multiplies the amount of rendered geometry, which can cause a huge performance impact. Minimizing that impact has been a target of previous research and GPU manufacturers, who have started to equip devices with dedicated acceleration units. However, vendor-specific acceleration is not the only option to increase multi-view rendering (MVR) performance. Available graphics API features, shader stages and optimizations can be exploited for improved MVR performance, while generally offering more versatile pipeline configurations, including the preservation of custom tessellation and geometry shaders. In this paper, we present an exhaustive evaluation of MVR pipelines available on modern GPUs. We provide a detailed analysis of previous techniques, hardware-accelerated MVR and propose a novel method, leading to the creation of an MVR catalogue. Our analyses cover three distinct applications to help gain clarity on overall MVR performance characteristics. Our interpretation of the observed results provides a guideline for selecting the most appropriate one for various use cases on different GPU architectures.
    Eurographics Symposium on Parallel Graphics and Visualization (EGPGV'20), 2020
  • J32
    Interactive Modeling of Cellular Structures on Surfaces with Application to Additive Manufacturing
    Pascal Stadlbauer, Daniel Mlakar, Hans-Peter Seidel, Markus Steinberger, Rhaleb Zayer:
    Interactive Modeling of Cellular Structures on Surfaces with Application to Additive Manufacturing
    Abstract: The rich and evocative patterns of natural tessellations endow them with an unmistakable artistic appeal and structural properties which are echoed across design, production, and manufacturing. Unfortunately, interactive control of such patterns, as modeled by Voronoi diagrams, is limited to the simple two-dimensional case and does not extend well to freeform surfaces. We present an approach for direct modeling and editing of such cellular structures on surface meshes. The overall modeling experience is driven by a set of editing primitives which are efficiently implemented on graphics hardware. We feature a novel application for 3D printing on modern support-free additive manufacturing platforms. Our method decomposes the input surface into a cellular skeletal structure which hosts a set of overlay shells. In this way, material saving can be channeled to the shells while structural stability is channeled to the skeleton. To accommodate the available printer build volume, the cellular structure can be further split into moderately sized parts. Together with shells, they can be conveniently packed to save on production time. The assembly of the printed parts is streamlined by a part numbering scheme which respects the geometric layout of the input model.
    Computer Graphics Forum / Eurographics (EG'20), 2020
  • J31
    Subdivision-Specialized Linear Algebra Kernels for Static and Dynamic Mesh Connectivity on the GPU
    Daniel Mlakar, Martin Winter, Pascal Stadlbauer, Hans-Peter Seidel, Markus Steinberger, Rhaleb Zayer:
    Subdivision-Specialized Linear Algebra Kernels for Static and Dynamic Mesh Connectivity on the GPU
    Abstract: Subdivision surfaces have become an invaluable asset in production environments. While progress over the last years has allowed the use of graphics hardware to meet performance demands during animation and rendering, high-performance is limited to immutable mesh connectivity scenarios. Motivated by recent progress in mesh data structures, we show how the complete Catmull-Clark subdivision scheme can be abstracted in the language of linear algebra. While this high-level formulation allows for a fully parallel implementation with significant performance gains, the underlying algebraic operations require further specialization for modern parallel hardware. Integrating domain knowledge about the mesh matrix data structure, we replace costly general linear algebra operations like matrix-matrix multiplication by specialized kernels. By further considering innate properties of Catmull-Clark subdivision, like the quad-only structure after refinement, we achieve an additional order of magnitude in performance and significantly reduce memory footprints. Our approach can be adapted seamlessly for different use cases, such as regular subdivision of dynamic meshes, fast evaluation for immutable topology and feature-adaptive subdivision for efficient rendering of animated models. In this way, patchwork solutions are avoided in favor of a streamlined solution with consistent performance gains throughout the production pipeline. The versatility of the sparse matrix linear algebra abstraction underlying our work is further demonstrated by extension to other schemes such as √3 and Loop subdivision.
    Eurographics '20 Best Paper Award
    Computer Graphics Forum / Eurographics (EG'20), 2020

  • C25
    Stochastic Substitute Trees for Real-Time Global Illumination
    Wolfgang Tatzgern, Benedikt Mayr, Bernhard Kerbl, Markus Steinberger:
    Stochastic Substitute Trees for Real-Time Global Illumination
    Abstract: With the introduction of hardware-supported ray tracing and deep learning for denoising, computer graphics has made a considerable step toward real-time global illumination. In this work, we present an alternative global illumination method: The stochastic substitute tree (SST), a hierarchical structure inspired by lightcuts with light probability distributions as inner nodes. Our approach distributes virtual point lights (VPLs) in every frame and efficiently constructs the SST over those lights by clustering according to Morton codes. Global illumination is approximated by sampling the SST and considers the BRDF at the hit location as well as the SST nodes’ intensities for importance sampling directly from inner nodes of the tree. To remove the introduced Monte Carlo noise, we use a recurrent autoencoder. In combination with temporal filtering, we deliver real-time global illumination for complex scenes with challenging light distributions.
    Symposium on Interactive 3D Graphics and Games (I3D '20), 2020
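    The Morton-code clustering step can be sketched concretely: quantized 3D VPL positions are bit-interleaved so that sorting by code groups spatially nearby lights. The bit-spreading routine below is the standard magic-number formulation; function names are ours:

```python
def part1by2(n):
    """Spread the 10 low bits of n apart so each is followed by two zero bits."""
    n &= 0x3FF
    n = (n | (n << 16)) & 0x030000FF
    n = (n | (n << 8)) & 0x0300F00F
    n = (n | (n << 4)) & 0x030C30C3
    n = (n | (n << 2)) & 0x09249249
    return n

def morton3d(x, y, z):
    """30-bit Morton code interleaving three 10-bit grid coordinates as
    ...z1y1x1 z0y0x0, so codes of nearby cells share long prefixes."""
    return (part1by2(z) << 2) | (part1by2(y) << 1) | part1by2(x)
```

    Sorting VPLs by these codes linearizes them along a space-filling curve, so consecutive runs can be merged bottom-up into the hierarchy's inner nodes.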
  • C24
    spECK: Accelerating GPU Sparse Matrix-Matrix Multiplication through Lightweight Analysis
    Mathias Parger, Martin Winter, Daniel Mlakar, Markus Steinberger:
    spECK: Accelerating GPU Sparse Matrix-Matrix Multiplication through Lightweight Analysis
    Abstract: Sparse general matrix-matrix multiplication on GPUs is challenging due to the varying sparsity patterns of sparse matrices. Existing solutions achieve good performance for certain types of matrices, but fail to accelerate all kinds of matrices in the same manner. Our approach combines multiple strategies with dynamic parameter selection to dynamically choose and tune the best fitting algorithm for each row of the matrix. This choice is supported by a lightweight, multi-level matrix analysis, which carefully balances analysis cost and expected performance gains. Our evaluation on thousands of matrices with various characteristics shows that we outperform all currently available solutions in 79% over all matrices with >15k products and that we achieve the second best performance in 15%. For these matrices, our solution is on average 83% faster than the second best approach and up to 25X faster than other state-of-the-art GPU implementations. Using our approach, applications can expect great performance independent of the matrices they work on.
    Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '20), 2020
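    The per-row work that such row-wise SpGEMM strategies choose between and tune can be sketched with a hash-map accumulator over CSR inputs (a simplified serial sketch, not the paper's GPU kernels):

```python
def spgemm_row(a_indptr, a_indices, a_data,
               b_indptr, b_indices, b_data, row):
    """Compute one row of C = A @ B for CSR matrices A and B.
    Each nonzero A[row, k] scales row k of B; products with the same
    output column are merged in a hash-map accumulator."""
    acc = {}
    for p in range(a_indptr[row], a_indptr[row + 1]):
        col_a, val_a = a_indices[p], a_data[p]
        for q in range(b_indptr[col_a], b_indptr[col_a + 1]):
            col_b = b_indices[q]
            acc[col_b] = acc.get(col_b, 0.0) + val_a * b_data[q]
    return acc  # sparse row of C as {column: value}
```

    The number of products and distinct output columns per row is exactly what varies with the sparsity pattern, which is why a lightweight per-row analysis pays off before committing to an accumulator strategy and its parameters.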
  • J30
    The Camera Offset Space: Real-time Potentially Visible Set Computations for Streaming Rendering
    Jozef Hladky, Hans-Peter Seidel, Markus Steinberger:
    The Camera Offset Space: Real-time Potentially Visible Set Computations for Streaming Rendering
    Abstract: Potential visibility has historically always been of importance when rendering performance was insufficient. With the rise of virtual reality, rendering power may once again be insufficient, e.g., for integrated graphics of head-mounted displays. To tackle the issue of efficient potential visibility computations on modern graphics hardware, we introduce the camera offset space (COS). Opposite to how traditional visibility computations work---where one determines which pixels are covered by an object under all potential viewpoints---the COS describes under which camera movement a sample location is covered by a triangle. In this way, the COS opens up a new set of possibilities for visibility computations. By evaluating the pairwise relations of triangles in the COS, we show how to efficiently determine occluded triangles. Constructing the COS for all pixels of a rendered view leads to a complete potentially visible set (PVS) for complex scenes. By fusing triangles to larger occluders, including locations between pixel centers, and considering camera rotations, we describe an exact PVS algorithm that includes all viewing directions inside a view cell. Implementing the COS is a combination of real-time rendering and compute steps. We provide the first GPU PVS implementation that works without preprocessing, on-the-fly, on unconnected triangles. This opens the door to a new approach of rendering for virtual reality head-mounted displays and server-client settings for streaming 3D applications such as video games.
    ACM Transactions on Graphics (SIGGRAPH Asia'19), 2019
  • C23
    Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU
    Dominik Tödling, Martin Winter, Markus Steinberger:
    Breadth-First Search on Dynamic Graphs using Dynamic Parallelism on the GPU
    Abstract: Breadth-First Search is an important basis for many different graph-based algorithms with applications ranging from peer-to-peer networking to garbage collection. However, the performance of different approaches depends strongly on the type of graph. In this paper, we present an efficient algorithm that performs well on a variety of different graphs. As part of this, we look into utilizing dynamic parallelism in order to both reduce the overhead from latency between the CPU and GPU and speed up the algorithm itself. Lastly, we integrate the algorithm with the faimGraph framework for dynamic graphs and examine its performance relative to a Compressed-Sparse-Row data structure. We show that our algorithm can be well adapted to the dynamic setting and outperforms another competing dynamic graph framework on our test set.
    High Performance Extreme Computing, 2019
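    The frontier-based BFS formulation that maps each level to one bulk parallel step can be sketched serially (our own simplification; on the GPU, each level expansion becomes a kernel launch, which dynamic parallelism lets the device issue without a CPU round-trip):

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over adjacency lists {node: [neighbors]}.
    Returns {node: distance}, with -1 for unreachable nodes."""
    dist = {v: -1 for v in adj}
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        next_frontier = []
        for v in frontier:          # on the GPU: one thread (group) per vertex
            for w in adj[v]:
                if dist[w] == -1:   # first discovery assigns the level
                    dist[w] = level
                    next_frontier.append(w)
        frontier = next_frontier
    return dist
```

    The frontier size varies wildly between levels and graph types, which is why launching each expansion from the device, sized to the actual frontier, can beat a fixed CPU-driven launch loop.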
  • O03
    Shading Atlas Streaming Demonstration
    Joerg H. Mueller, Thomas Neff, Philip Voglreiter, Mina Makar, Markus Steinberger, Dieter Schmalstieg:
    Shading Atlas Streaming Demonstration
    Abstract: Streaming high quality rendering for virtual reality applications requires minimizing perceived latency. Shading Atlas Streaming (SAS) [Mueller et al. 2018] is a novel object-space rendering framework suitable for streaming virtual reality content. SAS decouples server-side shading from client-side rendering, allowing the client to perform framerate upsampling and latency compensation autonomously for short periods of time. The shading information created by the server in object space is temporally coherent and can be efficiently compressed using standard MPEG encoding. SAS compares favorably to previous methods for remote image-based rendering in terms of image quality and network bandwidth efficiency. SAS allows highly efficient parallel allocation in a virtualized-texture-like memory hierarchy, solving a common efficiency problem of object-space shading. With SAS, untethered virtual reality headsets can benefit from high quality rendering without paying in increased latency. Visitors will be able to try SAS by roaming the exhibit area wearing a Snapdragon 845 based headset connected via consumer WiFi.
    ACM SIGGRAPH 2019 Emerging Technologies, 2019
  • J29
    Tessellated Shading Streaming
    Jozef Hladky, Hans-Peter Seidel, Markus Steinberger:
    Tessellated Shading Streaming
    Abstract: Presenting high-fidelity 3D content on compact portable devices with low computational power is challenging. Smartphones, tablets and head-mounted displays (HMDs) suffer from thermal and battery-life constraints and thus cannot match the render quality of desktop PCs and laptops. Streaming rendering enables showing high-quality content but can suffer from potentially high latency. We propose an approach to efficiently capture shading samples in object space and pack them into a texture. Streaming this texture to the client, we support temporal frame up-sampling with high fidelity, low latency and high mobility. We introduce two novel sample distribution strategies and a novel triangle representation in the shading atlas space. Since such a system requires dynamic parallelism, we propose an implementation exploiting the power of hardware-accelerated tessellation stages. Our approach allows fast decoding and rendering of extrapolated views on a client device by using hardware-accelerated interpolation between shading samples and a set of potentially visible geometry. A comparison to existing shading methods shows that our sample distributions allow better client shading quality than previous atlas streaming approaches and outperforms image-based methods in all relevant aspects.
    Computer Graphics Forum / Eurographics Symposium on Rendering (EGSR'19), 2019
  • PA02
    Accelerated occlusion computation
    Dieter Schmalstieg, Markus Steinberger, Philip Voglreiter:
    Accelerated occlusion computation
    Abstract: A system utilizing specified occlusion techniques to reduce the overall amount of occlusion computations required to generate an image in a changing viewpoint environment. Viewpoint location changes about a current viewpoint can result in a different image view with the potential exposure of previously occluded portions of a scene image that would now be visible from that new viewpoint location. To reduce the amount of occlusion computations required to render an associated image in a changing viewpoint environment, techniques are described that reduce occlusion computations using, for example, one or more trim region rendering techniques. Some techniques include generating potential visible set (PVS) information based on a viewport including an original viewpoint location and an anticipated change in viewpoint location.
    US Patent App. 15/867,401, 2019
  • P03
    From Ground to Space: Real-time Rendering of Procedural Planets at Arbitrary Altitudes
    Florian Michelic, Michael Kenzel, Karl Haubenwallner, Markus Steinberger, Bernhard Kerbl:
    From Ground to Space: Real-time Rendering of Procedural Planets at Arbitrary Altitudes
    Abstract: As virtual 3D worlds continue to expand in size, a growing number of exploratory games and simulations aim to visualize entire planets and galaxies in real-time. However, consistent rendering of the whole planetary surface at varying levels of detail poses a number of challenges. As the viewer transitions from observing a planet from a distance to the ground level, fine-granular details manifest in its terrain geometry. For presenting these varying amounts of detail, a suitable solution must consider how the geometry should behave to ensure seamless accommodation for all frames of reference. Instead of storing a complete model of the planetary surface, more efficient alternatives employ structured base meshes of limited complexity, which can then be enhanced with elevation data from a height field. Based on this concept, we propose a new method for applying terrain data to spherical surfaces of arbitrary scale. Our approach uses a uniform grid that is mapped to a paraboloid and can be offset to influence the distribution of height field samples on the sphere. Starting from its initial layout, we show how the grid can be altered at runtime to avoid a variety of notorious artifacts during camera motion. We further derive suitable grid offset parameters to ensure correct visibility of distant objects behind the horizon. By decoupling the mesh geometry from the height field data and display resolution, the overhead of our method can be bounded and is subject only to the grid configuration. Compared to previous approaches, our solution is easy to implement, requires no preprocessing and achieves reduced runtime for rendering scenes at a similar quality.
    I3D 2019 Best Poster Award
    Symposium on Interactive 3D Graphics and Games (I3D), 2019

  • J28
    Hierarchical Rasterization of Curved Primitives for Vector Graphics Rendering on the GPU
    Mark Dokter, Jozef Hladky, Mathias Parger, Dieter Schmalstieg, Hans-Peter Seidel, Markus Steinberger:
    Hierarchical Rasterization of Curved Primitives for Vector Graphics Rendering on the GPU
    Abstract: In this paper, we introduce the CPatch, a curved primitive that can be used to construct arbitrary vector graphics. A CPatch is a generalization of a 2D polygon: Any number of curves up to a cubic degree bound a primitive. We show that a CPatch can be rasterized efficiently in a hierarchical manner on the GPU, locally discarding irrelevant portions of the curves. Our rasterizer is fast and scalable, works on all patches in parallel, and does not require any approximations. We show a parallel implementation of our rasterizer, which naturally supports all kinds of color spaces, blending and super-sampling. Additionally, we show how vector graphics input can efficiently be converted to a CPatch representation, solving challenges like patch self-intersections and false inside-outside classification. Results indicate that our approach is faster than the state-of-the-art, more flexible and could potentially be implemented in hardware.
    Computer Graphics Forum / Eurographics (EG'19), 2019
  • C22
    Adaptive sparse matrix-matrix multiplication on the GPU
    Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, Markus Steinberger:
    Adaptive sparse matrix-matrix multiplication on the GPU
    Abstract: In the ongoing efforts targeting the vectorization of linear algebra primitives, sparse matrix-matrix multiplication (SpGEMM) has received considerably less attention than sparse Matrix-Vector multiplication (SpMV). While both are equally important, this disparity can be attributed mainly to the additional formidable challenges raised by SpGEMM.
    In this paper, we present a dynamic approach for addressing SpGEMM on the GPU. Our approach works directly on the standard compressed sparse rows (CSR) data format. In comparison to previous SpGEMM implementations, our approach guarantees a homogeneous, load-balanced access pattern to the first input matrix and improves memory access to the second input matrix. It adaptively re-purposes GPU threads during execution and maximizes the time during which the efficient on-chip scratchpad memory can be used. Adhering to a completely deterministic scheduling pattern guarantees bit-stable results during repetitive execution, a property missing from other approaches. Evaluation on an extensive sparse matrix benchmark suggests that our approach is the fastest SpGEMM implementation for highly sparse matrices (80% of the set). When bit-stable results are sought, our approach is the fastest across the entire test set.
    Proceedings of the 24th Symposium on Principles and Practice of Parallel Programming, 2019
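    To make the CSR-based SpGEMM formulation concrete, here is a minimal Gustavson-style row-by-row sketch (illustrative sequential Python of my own; the paper's GPU thread re-purposing and scratchpad management are not reproduced). Emitting each row's columns in sorted order mirrors the deterministic, bit-stable output the paper guarantees:

    ```python
    def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
        """C = A @ B for CSR inputs: accumulate scaled rows of B per row of A."""
        c_ptr, c_idx, c_val = [0], [], []
        for i in range(n_rows):
            acc = {}  # per-row accumulator (plays the role of on-chip scratchpad)
            for jj in range(a_ptr[i], a_ptr[i + 1]):
                j, av = a_idx[jj], a_val[jj]
                for kk in range(b_ptr[j], b_ptr[j + 1]):
                    acc[b_idx[kk]] = acc.get(b_idx[kk], 0.0) + av * b_val[kk]
            for k in sorted(acc):  # deterministic column order per row
                c_idx.append(k)
                c_val.append(acc[k])
            c_ptr.append(len(c_idx))
        return c_ptr, c_idx, c_val
    ```

    The sketch also shows why SpGEMM is harder than SpMV: the size of each output row (`acc`) is unknown in advance, which is exactly the memory-management challenge a GPU implementation must handle.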
  • J27
    Layered Fields for Natural Tessellations on Surfaces
    Rhaleb Zayer, Daniel Mlakar, Markus Steinberger, Hans-Peter Seidel:
    Layered Fields for Natural Tessellations on Surfaces
    Abstract: Mimicking natural tessellation patterns is a fascinating multi-disciplinary problem. Geometric methods aiming at reproducing such partitions on surface meshes are commonly based on the Voronoi model and its variants, and are often faced with challenging issues such as metric estimation, geometric and topological complications, and most critically, parallelization. In this paper, we introduce an alternate model which may be of value for resolving these issues. We drop the assumption that regions need to be separated by lines. Instead, we regard region boundaries as narrow bands and we model the partition as a set of smooth functions layered over the surface. Given an initial set of seeds or regions, the partition emerges as the solution of a time dependent set of partial differential equations describing concurrently evolving fronts on the surface. Our solution does not require geodesic estimation, elaborate numerical solvers, or complicated bookkeeping data structures. The cost per time-iteration is dominated by the multiplication and addition of two sparse matrices. Extension of our approach in a Lloyd’s algorithm fashion can be easily achieved and the extraction of the dual mesh can be conveniently performed in parallel through matrix algebra. As our approach relies mainly on basic linear algebra kernels, it lends itself to efficient implementation on modern graphics hardware.
    ACM Transactions on Graphics (SIGGRAPH Asia'18), 2018
  • J26
    Shading Atlas Streaming
    Joerg H. Mueller, Philip Voglreiter, Mark Dokter, Thomas Neff, Mina Makar, Markus Steinberger, Dieter Schmalstieg:
    Shading Atlas Streaming
    Abstract: Streaming high quality rendering for virtual reality applications requires minimizing perceived latency. We introduce Shading Atlas Streaming (SAS), a novel object-space rendering framework suitable for streaming virtual reality content. SAS decouples server-side shading from client-side rendering, allowing the client to perform framerate upsampling and latency compensation autonomously for short periods of time. The shading information created by the server in object space is temporally coherent and can be efficiently compressed using standard MPEG encoding. Our results show that SAS compares favorably to previous methods for remote image-based rendering in terms of image quality and network bandwidth efficiency. SAS allows highly efficient parallel allocation in a virtualized-texture-like memory hierarchy, solving a common efficiency problem of object-space shading. With SAS, untethered virtual reality headsets can benefit from high quality rendering without paying with increased latency.
    ACM Transactions on Graphics (SIGGRAPH Asia'18), 2018
  • C21
    Human Upper-Body Inverse Kinematics for Increased Embodiment in Consumer-Grade Virtual Reality
    Mathias Parger, Joerg H. Mueller, Dieter Schmalstieg, Markus Steinberger:
    Human Upper-Body Inverse Kinematics for Increased Embodiment in Consumer-Grade Virtual Reality
    Abstract: Having a virtual body can increase embodiment in virtual reality (VR) applications. However, consumer-grade VR falls short of delivering sufficient sensory information for full-body motion capture. Consequently, most current VR applications do not even show arms, although they are often in the field of view. We address this shortcoming with a novel human upper-body inverse kinematics algorithm specifically targeted at tracking from head and hand sensors only. We present heuristics for elbow positioning depending on the shoulder-to-hand distance and for avoiding reaching unnatural joint limits. Our results show that our method increases the accuracy compared to general inverse kinematics applied to human arms with the same tracking input. In a user study, participants preferred our method over displaying disembodied hands without arms, but also over a more expensive motion capture system. In particular, our study shows that virtual arms animated with our inverse kinematics system can be used for applications involving heavy arm movement. We demonstrate that our method can not only be used to increase embodiment, but can also support interaction involving arms or shoulders, such as holding up a shield.
    Symposium on Virtual Reality Software and Technology (VRST ’18), 2018
  • C20
    faimGraph: High Performance Management of Fully-Dynamic Graphs under tight Memory Constraints on the GPU
    Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, Markus Steinberger:
    faimGraph: High Performance Management of Fully-Dynamic Graphs under tight Memory Constraints on the GPU
    Abstract: In this paper, we present a fully-dynamic graph data structure for the Graphics Processing Unit (GPU). It delivers high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU. The data structure is fully-dynamic, allowing not only for edge but also vertex updates. Performing the memory management on the GPU allows for fast initialization times and efficient update procedures without additional intervention or reallocation procedures from the host. Our optimized approach performs initialization completely in parallel; up to 300x faster compared to previous work. It achieves up to 200 million edge updates per second for sorted and unsorted update batches; up to 30x faster than previous work. Furthermore, it can perform more than 300 million adjacency queries and millions of vertex updates per second. On account of efficient memory management techniques like a queuing approach, currently unused memory is reused later on by the framework, permitting the storage of tens of millions of vertices and hundreds of millions of edges in GPU memory. We evaluate algorithmic performance using a PageRank and a Static Triangle Counting (STC) implementation, demonstrating the suitability of the framework even for memory access intensive algorithms.
    High Performance Computing, Networking, Storage and Analysis (SC’18), 2018
  • J25
    A High-Performance Software Graphics Pipeline Architecture for the GPU
    Michael Kenzel, Bernhard Kerbl, Dieter Schmalstieg, Markus Steinberger:
    A High-Performance Software Graphics Pipeline Architecture for the GPU
    Abstract: In this paper, we present a real-time graphics pipeline implemented entirely in software on a modern GPU. As opposed to previous work, our approach features a fully-concurrent, multi-stage, streaming design with dynamic load balancing, capable of operating efficiently within bounded memory. We address issues such as primitive order, vertex reuse, and screen-space derivatives of dependent variables, which are essential to real-world applications, but have largely been ignored by comparable work in the past. The power of a software approach lies in the ability to tailor the graphics pipeline to any given application. In exploration of this potential, we design and implement four novel pipeline modifications. Evaluation of the performance of our approach on more than 100 real-world scenes collected from video games shows rendering speeds within one order of magnitude of the hardware graphics pipeline as well as significant improvements over previous work, not only in terms of capabilities and performance, but also robustness.
    ACM Transactions on Graphics (SIGGRAPH'18), 2018
  • J24
    Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU
    Bernhard Kerbl, Michael Kenzel, Elena Ivanchenko, Dieter Schmalstieg, Markus Steinberger:
    Revisiting The Vertex Cache: Understanding and Optimizing Vertex Processing on the modern GPU
    Abstract: In this paper, we question the premise that graphics hardware uses a post-transform cache to avoid redundant vertex shader invocations. A large body of existing work on optimizing indexed triangle sets for rendering speed is based upon this widely-accepted assumption. We conclusively show that this assumption does not hold up on modern graphics hardware. We design and conduct experiments that demonstrate the behavior of current hardware of all major vendors to be inconsistent with the presence of a common post-transform cache. Our results strongly suggest that modern hardware rather relies on a batch-based approach, most likely for reasons of scalability. A more thorough investigation based on these initial experiments allows us to partially uncover the actual strategies implemented on graphics processors today. We reevaluate existing mesh optimization algorithms in light of these new findings and present a new mesh optimization algorithm designed from the ground up to target architectures that rely on batch-based vertex reuse. In an extensive evaluation, we measure and compare the real-world performance of various optimization algorithms on modern hardware. Our results show that some established algorithms still perform well. However, if the batching strategy of the target architecture is known, our approach can significantly outperform these previous state-of-the-art methods.
    Proceedings of the ACM on Computer Graphics and Interaction Techniques (HPG'18), 2018
  • J23
    On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing
    Michael Kenzel, Bernhard Kerbl, Wolfgang Tatzgern, Elena Ivanchenko, Dieter Schmalstieg, Markus Steinberger:
    On-the-fly Vertex Reuse for Massively-Parallel Software Geometry Processing
    Abstract: Due to its flexibility, compute mode is becoming more and more attractive as a way to implement many of the algorithms that are part of a state-of-the-art rendering pipeline. A key problem commonly encountered in graphics applications is streaming vertex and geometry processing. In a typical triangle mesh, the same vertex is on average referenced six times. To avoid redundant computation during rendering, a post-transform cache is traditionally employed to reuse vertex processing results. However, such a vertex cache can generally not be implemented efficiently in software and does not scale well as parallelism increases. We explore alternative strategies for reusing per-vertex results on-the-fly during massively-parallel software geometry processing. Given an input stream divided into batches, we analyze the effectiveness of sorting, hashing, and intra-threadgroup communication for identifying and exploiting local reuse potential. We design and present four vertex reuse strategies tailored to modern GPU architectures. We demonstrate that, in a variety of applications, these strategies not only achieve effective reuse of vertex processing results, but can boost performance by up to 2–3x compared to a naïve approach. Curiously, our experiments also show that our batch-based approaches exhibit behavior similar to the OpenGL implementation on current graphics hardware.
    Proceedings of the ACM on Computer Graphics and Interaction Techniques (HPG'18), 2018
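    The batch-local reuse idea the abstract describes can be sketched in a few lines: split the index stream into batches and shade each unique vertex once per batch. This is an illustrative CPU-side sketch only (names are mine); the paper's sorting, hashing, and intra-threadgroup variants are not reproduced here.

    ```python
    def batch_dedupe(indices, batch_size):
        """Batch-local vertex reuse: one shaded vertex per unique index per batch.
        Returns, per batch, the unique index list plus a remapped local index stream."""
        batches = []
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            unique, remap, seen = [], [], {}
            for v in batch:
                if v not in seen:
                    seen[v] = len(unique)
                    unique.append(v)  # this vertex would be shaded once for the batch
                remap.append(seen[v])
            batches.append((unique, remap))
        return batches
    ```

    With six references to four vertices in one batch, only four shader invocations remain; shrinking the batch size trades reuse for parallel independence, which is the scalability tension the paper analyzes.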
  • J22
    On Dynamic Scheduling for the GPU and its Applications in Computer Graphics and Beyond
    Markus Steinberger:
    On Dynamic Scheduling for the GPU and its Applications in Computer Graphics and Beyond
    Abstract: During the last decade we have witnessed a severe change in computing, as processor clock-rates stopped increasing. Thus, arguably the only way to increase processing power is switching to a parallel computing architecture, like the graphics processing unit (GPU). While a GPU offers tremendous processing power, harnessing this power is often difficult. In our research we tackle this issue, providing various components to allow a wider class of algorithms to execute efficiently on the GPU. These efforts include new processing models for dynamic algorithms with various degrees of parallelism, a versatile task scheduler, based on highly efficient work queues which also support dynamic priority scheduling, and efficient dynamic memory management. Our scheduling strategies advance the state-of-the-art algorithms in the field of rendering, visualization, and geometric modeling. In the field of rendering, we provide algorithms that can significantly speed-up image generation, assigning more processing power to the most important image regions. In the field of geometric modeling we provide the first GPU-based grammar evaluation system that can generate and render cities in real-time which otherwise take hours to generate and could not fit into GPU memory. Finally, we show that mesh processing algorithms can be computed significantly faster on the GPU when parallelizing them with advanced scheduling strategies.
    IEEE Computer Graphics and Applications, 2018
  • C19
    The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
    Bernhard Kerbl, Joerg H. Mueller, Michael Kenzel, Dieter Schmalstieg, Markus Steinberger:
    The Broker Queue: A Fast, Linearizable FIFO Queue for Fine-Granular Work Distribution on the GPU
    Abstract: Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We show that previous queuing approaches are unfit for this task, as they either (1) do not work well in a massively parallel environment, or (2) obstruct the use of individual threads on top of single-instruction-multiple-data (SIMD) cores, or (3) block during access, thus prohibiting multi-queue setups. With these issues in mind, we present the Broker Queue, a highly efficient, fully linearizable FIFO queue for fine-granular parallel work distribution on the GPU. We evaluate its performance and usability on modern GPU models against a wide range of existing algorithms. The Broker Queue is up to three orders of magnitude faster than non-blocking queues and can even outperform significantly simpler techniques that lack desired properties for fine-granular work distribution.
    International Conference on Supercomputing (ICS'18), 2018
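    The core of the Broker Queue is a bounded ring buffer addressed by monotonically growing head/tail tickets. A minimal single-threaded sketch of that indexing scheme (my own illustrative code; the actual queue uses GPU atomics and a brokering stage to park producers/consumers, which is not modeled here):

    ```python
    class RingQueue:
        """Bounded FIFO ring buffer with ticket counters: a CPU-side sketch of
        ticket-to-slot mapping. Not thread-safe; the GPU version replaces the
        plain increments with atomic fetch-and-add on the tickets."""
        def __init__(self, capacity):
            self.buf = [None] * capacity
            self.capacity = capacity
            self.head = 0  # ticket of the next element to dequeue
            self.tail = 0  # ticket of the next free slot to enqueue into

        def enqueue(self, item):
            if self.tail - self.head == self.capacity:
                return False  # full: a real broker would block the producer here
            self.buf[self.tail % self.capacity] = item
            self.tail += 1
            return True

        def dequeue(self):
            if self.head == self.tail:
                return None  # empty
            item = self.buf[self.head % self.capacity]
            self.head += 1
            return item
    ```

    Because tickets only grow, `ticket % capacity` gives each operation a unique slot and FIFO order falls out of the counter order; making this safe for thousands of concurrent GPU threads is the hard part the paper addresses.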
  • J21
    A scalable queue for work distribution on GPUs
    Bernhard Kerbl, Michael Kenzel, Joerg H. Mueller, Dieter Schmalstieg, Markus Steinberger:
    A scalable queue for work distribution on GPUs
    Abstract: Harnessing the power of massively parallel devices like the graphics processing unit (GPU) is difficult for algorithms that show dynamic or inhomogeneous workloads. To achieve high performance, such advanced algorithms require scalable, concurrent queues to collect and distribute work. We present a new concurrent work queue, the Broker Queue, a highly efficient, linearizable queue for fine-granular work distribution on the GPU. We evaluate its usability and benefits in contrast to existing queuing algorithms. Our queue is up to one order of magnitude faster than non-blocking queues, and outperforms simpler queue designs that are unfit for fine-granular work distribution.
    ACM SIGPLAN Notices (PPoPP'18), 2018
  • C18
    Sparse Matrix Assembly on the GPU Through Multiplication Patterns
    Rhaleb Zayer, Markus Steinberger, Hans-Peter Seidel:
    Sparse Matrix Assembly on the GPU Through Multiplication Patterns
    Abstract: The numerical treatment of variational problems gives rise to large sparse matrices which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by direct solvers, the assembly step can be a potential bottleneck especially in implicit and time-dependent settings where considerable updates are needed. On standard HPC platforms, this process can be vectorized by taking advantage of additional mesh querying data structures. However, on graphics hardware, vectorization is inhibited by limited memory resources. To address this issue, we propose a lean unstructured mesh representation which allows casting the assembly problem as a sparse matrix-matrix multiplication. In this way, the underlying graph connectivity of the assembled matrix is captured through basic matrix algebra operations. Furthermore, we outline ideas for cutting down on data movement by lowering the memory footprint of the proposed representation and we analyze the effect of its memory layout on the assembly performance. As the underlying machinery behind our approach consists mainly of key linear algebra primitives, we foresee potential portability opportunities for our approach to various architectures.
    High Performance Extreme Computing, 2017
  • C17
    Autonomous, Independent Management of Dynamic Graphs on GPUs
    Martin Winter, Rhaleb Zayer, Markus Steinberger:
    Autonomous, Independent Management of Dynamic Graphs on GPUs
    Abstract: In this paper, we present a new, dynamic graph data structure, built to deliver high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU. By transferring the memory management to the GPU, efficient updating of the graph structure and fast initialization times are enabled as no additional memory allocation calls or reallocation procedures are necessary since they are handled directly on the device. In comparison to previous work, this optimized approach allows for significantly lower initialization times (up to 300x faster) and much higher update rates for significant changes to the graph structure and equal rates for small changes. The framework provides different update implementations tailored specifically to different graph properties, enabling over 100 million updates per second and keeping tens of millions of vertices and hundreds of millions of edges in memory without transferring data back and forth between device and host.
    HPEC'17 Best Student Paper
    High Performance Extreme Computing, 2017

  • C16
    Effective static bin patterns for sort-middle rendering
    Bernhard Kerbl, Michael Kenzel, Dieter Schmalstieg, Markus Steinberger:
    Effective static bin patterns for sort-middle rendering
    Abstract: To effectively utilize an ever-increasing number of processors during parallel rendering, hardware and software designers rely on sophisticated load balancing strategies. While dynamic load balancing is a powerful solution, it requires complex work distribution and synchronization mechanisms. Graphics hardware manufacturers have opted to employ static load balancing strategies instead. Specifically, triangle data is distributed to processors based on its overlap with screen-space tiles arranged in a fixed pattern. While the current strategy of using simple patterns for a small number of fast rasterizers achieves formidable performance, it is questionable how this approach will scale as the number of processors increases further. To address this issue, we analyze real-world rendering workloads, derive requirements for effective patterns, and present ten different pattern design strategies based on these requirements. In addition to a theoretical evaluation of these design strategies, we compare the performance of select patterns in a parallel sort-middle software rendering pipeline on an extensive set of triangle data captured from eight recent video games. As a result, we are able to identify a set of patterns that scale well and exhibit significantly improved performance over naïve approaches.
    High Performance Graphics (HPG '17), 2017
  • C15
    Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU
    Markus Steinberger, Rhaleb Zayer, Hans-Peter Seidel:
    Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU
    Abstract: The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels and in particular, the sparse matrix-vector product (SpMV). Despite great strides, most existing GPU-SpMV approaches trade off one aspect of performance against another. They either require preprocessing, exhibit inconsistent behavior, lead to execution divergence, suffer load imbalance or induce detrimental memory access patterns. In this paper, we present an uncompromising approach for SpMV on the GPU. Our approach requires no separate preprocessing or knowledge of the matrix structure and works directly on the standard compressed sparse rows (CSR) data format. From a global perspective, it exhibits a homogeneous behavior reflected in efficient memory access patterns and steady per-thread workload. From a local perspective, it avoids heterogeneous execution paths by adapting its behavior to the work load at hand, it uses an efficient encoding to keep temporary data requirements for on-chip memory low, and leads to divergence-free execution. We evaluate our approach on more than 2500 matrices comparing to vendor provided, and state-of-the-art SpMV implementations. Our approach not only significantly outperforms approaches directly operating on the CSR format (20% average performance increase), but also outperforms approaches that preprocess the matrix even when preprocessing time is discarded. Additionally, the same strategies lead to significant performance increase when adapted for transpose SpMV.
    International Conference on Supercomputing (ICS'17), 2017
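    For reference, the textbook CSR SpMV kernel that this work starts from looks as follows (a sequential sketch of my own; the paper's contribution concerns how this loop nest is split homogeneously and divergence-free across GPU threads, not the arithmetic itself):

    ```python
    def spmv_csr(row_ptr, col_idx, vals, x):
        """y = A @ x for a matrix in compressed sparse rows (CSR) format.
        row_ptr[i]..row_ptr[i+1] delimits the nonzeros of row i."""
        y = [0.0] * (len(row_ptr) - 1)
        for i in range(len(y)):
            for jj in range(row_ptr[i], row_ptr[i + 1]):
                y[i] += vals[jj] * x[col_idx[jj]]
        return y
    ```

    The load-balancing difficulty is visible here: row lengths `row_ptr[i+1] - row_ptr[i]` can vary wildly, so a naive one-thread-per-row GPU mapping produces the imbalance and divergence the paper sets out to avoid.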
  • C14
    Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU
    Andreas Derler, Rhaleb Zayer, Hans-Peter Seidel, Markus Steinberger:
    Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU
    Abstract: We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses reduced bit length for addressing the entries. This allows a smaller memory footprint than standard formats. Executing algorithms on a hierarchical structure on the GPU usually entails significant synchronization and management overhead or slowdowns due to diverging execution paths and memory access patterns. We address these issues by means of a dynamic scheduling strategy specifically designed for executing algorithms on top of a hierarchical matrix on the GPU. The evaluation of our implementation of basic linear algebra routines suggests that our hierarchical format is competitive to highly optimized standard libraries and significantly outperforms them in the case of transpose matrix operations. The results point towards the viability of hierarchical matrix formats on massively parallel devices such as the GPU.
    International Conference on Supercomputing (ICS'17), 2017
  • J20
    ShapeGenetics: Using Genetic Algorithms for Procedural Modeling
    Karl Haubenwallner, Hans-Peter Seidel, Markus Steinberger:
    ShapeGenetics: Using Genetic Algorithms for Procedural Modeling
    Abstract: In this paper, we show that genetic algorithms (GA) can be used to control the output of procedural modeling algorithms. We propose an efficient way to encode the choices that have to be made during a procedural generation as a hierarchical genome representation. In combination with mutation and reproduction operations specifically designed for controlled procedural modeling, our GA can evolve a population of individual models close to any high-level goal. Possible scenarios include a volume that should be filled by a procedurally grown tree or a painted silhouette that should be followed by the skyline of a procedurally generated city. These goals are easy to set up for an artist compared to the tens of thousands of variables that describe the generated model and are chosen by the GA. Previous approaches for controlled procedural modeling either use Reversible Jump Markov Chain Monte Carlo (RJMCMC) or Stochastically-Ordered Sequential Monte Carlo (SOSMC) as workhorse for the optimization. While RJMCMC converges slowly, requiring multiple hours for the optimization of larger models, it produces high quality models. SOSMC shows faster convergence under tight time constraints for many models, but can get stuck due to choices made in the early stages of optimization. Our GA shows faster convergence than SOSMC and generates better models than RJMCMC in the long run.
    Computer Graphics Forum / Eurographics (EG'17), 2017
  • J19
    A GPU-adapted Structure for Unstructured Grids
    Rhaleb Zayer, Markus Steinberger, Hans-Peter Seidel:
    A GPU-adapted Structure for Unstructured Grids
    Abstract: A key advantage of working with structured grids (e.g., images) is the ability to directly tap into the powerful machinery of linear algebra. This is much less the case for unstructured grids, where intermediate bookkeeping data structures stand in the way. On modern high performance computing hardware, the conventional wisdom behind these intermediate structures is further challenged by costly memory access, and more importantly by prohibitive memory resources on environments such as graphics hardware. In this paper, we bypass this problem by introducing a sparse matrix representation for unstructured grids which not only reduces the memory storage requirements but also cuts down on the bulk of data movement from global storage to the compute units. In order to take full advantage of the proposed representation, we augment ordinary matrix multiplication by means of action maps, local maps which encode the desired interaction between grid vertices. In this way, geometric computations and topological modifications translate into concise linear algebra operations. In our algorithmic formulation, we capitalize on the nature of sparse matrix-vector multiplication which allows avoiding explicit transpose computation and storage. Furthermore, we develop an efficient vectorization to the demanding assembly process of standard graph and finite element matrices.
    Computer Graphics Forum / Eurographics (EG'17), 2017
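To give a flavor of the "mesh as sparse matrix" idea, here is a deliberately tiny illustration (not the paper's actual representation or action maps): face-vertex incidence stored in coordinate form, with vertex valences falling out of a single SpMV-like pass over the entries, i.e., a topological quantity obtained as a linear algebra operation.

```python
# Hypothetical illustration: face-vertex incidence as a COO sparse matrix,
# entry (f, v) = 1 iff face f uses vertex v.
faces = [(0, 1, 2), (0, 2, 3)]   # two triangles sharing edge (0, 2)
num_vertices = 4

entries = [(f, v) for f, tri in enumerate(faces) for v in tri]

# "Transposed SpMV" with a ones vector: scatter one count per incidence.
# On a GPU this += would be an atomic add.
valence = [0] * num_vertices
for f, v in entries:
    valence[v] += 1

# valence == [2, 1, 2, 1]: vertices 0 and 2 sit on both triangles.
```

The paper goes much further (action maps let a single multiplication also carry out geometric computation and topological edits), but the payoff is the same: no separate adjacency bookkeeping structure is needed.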
  • J18
    Hierarchical Bucket Queuing for Fine-Grained Priority Scheduling on the GPU
    Bernhard Kerbl, Michael Kenzel, Dieter Schmalstieg, Hans-Peter Seidel, Markus Steinberger:
    Hierarchical Bucket Queuing for Fine-Grained Priority Scheduling on the GPU
    Abstract: While the modern graphics processing unit (GPU) offers massive parallel compute power, the ability to influence the scheduling of these immense resources is severely limited. Therefore, the GPU is widely considered to be only suitable as an externally controlled co-processor for homogeneous workloads which greatly restricts the potential applications of GPU computing. To address this issue, we present a new method to achieve fine-grained priority scheduling on the GPU: hierarchical bucket queuing. By carefully distributing the workload among multiple queues and efficiently deciding which queue to draw work from next, we enable a variety of scheduling strategies. These strategies include fair-scheduling, earliest-deadline-first scheduling and user-defined dynamic priority scheduling. In a comparison with a sorting-based approach, we reveal the advantages of hierarchical bucket queuing over previous work. Finally, we demonstrate the benefits of using priority scheduling in real-world applications by example of path tracing and foveated micropolygon rendering.
    Computer Graphics Forum, 2016
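The core building block can be sketched in a few lines; this is a serial toy with a made-up class name, not the hierarchical, GPU-resident version from the paper. Work is distributed among per-priority buckets, and dequeuing scans buckets from most to least urgent instead of sorting items, which is what makes fine-grained priorities affordable:

```python
from collections import deque

class BucketQueue:
    """Toy bucket queue: O(1) enqueue; dequeue scans a small, fixed
    number of priority buckets instead of sorting all pending work."""
    def __init__(self, num_buckets):
        self.buckets = [deque() for _ in range(num_buckets)]

    def push(self, priority, item):
        self.buckets[priority].append(item)   # priority 0 = most urgent

    def pop(self):
        for bucket in self.buckets:           # highest priority first
            if bucket:
                return bucket.popleft()
        return None

q = BucketQueue(4)
q.push(2, "shade pixel")
q.push(0, "deadline job")
q.push(2, "trace ray")
order = [q.pop(), q.pop(), q.pop()]
# order == ["deadline job", "shade pixel", "trace ray"]
```

Mapping a dynamic priority (e.g., a deadline or a foveation distance) to a bucket index is what yields the earliest-deadline-first and user-defined strategies described in the abstract.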
  • J17
    Representing and Scheduling Procedural Generation using Operator Graphs
    Pedro Boechat, Mark Dokter, Michael Kenzel, Hans-Peter Seidel, Dieter Schmalstieg, Markus Steinberger:
    Representing and Scheduling Procedural Generation using Operator Graphs
    Abstract: In this paper, we present the concept of operator graph scheduling for high performance procedural generation on the graphics processing unit (GPU). The operator graph forms an intermediate representation that describes all possible operations and objects that can arise during a specific procedural generation. While previous methods have focused on parallelizing a specific procedural approach, the operator graph is applicable to all procedural generation methods that can be described by a graph, such as L-systems, shape grammars, or stack based generation methods. Using the operator graph, we show that all partitions of the graph correspond to possible ways of scheduling a procedural generation on the GPU, including the scheduling strategies of previous work. As the space of possible partitions is very large, we describe three search heuristics, aiding an optimizer in finding the fastest valid schedule for any given operator graph. The best partitions found by our optimizer increase performance by 8 to 30x over the previous state of the art in GPU shape grammar and L-system generation.
    ACM Transactions on Graphics (SIGGRAPH Asia'16), 2016
  • C13
    How naive is naive SpMV on the GPU?
    Markus Steinberger, Andreas Derler, Rhaleb Zayer, Hans-Peter Seidel:
    How naive is naive SpMV on the GPU?
    Abstract: Sparse matrix vector multiplication (SpMV) is the workhorse for a wide range of linear algebra computations. In a serial setting, naive implementations for direct multiplication and transposed multiplication achieve very competitive performance. In parallel settings, especially on graphics hardware, it is widely believed that naive implementations cannot reach the performance of highly tuned parallel implementations and complex data formats. Most often, the cost for data conversion to these specialized formats as well as the cost for transpose operations are neglected, as they do not arise in all algorithms. In this paper, we revisit the naive implementation of SpMV for the GPU. Relying on recent advances in GPU hardware, such as fast hardware supported atomic operations and better cache performance, we show that a naive implementation can reach the performance of state-of-the-art SpMV implementations. In case the cost of format conversion and transposition cannot be amortized over many SpMV operations, a naive implementation can even outperform state-of-the-art implementations significantly. Experimental results over a variety of data sets suggest that the adaptation of the naive serial implementation to the GPU is not as inferior as it used to be on previous hardware generations. The integration of some naive strategies can potentially speed up state-of-the-art GPU SpMV implementations, especially in the transpose case.
    HPEC '16 Best Paper Nominee
    High Performance Extreme Computing, 2016
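For readers unfamiliar with the two products the paper discusses, here is a serial sketch of naive CSR SpMV. This is not the paper's code; the `+=` in the transposed variant stands in for the hardware-supported atomic add a GPU implementation would use, which is exactly the feature the paper argues has become fast enough:

```python
def spmv(row_ptr, cols, vals, x):
    """Direct CSR product y = A x: one gather per row."""
    y = [0.0] * (len(row_ptr) - 1)
    for r in range(len(y)):
        for i in range(row_ptr[r], row_ptr[r + 1]):
            y[r] += vals[i] * x[cols[i]]
    return y

def spmv_t(row_ptr, cols, vals, x, num_cols):
    """Transposed product y = A^T x without ever building A^T:
    each stored entry scatters into y (an atomic add on the GPU)."""
    y = [0.0] * num_cols
    for r in range(len(row_ptr) - 1):
        for i in range(row_ptr[r], row_ptr[r + 1]):
            y[cols[i]] += vals[i] * x[r]
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form:
row_ptr, cols, vals = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
assert spmv(row_ptr, cols, vals, [1.0, 1.0, 1.0]) == [3.0, 3.0]
assert spmv_t(row_ptr, cols, vals, [1.0, 2.0], 3) == [1.0, 6.0, 2.0]
```

The scatter variant is what lets the naive approach skip both explicit transposition and conversion to a specialized format, which is where the paper's end-to-end wins come from.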

  • C12
    Visualization-Guided Evaluation of Simulated Minimally Invasive Cancer Treatment
    Philip Voglreiter, Michael Hofmann, Christoph Ebner, Roberto Blanco Sequeiros, Horst Rupert Portugaller, Jurgen Fütterer, Michael Moche, Markus Steinberger, Dieter Schmalstieg:
    Visualization-Guided Evaluation of Simulated Minimally Invasive Cancer Treatment
    Abstract: We present a visualization application supporting interventional radiologists during analysis of simulated minimally invasive cancer treatment. The current clinical practice employs only rudimentary, manual measurement tools. Our system provides visual support throughout three evaluation stages, starting with determining prospective treatment success of the simulation parameterization. In case of insufficiencies, Stage 2 includes a simulation scalar field for determining a new configuration of the simulation. For complex cases, where Stage 2 does not lead to a decisive strategy, Stage 3 reinforces analysis of interdependencies of scalar fields via bivariate visualization. Our system is designed to be immediately applicable in medical practice. We analyze the design space of potentially useful visualization techniques and appraise their effectiveness in the context of our design goals. Furthermore, we present a user study, which reveals the disadvantages of manual analysis in the measurement stage of evaluation and highlights the demand for computer support through our system.
    Visual Computing for Biology and Medicine, 2016
  • J16
    Fast ANN for High-Quality Collaborative Filtering
    Yun-Ta Tsai, Markus Steinberger, Dawid Pajak, Kari Pulli:
    Fast ANN for High-Quality Collaborative Filtering
    Abstract: Collaborative filtering collects similar patches, jointly filters them and scatters the output back to input patches; each pixel gets a contribution from each patch that overlaps with it, allowing signal reconstruction from highly corrupted data. Exploiting self-similarity, however, requires finding matching image patches, which is an expensive operation. We propose a GPU-friendly approximate-nearest-neighbor (ANN) algorithm that produces high-quality results for any type of collaborative filter. We evaluate our ANN search against state-of-the-art ANN algorithms in several application domains. Our method is orders of magnitude faster, yet provides similar or higher quality results than the previous work.
    Computer Graphics Forum, 2016
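The cluster-then-query idea can be caricatured in one dimension; this is a hypothetical sketch with made-up function names, whereas real patches are high-dimensional and the paper's clustering runs in parallel per image tile. Patches are recursively bisected until clusters are small, and the exact nearest-neighbor search then only looks inside one cluster:

```python
def split_clusters(patches, max_size):
    """Recursively bisect (here: at the median of a 1-D key) until
    every cluster holds at most max_size patches."""
    if len(patches) <= max_size:
        return [patches]
    s = sorted(patches)
    mid = len(s) // 2
    return split_clusters(s[:mid], max_size) + split_clusters(s[mid:], max_size)

def nearest_within_cluster(clusters, query):
    """Pick the cluster whose median is closest to the query, then do
    an exact search inside it: approximate overall, but cheap."""
    c = min(clusters, key=lambda c: abs(c[len(c) // 2] - query))
    return min(c, key=lambda p: abs(p - query))

patches = [0.1, 0.9, 0.15, 0.85, 0.12, 0.88, 0.5]
clusters = split_clusters(patches, max_size=3)
match = nearest_within_cluster(clusters, 0.13)
# match == 0.12
```

Because the collaborative filter only needs *similar* patches, not the provably nearest ones, restricting the search to one small cluster trades negligible quality for a large speedup, which is the paper's central bargain.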
  • J15
    Fast volume reconstruction from motion corrupted stacks of 2D slices
    Bernhard Kainz, Markus Steinberger, Wolfgang Wein, Maria Kuklisova-Murgasova, Christina Malamateniou, Kevin Keraudren, Thomas Torsney-Weir, Mary Rutherford, Paul Aljabar, Joseph V Hajnal, Daniel Rueckert:
    Fast volume reconstruction from motion corrupted stacks of 2D slices
    Abstract: Capturing an enclosing volume of moving subjects and organs using fast individual image slice acquisition has shown promise in dealing with motion artefacts. Motion between slice acquisitions results in spatial inconsistencies that can be resolved by slice-to-volume reconstruction (SVR) methods to provide high quality 3D image data. Existing algorithms are, however, typically very slow, specialised to specific applications and rely on approximations, which impedes their potential clinical use. In this paper, we present a fast multi-GPU accelerated framework for slice-to-volume reconstruction. It is based on optimised 2D/3D registration, super-resolution with automatic outlier rejection and an additional (optional) intensity bias correction. We introduce a novel and fully automatic procedure for selecting the image stack with least motion to serve as an initial registration target. We evaluate the proposed method using artificial motion corrupted phantom data as well as clinical data, including tracked freehand ultrasound of the liver and fetal Magnetic Resonance Imaging. We achieve speed-up factors greater than 30 compared to a single CPU system and greater than 10 compared to currently available state-of-the-art multi-core CPU methods. We ensure high reconstruction accuracy by exact computation of the point-spread function for every input data point, which has not previously been possible due to computational limitations. Our framework and its implementation are scalable for available computational infrastructures and tests show a speed-up factor of 1.70 for each additional GPU. This paves the way for the online application of image based reconstruction methods during clinical examinations. The source code for the proposed approach is publicly available.
    IEEE Transactions on Medical Imaging, 2015
  • J14
    Interactive Disassembly Planning for Complex Objects
    Bernhard Kerbl, Denis Kalkofen, Markus Steinberger, Dieter Schmalstieg:
    Interactive Disassembly Planning for Complex Objects
    Abstract: We present an approach for the automatic generation, interactive exploration and real-time modification of disassembly procedures for complex, multipartite CAD data sets. In order to lift the performance barriers prohibiting interactive disassembly planning, we run a detailed analysis on the input model to identify recurring part constellations and efficiently determine blocked part motions in parallel on the GPU. Building on the extracted information, we present an interface for computing and editing extensive disassembly sequences in real-time while considering user-defined constraints and avoiding unstable configurations. To evaluate the performance of our C++/CUDA implementation, we use a variety of openly available CAD data sets, ranging from simple to highly complex. In contrast to previous approaches, our work enables interactive disassembly planning for objects which consist of several thousand parts and require cascaded translations during part removal.
    Computer Graphics Forum / Eurographics (EG'15), 2015
  • J13
    An overview of dynamic resource scheduling on graphics processors
    Markus Steinberger:
    An overview of dynamic resource scheduling on graphics processors
    Abstract: In this paper, we present a series of scheduling approaches targeted for massively parallel architectures, which in combination allow a wider range of algorithms to be executed on modern graphics processors. At first, we describe a new processing model which enables the efficient execution of dynamic, irregular workloads. Then, we present the currently fastest queuing algorithm for graphics processors, the most efficient dynamic memory allocator for massively parallel architectures, and the only autonomous scheduler for graphics processing units that can dynamically support different granularities of parallelism. Finally, we show how these scheduling approaches help advance the state of the art in rendering, visualization, and procedural modeling.
    it - Information Technology, 2015
  • C11
    Reyes Rendering on the GPU
    Martin Sattlecker, Markus Steinberger:
    Reyes Rendering on the GPU
    Abstract: In this paper we investigate the possibility of real-time Reyes rendering with advanced effects such as displacement mapping and multidimensional rasterization on current graphics hardware. We describe a first GPU Reyes implementation executing within an autonomously executing persistent Megakernel. To support high quality rendering, we integrate displacement mapping into the renderer, which has only marginal impact on performance. To investigate rasterization for Reyes, we start with an approach similar to nearest neighbor filling, before presenting a precise sampling algorithm. This algorithm can be enhanced to support motion blur and depth of field using three dimensional sampling. To evaluate the performance quality trade-off of these effects, we compare three approaches: coherent sampling across time and on the lens, essentially overlaying images; randomized sampling along all dimensions; and repetitive randomization, in which the randomization pattern is repeated among subgroups of pixels. We evaluate all approaches, showing that high quality images can be generated with interactive to real-time refresh rates for advanced Reyes features.
    Spring Conference on Computer Graphics, 2015
  • PA01
    Efficient Approximate-Nearest-Neighbor (ANN) Search for High-Quality Collaborative Filtering
    Dawid Pajak, Yun-Ta Tsai, Markus Steinberger:
    Efficient Approximate-Nearest-Neighbor (ANN) Search for High-Quality Collaborative Filtering
    Abstract: A computer implemented method of performing an approximate-nearest-neighbor search is disclosed. The method comprises dividing an image into a plurality of tiles. Further, for each of the plurality of tiles, perform the following in parallel on a processor: (a) dividing image patches into a plurality of clusters, wherein each cluster comprises similar image patches, and wherein the dividing continues recursively until a size of a cluster is below a threshold value; (b) performing a nearest-neighbor query within each of the plurality of clusters; and (c) performing collaborative filtering in parallel for each image patch, wherein the collaborative filtering aggregates and processes nearest neighbor image patches from a same cluster containing a respective image patch to form an output image.
    US Patent, 2015
  • J12
    Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
    Markus Steinberger, Michael Kenzel, Pedro Boechat, Bernhard Kerbl, Mark Dokter, Dieter Schmalstieg:
    Whippletree: Task-based Scheduling of Dynamic Workloads on the GPU
    ACM Transactions on Graphics (SIGGRAPH Asia'14), 2014
  • J11
    FlexISP: A flexible camera image processing framework
    Felix Heide, Markus Steinberger, Yun-Ta Tsai, Nasa Rouf, Dawid Pajak, Dikpal Reddy, Orazio Gallo, Jing Liu, Wolfgang Heidrich, Karen Egiazarian, Jan Kautz, Kari Pulli:
    FlexISP: A flexible camera image processing framework
    Abstract: Conventional pipelines for capturing, displaying, and storing images are usually defined as a series of cascaded modules, each responsible for addressing a particular problem. While this divide-and-conquer approach offers many benefits, it also introduces a cumulative error, as each step in the pipeline only considers the output of the previous step, not the original sensor data. We propose an end-to-end system that is aware of the camera and image model, enforces natural-image priors, while jointly accounting for common image processing steps like demosaicking, denoising, deconvolution, and so forth, all directly in a given output representation (e.g., YUV, DCT). Our system is flexible and we demonstrate it on regular Bayer images as well as images from custom sensors. In all cases, we achieve large improvements in image quality and signal reconstruction compared to state-of-the-art techniques. Finally, we show that our approach is capable of very efficiently handling high-resolution images, making even mobile implementations feasible.
    ACM Transactions on Graphics (SIGGRAPH Asia'14), 2014
  • C10
    Fast ANN for High-Quality Collaborative Filtering
    Yun-Ta Tsai, Markus Steinberger, Dawid Pajak, Kari Pulli:
    Fast ANN for High-Quality Collaborative Filtering
    Abstract: Collaborative filtering collects similar patches, jointly filters them, and scatters the output back to input patches; each pixel gets a contribution from each patch that overlaps with it, allowing signal reconstruction from highly corrupted data. Exploiting self-similarity, however, requires finding matching image patches, which is an expensive operation. We propose a GPU-friendly approximated-nearest-neighbor algorithm that produces high-quality results for any type of collaborative filter. We evaluate our ANN search against state-of-the-art ANN algorithms in several application domains. Our method is orders of magnitude faster, yet provides similar or higher-quality results than the previous work.
    HPG'14 Best Paper Award
    High Performance Graphics (HPG '14), 2014

  • O02
    Dynamisches Ressourcen Scheduling auf Grafik Prozessoren
    Markus Steinberger:
    Dynamisches Ressourcen Scheduling auf Grafik Prozessoren
    Ausgezeichnete Informatikdissertationen 2013 (German), 2014
  • J10
    Parallel Irradiance Caching for Interactive Monte-Carlo Direct Volume Rendering
    Rostislav Khlebnikov, Philip Voglreiter, Markus Steinberger, Bernhard Kainz, Dieter Schmalstieg:
    Parallel Irradiance Caching for Interactive Monte-Carlo Direct Volume Rendering
    Abstract: We propose a technique to build the irradiance cache for isotropic scattering simultaneously with Monte Carlo progressive direct volume rendering on a single GPU, which allows us to achieve an up to fourfold increase in convergence rate for complex scenes with arbitrary sources of light. We use three procedures that run concurrently on a single GPU. The first is the main rendering procedure. The second procedure computes new cache entries, and the third one corrects the errors that may arise after creation of new cache entries. We propose two distinct approaches to allow massive parallelism of cache entry creation. In addition, we show a novel extrapolation approach which outputs high quality irradiance approximations and a suitable prioritization scheme to increase the convergence rate by dedicating more computational power to more complex rendering areas.
    Computer Graphics Forum / EuroVis, 2014
  • C09
    Show Me the Invisible: Visualizing Hidden Content
    Thomas Geymayer, Markus Steinberger, Alexander Lex, Marc Streit, Dieter Schmalstieg:
    Show Me the Invisible: Visualizing Hidden Content
    Abstract: Content on computer screens is often inaccessible to users because it is hidden, e.g., occluded by other windows, outside the viewport, or overlooked. In search tasks, the efficient retrieval of sought content is important. Current software, however, only provides limited support to visualize hidden occurrences and rarely supports search synchronization crossing application boundaries. To remedy this situation, we introduce two novel visualization methods to guide users to hidden content. Our first method generates awareness for occluded or out-of-viewport content using see-through visualization. For content that is either outside the screen's viewport or for data sources not opened at all, our second method shows off-screen indicators and an on-demand smart preview. To reduce the chances of overlooking content, we use visual links, i.e., visible edges, to connect the visible content or the visible representations of the hidden content. We show the validity of our methods in a user study, which demonstrates that our technique enables a faster localization of hidden content compared to traditional search functionality and thereby assists users in information retrieval tasks.
    CHI '14 Honorable Mention Award
    SIGCHI Conference on Human Factors in Computing Systems (CHI'14), 2014

  • J09
    On-the-fly Generation and Rendering of Infinite Cities on the GPU
    Markus Steinberger, Michael Kenzel, Bernhard Kainz, Peter Wonka, Dieter Schmalstieg:
    On-the-fly Generation and Rendering of Infinite Cities on the GPU
    Abstract: In this paper, we present a new approach for shape-grammar-based generation and rendering of huge cities in real-time on the graphics processing unit (GPU). Traditional approaches rely on evaluating a shape grammar and storing the geometry produced as a preprocessing step. During rendering, the pregenerated data is then streamed to the GPU. By interweaving generation and rendering, we overcome the problems and limitations of streaming pregenerated data. Using our methods of visibility pruning and adaptive level of detail, we are able to dynamically generate only the geometry needed to render the current view in real-time directly on the GPU. We also present a robust and efficient way to dynamically update a scene's derivation tree and geometry, enabling us to exploit frame-to-frame coherence. Our combined generation and rendering is significantly faster than all previous work. For detailed scenes, we are capable of generating geometry more rapidly than even just copying pregenerated data from main memory, enabling us to render cities with thousands of buildings at up to 100 frames per second, even with the camera moving at supersonic speed.
    Computer Graphics Forum / Eurographics (EG'14), 2014
  • J08
    Parallel Generation of Architecture on the GPU
    Markus Steinberger, Michael Kenzel, Bernhard Kainz, Jörg Müller, Peter Wonka, Dieter Schmalstieg:
    Parallel Generation of Architecture on the GPU
    Abstract: In this paper, we present a novel approach for the parallel evaluation of procedural shape grammars on the graphics processing unit (GPU). Unlike previous approaches that are either limited in the kind of shapes they allow, the amount of parallelism they can take advantage of, or both, our method supports state of the art procedural modeling including stochasticity and context-sensitivity. To increase parallelism, we explicitly express independence in the grammar, reduce inter-rule dependencies required for context-sensitive evaluation, and introduce intra-rule parallelism. Our rule scheduling scheme avoids unnecessary back and forth between CPU and GPU and reduces round trips to slow global memory by dynamically grouping rules in on-chip shared memory. Our GPU shape grammar implementation is multiple orders of magnitude faster than the standard in CPU-based rule evaluation, while offering equal expressive power. In comparison to the state of the art in GPU shape grammar derivation, our approach is nearly 50 times faster, while adding support for geometric context-sensitivity.
    Honorable Mention Award (among top 3 papers)
    Computer Graphics Forum / Eurographics (EG'14), 2014
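A toy view of the problem being scheduled here (hypothetical rules and names, not the paper's grammar system): a derivation is a queue of (symbol, parameter) tasks, and the GPU question is how to batch same-typed tasks so they can run as parallel groups instead of ping-ponging between CPU and GPU.

```python
from collections import deque

# Hypothetical toy grammar: a facade splits into floors, floors into windows.
rules = {
    "facade": lambda h: [("floor", 1) for _ in range(h)],
    "floor":  lambda w: [("window", 1) for _ in range(w)],
}

def derive(axiom):
    """Grammar derivation as a flat work queue of (symbol, param) tasks.
    On the GPU, batches of same-typed tasks would execute as one group
    in shared memory rather than one at a time."""
    queue = deque([axiom])
    terminals = []
    while queue:
        symbol, param = queue.popleft()
        if symbol in rules:
            queue.extend(rules[symbol](param))   # expand non-terminal
        else:
            terminals.append(symbol)             # geometry-producing leaf
    return terminals

windows = derive(("facade", 3))
# windows == ["window", "window", "window"]
```

In this serial sketch the queue hides all parallelism; the paper's contribution is precisely the scheduling scheme that groups such rule expansions for massively parallel execution.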

  • J07
    Noise-based volume rendering for the visualization of multivariate volumetric data
    Rostislav Khlebnikov, Bernhard Kainz, Markus Steinberger, Dieter Schmalstieg:
    Noise-based volume rendering for the visualization of multivariate volumetric data
    Abstract: Analysis of multivariate data is of great importance in many scientific disciplines. However, visualization of 3D spatially-fixed multivariate volumetric data is a very challenging task. In this paper we present a method that allows simultaneous real-time visualization of multivariate data. We redistribute the opacity within a voxel to improve the readability of the color defined by a regular transfer function, and to maintain the see-through capabilities of volume rendering. We use predictable procedural noise (random-phase Gabor noise) to generate a high-frequency redistribution pattern and construct an opacity mapping function, which allows partitioning the available space among the displayed data attributes. This mapping function is appropriately filtered to avoid aliasing, while maintaining transparent regions. We show the usefulness of our approach on various data sets and with different example applications. Furthermore, we evaluate our method by comparing it to other visualization techniques in a controlled user study. Overall, the results of our study indicate that users are much more accurate in determining exact data values with our novel 3D volume visualization method. Significantly lower error rates for reading data values and high subjective ranking of our method imply that it has a high chance of being adopted for the purpose of visualization of multivariate 3D data.
    Transactions on Visualization and Computer Graphics (Vis'13), 2013
  • P02
    Volume Rendering with Advanced GPU Scheduling Strategies
    Philip Voglreiter, Markus Steinberger, Rostislav Khlebnikov, Bernhard Kainz, Dieter Schmalstieg:
    Volume Rendering with Advanced GPU Scheduling Strategies
    Abstract: Modern GPUs are powerful enough to enable interactive display of high-quality volume data even though many volume rendering methods do not present a natural fit for current GPU hardware. However, a vast amount of computational power remains unused due to inefficient use of the available hardware. In this work, we demonstrate how advanced scheduling methods can be employed to implement volume rendering algorithms in a way that better utilizes the GPU, using three different state-of-the-art volume rendering techniques as examples.
    VIS'13 Honorable Mention Poster Award
    IEEE Scientific Visualization poster (Vis'13), 2013

  • C08
    Adaptive Ghosted Views for Augmented Reality
    Denis Kalkofen, Eduardo Veas, Stefanie Zollmann, Markus Steinberger, Dieter Schmalstieg:
    Adaptive Ghosted Views for Augmented Reality
    Abstract: In Augmented Reality (AR), ghosted views allow a viewer to explore hidden structure within the real-world environment. A body of previous work has explored which features are suitable to support the structural interplay between occluding and occluded elements. However, the dynamics of AR environments pose serious challenges to the presentation of ghosted views. While a model of the real world may help determine distinctive structural features, changes in appearance or illumination detriment the composition of occluding and occluded structure. In this paper, we present an approach that considers the information value of the scene before and after generating the ghosted view. Hereby, a contrast adjustment of preserved occluding features is calculated, which adaptively varies their visual saliency within the ghosted view visualization. This allows us to not only preserve important features, but to also support their prominence after revealing occluded structure, thus achieving a positive effect on the perception of ghosted views.
    International Symposium on Mixed and Augmented Reality (ISMAR'13), 2013
  • C07
    OmniKinect: Real-Time Dense Volumetric Data Acquisition and Applications
    Bernhard Kainz, Stefan Hauswiesner, Gerhard Reitmayr, Markus Steinberger, Raphael Grasset, Lukas Gruber, Eduardo Veas, Denis Kalkofen, Hartmut Seichter, Dieter Schmalstieg:
    OmniKinect: Real-Time Dense Volumetric Data Acquisition and Applications
    Abstract: Real-time three-dimensional acquisition of real-world scenes has many important applications in computer graphics, computer vision and human-computer interaction. Inexpensive depth sensors such as the Microsoft Kinect facilitate the development of such applications. However, this technology is still relatively recent, and no detailed studies on its scalability to dense and view-independent acquisition have been reported. This paper addresses the question of what can be done with a larger number of Kinects used simultaneously. We describe an interference-reducing physical setup, a calibration procedure and an extension to the KinectFusion algorithm, which produces high quality volumetric reconstructions from multiple Kinects whilst overcoming systematic errors in the depth measurements. We also report on enhancing image based visual hull rendering by depth measurements, and compare the results to KinectFusion. Our system provides practical insight into achievable spatial and radial range and into bandwidth requirements for depth data acquisition. Finally, we present a number of practical applications of our system.
    Symposium On Virtual Reality Software And Technology (VRST'12), 2012
  • J06
    Softshell: Dynamic Scheduling on GPUs
    Markus Steinberger, Bernhard Kainz, Bernhard Kerbl, Stefan Hauswiesner, Michael Kenzel, Dieter Schmalstieg:
    Softshell: Dynamic Scheduling on GPUs
    Abstract: In this paper we present Softshell, a novel execution model for devices composed of multiple processing cores operating in a single instruction, multiple data fashion, such as graphics processing units (GPUs). The Softshell model is intuitive and more flexible than the kernel-based adaptation of the stream processing model, which is currently the dominant model for general purpose GPU computation. Using the Softshell model, algorithms with a relatively low local degree of parallelism can execute efficiently on massively parallel architectures. Softshell has the following distinct advantages: (1) work can be dynamically issued directly on the device, eliminating the need for synchronization with an external source, i.e., the CPU; (2) its three-tier dynamic scheduler supports arbitrary scheduling strategies, including dynamic priorities and real-time scheduling; and (3) the user can influence, pause, and cancel work already submitted for parallel execution. The Softshell processing model thus brings capabilities to GPU architectures that were previously only known from operating-system designs and reserved for CPU programming. As a proof of our claims, we present a publicly available implementation of the Softshell processing model realized on top of CUDA. The benchmarks of this implementation demonstrate that our processing model is easy to use and also performs substantially better than the state-of-the-art kernel-based processing model for problems that have been difficult to parallelize in the past.
    ACM Transactions on Graphics (SIGGRAPH Asia'12), 2012
  • C06
    Volumetric Real-Time Particle-Based Representation of Large Unstructured Tetrahedral Polygon Meshes
    Philip Voglreiter, Markus Steinberger, Dieter Schmalstieg, Bernhard Kainz:
    Volumetric Real-Time Particle-Based Representation of Large Unstructured Tetrahedral Polygon Meshes
    Abstract: In this paper we propose a particle-based volume rendering approach for unstructured, three-dimensional, tetrahedral polygon meshes. We stochastically generate millions of particles per second and project them on the screen in real-time. In contrast to previous rendering techniques of tetrahedral volume meshes, our method does not need a prior depth sorting of geometry. Instead, the rendered image is generated by choosing particles closest to the camera. Furthermore, we use spatial superimposition. Each pixel is constructed from multiple subpixels. This approach not only increases projection accuracy, but also allows a combination of subpixels into one superpixel that creates the well-known translucency effect of volume rendering. We show that our method is fast enough for the visualization of unstructured three-dimensional grids with hard real-time constraints and that it scales well for a high number of particles.
    Mesh Processing in Medical Image Analysis (MICCAI'12), 2012
  • J05
    Procedural Texture Synthesis for Zoom-Independent Visualization of Multivariate Data
    Rostislav Khlebnikov, Bernhard Kainz, Markus Steinberger, Marc Streit, Dieter Schmalstieg:
    Procedural Texture Synthesis for Zoom-Independent Visualization of Multivariate Data
    Abstract: Simultaneous visualization of multiple continuous data attributes in a single visualization is a task that is important for many application areas. Unsurprisingly, many methods have been proposed to solve this task. However, the behavior of such methods during the exploration stage, when the user tries to understand the data with panning and zooming, has not been given much attention. In this paper, we propose a method that uses procedural texture synthesis to create zoom-independent visualizations of three scalar data attributes. The method is based on random-phase Gabor noise, whose frequency is adapted for the visualization of the first data attribute. We ensure that the resulting texture frequency lies in the range that is perceived well by the human visual system at any zoom level. To enhance the perception of this attribute, we also apply a specially constructed transfer function that is based on statistical properties of the noise. Additionally, the transfer function is constructed in a way that it does not introduce any aliasing to the texture. We map the second attribute to the texture orientation. The third attribute is color coded and combined with the texture by modifying the value component of the HSV color model. The necessary contrast needed for texture and color perception was determined in a user study. In addition, we conducted a second user study that shows significant advantages of our method over current methods with similar goals. We believe that our method is an important step towards creating methods that not only succeed in visualizing multiple data attributes, but also adapt to the behavior of the user during the data exploration stage.
    Computer Graphics Forum (EuroVis '12), 2012
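As background for the abstract above, a minimal sketch of random-phase Gabor noise, the sparse-convolution noise the method builds on: a sum of randomly placed Gabor kernels with random phase. All constants here (frequency, bandwidth, impulse count) are illustrative; in the paper the frequency is tied to the first data attribute and adapted to the zoom level.

```python
import math
import random

# One Gabor kernel: Gaussian envelope times an oriented cosine carrier.
def gabor_kernel(dx, dy, a, f0, omega, phi):
    envelope = math.exp(-math.pi * a * a * (dx * dx + dy * dy))
    carrier = math.cos(
        2.0 * math.pi * f0 * (dx * math.cos(omega) + dy * math.sin(omega)) + phi
    )
    return envelope * carrier

# Random-phase Gabor noise: sum kernels at random positions with random
# phase. A real implementation evaluates only nearby impulses per cell.
def gabor_noise(x, y, f0=0.0625, omega=0.8, a=0.05, n_impulses=64, seed=1):
    rng = random.Random(seed)  # fixed seed -> deterministic texture
    value = 0.0
    for _ in range(n_impulses):
        px, py = rng.uniform(0.0, 64.0), rng.uniform(0.0, 64.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        value += gabor_kernel(x - px, y - py, a, f0, omega, phi)
    return value

v = gabor_noise(32.0, 32.0)
```

Because the impulses are seeded, the same point always evaluates to the same value, which is what makes the texture stable under panning and zooming.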
  • J04
    Interactive Self-Organizing Windows
    Markus Steinberger, Manuela Waldner, Dieter Schmalstieg:
    Interactive Self-Organizing Windows
    Abstract: In this paper, we present the design and implementation of a dynamic window management technique that changes the perception of windows as fixed-sized rectangles. The primary goal of self-organizing windows is to automatically display the most relevant information for a user's current activity, which removes the burden of organizing and arranging windows from the user. We analyze the image-based representation of each window and identify coherent pieces of information. The windows are then automatically moved, scaled and composed in a content-aware manner to fit the most relevant information into the limited area of the screen. During the design process, we consider findings from previous experiments and show how users can benefit from our system. We also describe how the immense processing power of current graphics processing units can be exploited to build an interactive system that finds an optimal solution within the complex design space of all possible window transformations in real time.
    Computer Graphics Forum / Eurographics (EG'12), 2012
  • C05
    ScatterAlloc: Massively Parallel Dynamic Memory Allocation for the GPU
    Markus Steinberger, Michael Kenzel, Bernhard Kainz, Dieter Schmalstieg:
    ScatterAlloc: Massively Parallel Dynamic Memory Allocation for the GPU
    Abstract: In this paper, we analyze the special requirements of a dynamic memory allocator that is designed for massively parallel architectures such as Graphics Processing Units (GPUs). We show that traditional strategies, which work well on CPUs, are not well suited for the use on GPUs and present the thorough design of ScatterAlloc, which can efficiently deal with hundreds of requests in parallel. Our allocator greatly reduces collisions and congestion by scattering memory requests based on hashing. We analyze ScatterAlloc in terms of allocation speed, data access time and fragmentation, and compare it to current state-of-the-art allocators, including the one provided with the NVIDIA CUDA toolkit. Our results show that ScatterAlloc clearly outperforms these other approaches, yielding speed-ups between 10 and 100.
    Innovative Parallel Computing (InPar'12), 2012
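The core idea of scattering requests via hashing can be sketched in a few lines: hash the request size together with the requesting thread's id to pick a starting page, then probe onward if that page is full, so that concurrent identical requests start at different pages instead of contending on one shared free list. The page layout, hash constants and names below are illustrative stand-ins, not the paper's actual data structures.

```python
NUM_PAGES = 64
PAGE_SIZE = 4096  # bytes per page (illustrative)

class PagePool:
    def __init__(self):
        # Remaining free bytes per page; a real allocator tracks
        # per-page bit fields atomically on the GPU.
        self.free = [PAGE_SIZE] * NUM_PAGES

    def _hash(self, size, thread_id):
        # Mix request size and thread id so threads with identical
        # requests still start probing at different pages.
        return (size * 38183 + thread_id * 947) % NUM_PAGES

    def alloc(self, size, thread_id):
        page = self._hash(size, thread_id)
        for _ in range(NUM_PAGES):  # linear probing over pages
            if self.free[page] >= size:
                self.free[page] -= size
                return page         # page that served the request
            page = (page + 1) % NUM_PAGES
        raise MemoryError("pool exhausted")

pool = PagePool()
# Two threads requesting the same size land on different pages,
# avoiding the congestion of a single shared free list.
p0 = pool.alloc(256, thread_id=0)
p1 = pool.alloc(256, thread_id=1)
```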
  • C04
    Multi-GPU Image-based Visual Hull Rendering
    Stefan Hauswiesner, Rostislav Khlebnikov, Markus Steinberger, Mathias Straka, Gerhard Reitmayr:
    Multi-GPU Image-based Visual Hull Rendering
    Abstract: Many virtual mirror and telepresence applications require novel viewpoint synthesis with little latency to user motion. Image-based visual hull (IBVH) rendering is capable of rendering arbitrary views from segmented images without an explicit intermediate data representation, such as a mesh or a voxel grid. By computing depth images directly from the silhouette images, it usually outperforms indirect methods. GPU-hardware accelerated implementations exist, but due to the lack of an intermediate representation no multi-GPU parallel strategies and implementations are currently available. This paper suggests three ways to parallelize the IBVH-pipeline and maps them to the sorting classification that is often applied to conventional parallel rendering systems. In addition to sort-first parallelization, we suggest a novel sort-last formulation that regards cameras as scene objects. We enhance this method’s performance by a block-based encoding of the rendering results. For interactive systems with hard real-time constraints, we combine the algorithm with a multi-frame rate (MFR) system. We suggest a combination of forward and backward image warping to improve the visual quality of the MFR rendering. We observed the runtime behavior of the suggested methods and assessed how their performance scales with respect to input and output resolutions and the number of GPUs. By using additional GPUs, we reduced rendering times by up to 60%. Multi-frame rate viewing can even be ten times faster.
    Eurographics Symposium on Parallel Graphics and Visualization (EGPGV'12), 2012
  • J03
    Ray Prioritization Using Stylization and Visual Saliency
    Markus Steinberger, Bernhard Kainz, Stefan Hauswiesner, Rostislav Khlebnikov, Denis Kalkofen, Dieter Schmalstieg:
    Ray Prioritization Using Stylization and Visual Saliency
    Abstract: This paper presents a new method to control scene sampling in complex ray-based rendering environments. It proposes to constrain image sampling density with a combination of object features, which are known to be well perceived by the human visual system, and image space saliency, which captures effects that are not based on the object’s geometry. The presented method uses Non-Photorealistic Rendering techniques for the object space feature evaluation and combines the image space saliency calculations with image warping to infer quality hints from previously generated frames. In order to map different feature types to sampling densities, we also present an evaluation of the object space and image space features’ impact on the resulting image quality. In addition, we present an efficient, adaptively aligned fractal pattern that is used to reconstruct the image from sparse sampling data. Furthermore, this paper presents an algorithm which uses our method in order to guarantee a desired minimal frame rate. Our scheduling algorithm maximizes the utilization of each given time slice by rendering features in the order of visual importance values until a time constraint is reached. We demonstrate how our method can be used to boost or stabilize the rendering time in complex ray-based image generation consisting of geometric as well as volumetric data.
    Computers & Graphics, 2012
  • C03
    Display-Adaptive Window Management for Irregular Surfaces
    Manuela Waldner, Raphael Grasset, Markus Steinberger, Dieter Schmalstieg:
    Display-Adaptive Window Management for Irregular Surfaces
    Abstract: Current projectors can easily be combined to create an everywhere display, using all suitable surfaces in offices or meeting rooms for the presentation of information. However, the resulting irregular display is not well supported by traditional desktop window managers, which are optimized for rectangular screens. In this paper, we present novel display-adaptive window management techniques, which provide semi-automatic placement for desktop elements (such as windows or icons) for users of large, irregularly shaped displays. We report results from an exploratory study, which reveals interesting emerging strategies of users in the manipulation of windows on large irregular displays and shows that the new techniques increase subjective satisfaction with the window management interface.
    Interactive Tabletops and Surfaces (ITS'11), 2011
  • J02
    Context-Preserving Visual Links
    Markus Steinberger, Manuela Waldner, Marc Streit, Alexander Lex, Dieter Schmalstieg:
    Context-Preserving Visual Links
    Abstract: Evaluating, comparing, and interpreting related pieces of information are tasks that are commonly performed during visual data analysis and in many kinds of information-intensive work. Synchronized visual highlighting of related elements is a well-known technique used to assist this task. An alternative approach, which is more invasive but also more expressive, is visual linking, in which line connections are rendered between related elements. In this work, we present context-preserving visual links as a new method for generating visual links. The method specifically aims to fulfill the following two goals: first, visual links should minimize the occlusion of important information; second, links should visually stand out from surrounding information by minimizing visual interference. We employ an image-based analysis of visual saliency to determine the important regions in the original representation. A consequence of the image-based approach is that our technique is application-independent and can be employed in a large number of visual data analysis scenarios in which the underlying content cannot or should not be altered. We conducted a controlled experiment that indicates that users can find linked elements in complex visualizations more quickly and with greater subjective satisfaction than in complex visualizations in which plain highlighting is used. Context-preserving visual links were perceived as visually more attractive than traditional visual links that do not account for the context information.
    InfoVis '11 Best Paper Award
    IEEE Transactions on Visualization and Computer Graphics (InfoVis '11), 2011

  • C02
    Stylization-based ray prioritization for guaranteed frame rates
    Bernhard Kainz, Markus Steinberger, Stefan Hauswiesner, Rostislav Khlebnikov, Dieter Schmalstieg:
    Stylization-based ray prioritization for guaranteed frame rates
    Abstract: This paper presents a new method to control graceful scene degradation in complex ray-based rendering environments. It proposes to constrain the image sampling density with object features, which are known to support the comprehension of the three-dimensional shape. The presented method uses Non-Photorealistic Rendering (NPR) techniques to extract features such as silhouettes, suggestive contours, suggestive highlights, ridges and valleys. To map different feature types to sampling densities, we also present an evaluation of the features' impact on the resulting image quality. To reconstruct the image from sparse sampling data, we use linear interpolation on an adaptively aligned fractal pattern. With this technique, we are able to present an algorithm that guarantees a desired minimal frame rate without much loss of image quality. Our scheduling algorithm maximizes the use of each given time slice by rendering features in order of their corresponding importance values until a time constraint is reached. We demonstrate how our method can be used to boost and guarantee the rendering time in complex ray-based environments consisting of geometric as well as volumetric data.
    NPAR '11 Best Paper Award in Rendering
    Non-photorealistic Animation and Rendering (NPAR '11), 2011

  • J01
    Importance-Driven Compositing Window Management
    Manuela Waldner, Markus Steinberger, Raphael Grasset, Dieter Schmalstieg:
    Importance-Driven Compositing Window Management
    Abstract: In this paper we present importance-driven compositing window management, which considers windows not only as basic rectangular shapes but also integrates the importance of the windows' content using a bottom-up visual attention model. Based on this information, importance-driven compositing optimizes the spatial window layout for maximum visibility and interactivity of occluded content in combination with see-through windows. We employ this technique for emerging window manager functions to minimize information overlap caused by popping up windows or floating toolbars and to improve the access to occluded window content. An initial user study indicates that our technique provides a more effective and satisfactory access to occluded information than the well-adopted Alt+Tab window switching technique and see-through windows without optimized spatial layout.
    CHI '11 Honorable Mention Award
    Human Factors in Computing Systems (CHI '11), 2011

  • P01
    Using Perceptual Features to Prioritize Ray-based Image Generation
    Bernhard Kainz, Markus Steinberger, Stefan Hauswiesner, Rostislav Khlebnikov, Denis Kalkofen, Dieter Schmalstieg:
    Using Perceptual Features to Prioritize Ray-based Image Generation
    Abstract: A common challenge in interactive image generation is maintaining high interactivity of the applications that use computationally demanding rendering algorithms. This is usually achieved by sacrificing some of the image quality in order to decrease the rendering time. Most such algorithms achieve interactive frame rates while trying to preserve as much image quality as possible by applying the reduction steps non-uniformly. However, high-end rendering systems, such as those presented by Parker et al. in 2010, aim to generate highly realistic images of very complex scenes. In such systems, ordinary sampling approaches often give visually unacceptable results. To optimize the ratio between the sampling rate of the scene and its resulting perceptual quality, we present a new sampling strategy which uses the information about object features that are known to support the comprehension of 3D shape. We control the sampling density by exploiting line extraction techniques commonly used in Non-Photorealistic Rendering.
    Symposium on Interactive 3D Graphics and Games 2011 (I3D), 2011
  • C01
    Wavelet-based Multiresolution Isosurface Rendering
    Markus Steinberger, Markus Grabner:
    Wavelet-based Multiresolution Isosurface Rendering
    Abstract: We present an interactive rendering method for isosurfaces in a voxel grid. The underlying trivariate function is represented as a spline wavelet hierarchy, which allows for adaptive (view-dependent) selection of the desired level-of-detail by superimposing appropriately weighted basis functions. Different root finding techniques are compared with respect to their precision and efficiency. Both wavelet reconstruction and root finding are implemented in CUDA to utilize the high computational performance of Nvidia's hardware and to obtain high quality results. We tested our methods with datasets of up to 512³ voxels and demonstrate interactive frame rates for a viewport size of up to 1024x768 pixels.
    Eurographics/IEEE VGTC Symposium on Volume Graphics, 2010
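The root-finding step compared in the paper can be illustrated with a simple bracket-and-bisect scheme along a ray: march in coarse steps until the function crosses the isovalue, then refine the crossing by bisection. This stand-in evaluates an arbitrary trivariate function rather than the paper's spline-wavelet reconstruction, and all names and step sizes are hypothetical.

```python
# Locate the first t where f(origin + t*direction) crosses the isovalue.
def find_isosurface(f, origin, direction, iso, t_max=10.0, step=0.1, iters=30):
    def sample(t):
        x, y, z = (o + t * d for o, d in zip(origin, direction))
        return f(x, y, z) - iso

    t0, v0 = 0.0, sample(0.0)
    t = step
    while t <= t_max:               # coarse march to bracket a root
        v = sample(t)
        if v0 * v <= 0.0:           # sign change: surface crossed
            lo, hi = t0, t
            for _ in range(iters):  # bisection refinement
                mid = 0.5 * (lo + hi)
                if sample(lo) * sample(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            return 0.5 * (lo + hi)
        t0, v0 = t, v
        t += step
    return None                     # ray misses the isosurface

# Unit sphere around the origin, ray along +x starting at x = -2:
# the crossing is at x = -1, i.e. t close to 1.
sphere = lambda x, y, z: x * x + y * y + z * z
t_hit = find_isosurface(sphere, (-2.0, 0.0, 0.0), (1.0, 0.0, 0.0), iso=1.0)
```

In the paper this evaluation runs per ray in CUDA against the wavelet reconstruction, which is what makes the level-of-detail and isovalue freely changeable at runtime.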
  • O01
    Multiresolution Isosurface Rendering
    Markus Steinberger:
    Multiresolution Isosurface Rendering
    Abstract: In this paper we propose a new technique for isosurface rendering of volume data. Medical data visualization, for example, relies on exact and interactive isosurface renderings. We show how to construct a multiresolution view of the data using bi-orthogonal spline wavelets and how to perform fast rendering using raycasting implemented on the GPU. This approach benefits from the properties of both the wavelet transform and the reconstruction using three-dimensional splines. The smoothness of surfaces is zoom-level independent and data can be compressed to speed up rendering while still being able to show full detail quickly. Ray evaluation is implemented in model space to enable perspective rendering. Since the isosurface is not extracted from the data beforehand, the isosurface level as well as the current resolution level can be changed without any further computation.
    CESCG '09 3rd Best Paper Award
    Central European Seminar on Computer Graphics (CESCG '09), 2009