Blog Post

GPU L1, L2, Texture Accesses

…The number of L1 cache banks used depends on the number of texels that must be accessed in parallel…

https://graphics.cs.utah.edu/research/projects/high-order-interpolation/highorderinterpolation.pdf

From “Hardware Adaptive High-Order Interpolation for Real-Time Graphics”, D.Lin et al, HPG, 2021

The A100 GPU includes 40 MB of L2 cache, which is 6.7x larger than V100 L2 cache. The L2 cache is divided into two partitions to enable higher bandwidth and lower latency memory access. Each L2 partition localizes and caches data for memory accesses from SMs in the GPCs directly connected to the partition. This structure enables A100 to deliver a 2.3x L2 bandwidth increase over V100.

The larger and faster L1 cache and shared memory unit in A100 provides 1.5x the aggregate capacity per SM compared to V100 (192 KB vs. 128 KB per SM) to deliver additional acceleration for many HPC and AI workloads.

Frame Rate Amplification

What is the point of the metaverse ?

Blog Post

libtorch C++ and libraries 

Blog Post

FFMPEG oft used commands 

Blog Post

PyTorch and CUDA on Windows 

conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

This installs a PyTorch build that is compiled against CUDA; the installed package name looks like "pytorch-1.7.1-py3.7_cuda102_cudnn7_0".

Note - the build "pytorch 1.7.1 py3.7_cuda102_cudnn7_0" (from the pytorch channel) supports the following compute capabilities: sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37. To check the compute capability of the GPU, refer to https://gpupowered.org/mygpu/
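To make the compatibility check concrete, here is a minimal Python sketch that tests whether a GPU's compute capability is covered by an arch list like the one above. The function names are hypothetical. It assumes the usual CUDA rules: compiled sm_XY code runs on GPUs with the same major version and an equal or higher minor version, and compute_XY (PTX) can be JIT-compiled for any equal or newer capability. On a machine with PyTorch installed, the inputs could likely be taken from torch.cuda.get_device_capability() and torch.cuda.get_arch_list() instead of being typed in.

```python
def parse_arch_list(arch_list):
    """Split an arch list such as 'sm_37 sm_50 compute_37' into SASS and PTX targets."""
    sass, ptx = [], []
    for entry in arch_list.split():
        kind, _, digits = entry.partition("_")
        capability = (int(digits[:-1]), int(digits[-1]))  # 'sm_75' -> (7, 5)
        (sass if kind == "sm" else ptx).append(capability)
    return sass, ptx


def is_supported(device_capability, arch_list):
    """Return True if a GPU with `device_capability` can run a build with `arch_list`."""
    sass, ptx = parse_arch_list(arch_list)
    major, minor = device_capability
    # Assumed rule: sm_XY code runs on the same major version, equal or higher minor.
    if any(m == major and n <= minor for m, n in sass):
        return True
    # Assumed rule: compute_XY (PTX) can be JIT-compiled for equal or newer GPUs.
    return any((m, n) <= (major, minor) for m, n in ptx)


# Arch list copied from the note above.
ARCHS = "sm_37 sm_50 sm_60 sm_61 sm_70 sm_75 compute_37"
print(is_supported((6, 1), ARCHS))  # a compute capability 6.1 GPU
print(is_supported((3, 5), ARCHS))  # a GPU older than anything in the list
```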

Blog Post

Identifying GPU Arch (sm_ ) for CUDA 

A simple browser-based way to identify the sm version (compute capability) of the GPU in a desktop. This information is required when compiling .cu kernels.

https://gpupowered.org/mygpu/
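Once the sm version is known, it usually ends up in the nvcc command line. The sketch below (Python, with a hypothetical helper name and a hypothetical kernel.cu file) builds the -gencode arguments for a given compute capability. It is only an illustration of how the value is used, not a complete build recipe.

```python
def gencode_flags(compute_capability, include_ptx=True):
    """Build nvcc -gencode arguments for a capability given as a string like '7.5'."""
    major, minor = compute_capability.split(".")
    arch = f"{major}{minor}"
    flags = ["-gencode", f"arch=compute_{arch},code=sm_{arch}"]
    if include_ptx:
        # Also embed PTX so that newer GPUs can JIT-compile the kernel.
        flags += ["-gencode", f"arch=compute_{arch},code=compute_{arch}"]
    return flags


# kernel.cu is a placeholder file name for illustration.
print(" ".join(["nvcc", *gencode_flags("7.5"), "kernel.cu", "-o", "kernel"]))
```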

Blog Post

LEGO GPU - 3dfx Interactive Voodoo 3D accelerator 

User "Bhaal_spawn" has created a fan page for the Voodoo, a first-of-its-kind 3D accelerator, using LEGO bricks.

https://ideas.lego.com/projects/480e824e-d651-4192-996a-937eb7b4fe98

TPOT Blog Post

TPOT Automated ML pipeline discovery

TPOT is a partial Automated Machine Learning (AutoML) toolkit that can "discover" a pipeline for a given data-set, including the feature engineering steps and the model itself. A detailed comparison with manual tuning is still needed.

The following code illustrates how TPOT can be employed for performing a simple classification task over the Iris dataset.

from tpot import TPOTClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import numpy as np

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data.astype(np.float64),
    iris.target.astype(np.float64), train_size=0.75, test_size=0.25, random_state=42)

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_iris_pipeline.py')

Running this code should discover a pipeline and export it as tpot_iris_pipeline.py.

CUDA Blog Post

No kernel image available - Tensorflow 2.3.0

When moving from earlier versions to Tensorflow 2.3.0 (rc0/rc2/release), the error below may appear.

Non-OK-status: GpuLaunchKernelstatus: Internal: no kernel image is available for execution on the device

This is because the TF team decided to ship PTX kernels only for compute capability 7.0, to reduce the binary size of the distribution.

This is outlined in the GPU section of the release notes at,

 https://github.com/tensorflow/tensorflow/releases/tag/v2.3.0 - TF 2.3 includes PTX kernels only for compute capability 7.0 to reduce the TF pip binary size. Earlier releases included PTX for a variety of older compute capabilities.

GRAPHICS Blog Post

gl-transitions with C++ and libANGLE

gl-transitions.com provides great special effects for transitions from one surface to another using GLSL (ES) shaders. These are targeted at WebGL applications, but why not use them in native (C++) applications?

I wrote up a post about how to integrate these shaders directly into native code using nengl, a wrapper for OpenGLES2 applications. It uses OpenGL ES with an EGL context on Windows desktops via glfw3 and libANGLE.

Check out the code in github for a Windows application using libANGLE at,

https://github.com/prabindh/nengl

And a more detailed post at,

https://medium.com/@prabindh/using-gl-transitions-for-effects-9e73abfc8fd5

Note - this can be used as is on Linux and other platforms that support OpenGLES2 or OpenGLES3.

CUDA Blog Post

CUDA on WSL2 announced

Windows Subsystem for Linux 2 (WSL2) provides a way to use Linux functionality in Windows itself, by running a Linux kernel in Windows.

This month, Nvidia and Microsoft announced availability of the CUDA API in WSL2, as part of the Insider Preview. This enables CUDA-based applications to run in Linux on WSL2, on Windows.

Note: These are command line applications.

More info at,

https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-cuda-in-wsl

https://ubuntu.com/blog/getting-started-with-cuda-on-ubuntu-on-wsl-2

PARABRICKS Blog Post

Parabricks installation error with Docker

If you are using Docker version 19.03.5 and nvidia-docker, installer.py is not set up to check the GPU installation correctly. This can result in the error below and a failed installation, even if Docker works correctly with the GPU in other applications/container use-cases.

"docker does not have nvidia runtime. Please add nvidia runtime to docker or install nvidia-docker. Exiting..."

installer.py requires the changes below for a successful installation.

https://github.com/prabindh/parabricks-changes/commit/b39f61b8512240bd8c3e7a903f09326fb029893f

Or the complete file below.

https://github.com/prabindh/parabricks-changes/blob/master/installer.py
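For context, the kind of check that produces this error can be sketched in Python as below. This is not the code from installer.py, and the function names are hypothetical. It assumes the check looks for an "nvidia" entry among the runtimes reported by docker info. One plausible reason for the failure (an assumption, not confirmed here) is that Docker 19.03 can expose GPUs through its --gpus option without an "nvidia" runtime being registered.

```python
import json
import subprocess


def has_nvidia_runtime(runtimes_json):
    """Return True if the JSON runtimes map reported by Docker has an 'nvidia' entry."""
    return "nvidia" in json.loads(runtimes_json)


def docker_runtimes_json():
    """Ask the local Docker daemon for its registered runtimes as JSON."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{json .Runtimes}}"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Sample inputs below are illustrative, not captured from a real machine.
print(has_nvidia_runtime('{"runc": {"path": "runc"}}'))
print(has_nvidia_runtime(
    '{"nvidia": {"path": "nvidia-container-runtime"}, "runc": {"path": "runc"}}'
))
```

On a machine with Docker installed, has_nvidia_runtime(docker_runtimes_json()) would run the same check against the real daemon.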

Further steps in the germline pipeline work as per the documentation.

Steps in the Pipeline:
  • Alignment of Reads with Reference
  • Coordinate Sorting
  • Marking Duplicate BAM Entries
  • Base Quality Score Recalibration (BQSR) of the Sample
  • Apply BQSR for the Sample
  • Germline Variant Calling

Read more at,
https://www.parabricks.com/germline/

Blog Post

Genomics with Parabricks

Analysis of Genomic data with Parabricks

NVIDIA PARABRICKS

Analyzing genomic data is computationally intensive. Time and cost are significant barriers to using genomics data for precision medicine.

The NVIDIA Parabricks Genomics Analysis Toolkit breaks down those barriers, providing GPU-accelerated genomic analysis. Data that once took days to analyze can now be done in under an hour. Choose to run specific accelerated tools or full commonly used pipelines with outputs specific to your requirements.
https://www.developer.nvidia.com/nvidia-parabricks

Blog Post

Enabling COVID research with the GPU with OpenMM and Folding@Home

What is the objective of Folding@Home COVID-19 ?
"After initial quality control and limited testing phases, Folding@home team has released an initial wave of projects simulating potentially druggable protein targets from SARS-CoV-2 (the virus that causes COVID-19) and the related SARS-CoV virus (for which more structural data is available) into full production on Folding@home. This initial wave of projects focuses on better understanding how these coronaviruses interact with the human ACE2 receptor required for viral entry into human host cells, and how researchers might be able to interfere with them through the design of new therapeutic antibodies or small molecules that might disrupt their interaction."

How does Folding@Home for COVID-19 work ?
Step 1: Download the installer for your platform at https://foldingathome.org/start-folding/
Step 2: Install on your system with default options (including screensaver, start at boot, etc.)
Step 3: After installation, the "Folding@Home" client launches automatically, or you can start it from the Start menu as a desktop application.
Step 4: Configure whether the GPU is used only when the system is idle, or also while you are using it. Configure your username and, optionally, join a team.
Locally, an application called "FAHClient.exe" starts. If the firewall needs to allow this application to communicate with the Web (for status and data updates), allow it.

What Compute Units are being used ?
Digging deeper into the running processes, it seems that the CPU, the integrated GPU, and the Nvidia GPU are all being used on the laptop, and on both GPUs OpenCL kernels are being used.

On the Intel CPU, some of the accelerated FFT primitives are being used.
These are initiated by the FahCore executables (FahCore_22 with GPU options, FahCore_a7 from the avx folder), which are launched as below:
"C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\xx\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/Core_22.fah/FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 5876 -checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 -opencl-device 0 -cuda-device 0 -gpu 0
"C:\Program Files (x86)\FAHClient/FAHCoreWrapper.exe" C:\Users\xx\AppData\Roaming\FAHClient\cores/cores.foldingathome.org/v7/win/64bit/avx/Core_a7.fah/FahCore_a7.exe -dir 00 -suffix 01 -version 705 -lifeline 5876 -checkpoint 15 -np 6

Thoughts and Followup
Looking at the usage of different engines, the Nvidia GPU can be optimised to use CUDA kernels that can potentially provide improved performance in this case. Raised an issue about this and the scheduler at https://github.com/FoldingAtHome/fah-issues/issues/1326
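As a small aid for the "What Compute Units are being used" question, the flags on the FahCore command lines shown earlier can be parsed to see which device each core targets. The Python sketch below uses hypothetical helper names and only the flag portion of those command lines. Reading -np as a CPU thread count is an assumption, and a flag without a value would be mis-parsed by this simple pattern.

```python
import re


def parse_core_flags(command_line):
    """Collect '-name value' pairs from a FahCore command line into a dict."""
    return dict(re.findall(r"(?<=\s)-([A-Za-z][\w-]*)\s+(\S+)", command_line))


def describe_device(flags):
    """Summarize the target device implied by the parsed flags."""
    if "gpu-vendor" in flags:
        return f"GPU ({flags['gpu-vendor']}), OpenCL device {flags.get('opencl-device')}"
    # Assumption: -np is the number of CPU threads used by the core.
    return f"CPU, np={flags.get('np')}"


GPU_CORE = ("FahCore_22.exe -dir 01 -suffix 01 -version 705 -lifeline 5876 "
            "-checkpoint 15 -gpu-vendor nvidia -opencl-platform 0 "
            "-opencl-device 0 -cuda-device 0 -gpu 0")
CPU_CORE = ("FahCore_a7.exe -dir 00 -suffix 01 -version 705 -lifeline 5876 "
            "-checkpoint 15 -np 6")

for line in (GPU_CORE, CPU_CORE):
    print(describe_device(parse_core_flags(line)))
```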

Blog Post

Accelerated Video Encode Decode with Nvidia GPU

Options for effectively using nvenc and nvdec

This presentation, http://on-demand.gputechconf.com/gtc/2018/presentation/s8601-nvidia-gpu-video-technologies.pdf, explains the various nvenc and nvdec options, and the pre/post processing options, available via the HW engine and the CUDA API. The target quoted is 490 fps on a GP104 GPU for a transcode session.

In addition, it explains tools for debugging usage, such as the nvidia-smi dmon command.
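As one possible way to try this out, the Python sketch below builds (but does not run) a GPU transcode command and a matching monitoring command. Using ffmpeg as the front end is an assumption, since the presentation covers the underlying engines and APIs rather than this tool. The file names and helper names are placeholders. The dmon option "-s u" is, as far as I know, the selector for utilization metrics, which include encoder and decoder usage.

```python
def transcode_command(src, dst, encoder="h264_nvenc"):
    """Build an ffmpeg command that decodes with GPU acceleration and encodes with nvenc."""
    return ["ffmpeg", "-y", "-hwaccel", "cuda", "-i", src, "-c:v", encoder, dst]


def monitor_command(samples=5):
    """Build an nvidia-smi dmon command that samples utilization a fixed number of times."""
    return ["nvidia-smi", "dmon", "-s", "u", "-c", str(samples)]


# input.mp4 and output.mp4 are placeholder file names.
print(" ".join(transcode_command("input.mp4", "output.mp4")))
print(" ".join(monitor_command()))
```

Running the monitor command in a second terminal while the transcode is active should show whether the encode and decode engines are actually being used.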

Blog Post

SLIDE - (Sub-LInear Deep learning Engine) Rice Univ

Using Locality Sensitive Hashing (LSH) to reduce training time

In this Rice University paper at MLSys 2020, https://www.cs.rice.edu/~as143/Papers/SLIDE_MLSys.pdf, the authors present a method for fast training to required levels of accuracy, using LSH, for extreme classification tasks (e.g. Amazon670).
The LSH sampling approach was introduced in the paper "LSH-Sampling Breaks the Computational Chicken-and-Egg Loop in Adaptive Stochastic Gradient Estimation".
 

  • Without Hugepages, perf drops by 30%
  • Without GMA,AVXx,SSE4.x perf drops by another 35%
  • Benefits from adaptive sampling of active neurons
     
The code is provided at https://github.com/keroro824/HashingDeepLearning
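To illustrate the idea of picking active neurons by hashing, here is a minimal, standard-library Python sketch. It is not taken from the SLIDE code base. It uses SimHash (signed random projections) as one example LSH family, with a single hash table, and all names and sizes are chosen for illustration only.

```python
import random


def make_planes(num_bits, dim, rng):
    """Random hyperplanes for signed random projection (SimHash)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_vector(vector, planes):
    """Hash code: one sign bit per hyperplane."""
    return tuple(sum(v * p for v, p in zip(vector, plane)) >= 0 for plane in planes)


def build_table(weights, planes):
    """Map each hash code to the ids of the neurons whose weight vectors land there."""
    table = {}
    for neuron_id, w in enumerate(weights):
        table.setdefault(hash_vector(w, planes), []).append(neuron_id)
    return table


def active_neurons(table, planes, activation):
    """Neurons in the input's bucket, i.e. those likely to point in a similar direction."""
    return table.get(hash_vector(activation, planes), [])


rng = random.Random(0)
dim, num_neurons = 8, 1000
weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_neurons)]
planes = make_planes(num_bits=6, dim=dim, rng=rng)
table = build_table(weights, planes)

# Query with the weight vector of neuron 3, so that neuron must be in the result.
selected = active_neurons(table, planes, weights[3])
print(f"{len(selected)} of {num_neurons} neurons selected")
```

Only the selected neurons would then be evaluated and updated for that input, which is where the sub-linear cost comes from in this simplified picture.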

Blog Post

Avoid Google Colab Disconnect

Tip for preventing Google Colab from disconnecting

From https://www.hackster.io/bandofpv/reading-eye-for-the-blind-with-nvidia-jetson-nano-8657ed
During long operations (e.g. training a model), to prevent Google Colab from disconnecting from the server, press Ctrl + Shift + I to open the inspector view. Select the Console tab and enter this:

// Click the connect button once every 60 seconds
function ClickConnect() {
  console.log("Working");
  document.querySelector("colab-toolbar-button#connect").click();
}
setInterval(ClickConnect, 60000);

Blog Post

Performance improvement with cudf for the groupby operation

Rapids cudf improvement on large CSV groupby

Test case at - https://github.com/prabindh/deepnotes/blob/master/rapids/cudf-test.py

Selected chunk size: 1000000
Running CPU, size = 1000000
Pandas Groupby Time = 0.5486793518066406
Pandas Groupby Time = 0.0010442733764648438
Pandas Groupby Time = 0.0006716251373291016
Pandas Groupby Time = 0.0006544589996337891
Pandas Groupby Time = 0.0006563663482666016
Running GPU, size = 1000000
Cudf Groupby Time = 0.015185356140136719
Cudf Groupby Time = 0.0007460117340087891
Cudf Groupby Time = 0.0007538795471191406
Cudf Groupby Time = 0.0006649494171142578
Cudf Groupby Time = 0.0006606578826904297
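In the log above, the first call is much slower than the later ones for both libraries (about 0.549 s against roughly 0.0007 s for pandas, and about 0.015 s against roughly 0.0007 s for cudf), so it helps to report the first call separately from the steady state. The Python sketch below shows one way to do that. It is not the linked test case, the helper names are hypothetical, and a pure-Python group-by stands in for the pandas and cudf calls so that it runs anywhere.

```python
import statistics
import time


def time_calls(fn, repeats=5):
    """Call fn `repeats` times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Separate the first (warm-up) call from the steady-state median of the rest."""
    return {"first": durations[0], "steady_median": statistics.median(durations[1:])}


def groupby_sum(rows):
    """Stand-in workload: a pure-Python group-by-and-sum over (key, value) pairs."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


rows = [(i % 10, i) for i in range(100_000)]
print(summarize(time_calls(lambda: groupby_sum(rows))))
```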
 

Blog Post

DeepOps - Deploying GPU clusters

DeepOps framework for deploying GPU clusters

From https://github.com/NVIDIA/deepops
The DeepOps project encapsulates best practices in the deployment of GPU server clusters and sharing single powerful nodes (such as NVIDIA DGX Systems). DeepOps can also be adapted or used in a modular fashion to match site-specific cluster needs. For example:
 

  • An on-prem, air-gapped data center of NVIDIA DGX servers where DeepOps provides end-to-end capabilities to set up the entire cluster management stack
  • An existing cluster running Kubernetes where DeepOps scripts are used to deploy Kubeflow and connect NFS storage
  • An existing cluster that needs a resource manager / batch scheduler, where DeepOps is used to install Slurm, Kubernetes, or a hybrid of both
  • A single machine where no scheduler is desired, only NVIDIA drivers, Docker, and the NVIDIA Container Runtime
     
A virtual deploy guide is also provided, to test out the deployment on a single machine.

Blog Post

Nvidia-Docker with GPU for Ubuntu 18.04.3

Nvidia-Docker setup on Ubuntu 18.04.3

This link contains steps for installing Nvidia GPU enabled Docker on Ubuntu 18.04.3
https://github.com/prabindh/deepnotes/blob/master/docker-18.04.3/docker.txt
Output of the nvidia-smi command running in the container should look like below

Blog Post

RAPIDS for the GPU

Rapids support

The Rapids framework is available on Linux, the most recent being version 0.11. Get the framework build matching your preferences via the configurator at,
https://rapids.ai/start.html
The Rapids framework is not available on Windows, and installation shows the error "PackagesNotFoundError: The following packages are not available from current channels". For why Rapids is not available via pip (a manylinux-related issue), read more at https://medium.com/rapids-ai/rapids-0-7-release-drops-pip-packages-47fc966e9472

Recent Post

Parsing Tensorflow API documentation

With API changes happening frequently, it is useful to have the documentation of the installed API readily available. Is it possible to do this without going online?

Since Tensorflow documentation is generated from the code itself, pydoc can be used to perform "man"-like lookups from the terminal, following the steps below.

1. pydoc ships with the Python standard library, so no separate installation is needed.

2. Invoke pydoc as below on the desired API to be looked up

$ pydoc tensorflow.data.Dataset

This results in below display

"Help on class DatasetV2 in tensorflow.data:
tensorflow.data.Dataset = class DatasetV2(tensorflow.python.training.tracking
.base.Trackable, tensorflow.python.framework.composite_tensor
.CompositeTensor) |
tensorflow.data.Dataset(variant_tensor) |
| Represents a potentially large set of elements."
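The same lookup can be done from inside Python, which is handy in notebooks. The sketch below wraps pydoc.render_doc in a hypothetical helper. It uses json.dumps as the example so that it runs without Tensorflow installed; passing "tensorflow.data.Dataset" should work the same way on a machine where Tensorflow is available.

```python
import pydoc


def offline_help(dotted_name):
    """Return pydoc's plain-text help for a dotted name, or None if it cannot be resolved."""
    try:
        return pydoc.render_doc(dotted_name, renderer=pydoc.plaintext)
    except (ImportError, pydoc.ErrorDuringImport):
        return None


text = offline_help("json.dumps")
print(text.splitlines()[0])
```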

Blog Post

Enabling OpenCV 4.1.0 pkg-config

By default, OpenCV 4 no longer generates pkg-config (.pc) files. In 4.1.0 at least, generation can be forced during configure (the CMake option OPENCV_GENERATE_PKGCONFIG=ON), after which pkg-config reports the flags as below.

$ pkg-config --cflags opencv4
-I/usr/include/opencv4/opencv -I/usr/include/opencv4

$ pkg-config --libs opencv4
-lopencv_aruco -lopencv_bgsegm -lopencv_bioinspired -lopencv_ccalib -lopencv_dnn_objdetect -lopencv_dpm -lopencv_face -lopencv_freetype -lopencv_fuzzy -lopencv_gapi -lopencv_hfs -lopencv_img_hash -lopencv_line_descriptor -lopencv_quality -lopencv_reg -lopencv_rgbd -lopencv_saliency -lopencv_stereo -lopencv_stitching -lopencv_structured_light -lopencv_phase_unwrapping -lopencv_superres -lopencv_optflow -lopencv_surface_matching -lopencv_tracking -lopencv_datasets -lopencv_text -lopencv_dnn -lopencv_plot -lopencv_videostab -lopencv_video -lopencv_xfeatures2d -lopencv_shape -lopencv_ml -lopencv_ximgproc -lopencv_xobjdetect -lopencv_objdetect -lopencv_calib3d -lopencv_features2d -lopencv_highgui -lopencv_videoio -lopencv_imgcodecs -lopencv_flann -lopencv_xphoto -lopencv_photo -lopencv_imgproc -lopencv_core
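When a build script needs these flags, the pkg-config output can be consumed programmatically. The Python sketch below uses hypothetical helper names. It splits the output into include directories and library names, and shows how pkg-config could be queried directly. The sample strings are shortened versions of the output above.

```python
import shlex
import subprocess


def split_pkg_config(output):
    """Split pkg-config output into include directories and library names."""
    include_dirs, libraries = [], []
    for token in shlex.split(output):
        if token.startswith("-I"):
            include_dirs.append(token[2:])
        elif token.startswith("-l"):
            libraries.append(token[2:])
    return include_dirs, libraries


def query_pkg_config(package="opencv4"):
    """Run pkg-config for a package (requires pkg-config and the .pc file to be installed)."""
    result = subprocess.run(
        ["pkg-config", "--cflags", "--libs", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


cflags = "-I/usr/include/opencv4/opencv -I/usr/include/opencv4"
libs = "-lopencv_imgproc -lopencv_core"
print(split_pkg_config(cflags + " " + libs))
```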

Blog Post

CUDA, Keras, Tensorflow versions

Current working combination of versions that supports recent research code:

  • keras - 2.2.4
  • CUDA 10.0 + matching CUDNN
  • tensorflow-gpu - 1.13.1
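A quick way to confirm that an environment matches this combination is to compare the installed package versions against the list. The Python sketch below uses importlib.metadata (Python 3.8 or later) and hypothetical helper names. It only covers the Python packages; the CUDA and CUDNN versions would need to be checked separately.

```python
from importlib import metadata

# Versions taken from the list above.
EXPECTED = {"keras": "2.2.4", "tensorflow-gpu": "1.13.1"}


def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def mismatches(expected, lookup=installed_version):
    """Return {package: (expected, found)} for every package whose version differs."""
    found = {name: lookup(name) for name in expected}
    return {
        name: (want, found[name])
        for name, want in expected.items()
        if found[name] != want
    }


print(mismatches(EXPECTED))
```

An empty dictionary means the environment matches; otherwise each entry shows the expected and the found version.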

Post

CenterNet based COCO data-set object detection on Windows

CenterNet uses center-points instead of the typical bounds of a region of interest. Since the default build is on Linux, this post updates the steps for Windows, at https://github.com/prabindh/deepnotes/tree/master/CenterNet. This is derived from CenterNet by xingyizhou et al. Steps and results of the Webcam demo are updated. GPU load is 70% (Quadro1000M) at approx 30 fps using the default Python code. Refer to the GPU-Z logs in the same folder: https://github.com/prabindh/deepnotes/blob/master/CenterNet/centernet-GPU-Z%20Sensor%20Log.txt

Windows port of CenterNet https://github.com/prabindh/deepnotes/tree/master/CenterNet

Post

Labelling and Training to detect capacitors in a PCB with Yolo (and Squeezedet) Deep learning framework in 1 hour

One of the most time consuming tasks in object detection using deep learning frameworks like Yolo or Caffe, is the manual labelling.
This post shows how to perform labelling automatically with euclidaug and complete the detection task using Yolo in under one hour of work (including autolabelling), for a 3-class model of electronic capacitors in a PCB (Printed Circuit Board). Methods for Squeezedet (that uses the KITTI output mode of euclidaug since squeezedet uses KITTI format) are also shown.

https://github.com/prabindh/yolo-bins/tree/master/capacito
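For readers new to the label files involved, the sketch below shows the Darknet/YOLO text format that a labelling step has to produce: one line per object, with the class id followed by the box center and size, all normalized by the image dimensions. This is an illustration in Python with a hypothetical function name and made-up box values, not code from euclidaug.

```python
def to_yolo_label(class_id, box, image_size):
    """Convert a pixel box (x_min, y_min, x_max, y_max) into a Darknet/YOLO label line."""
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    x_center = (x_min + x_max) / 2 / width
    y_center = (y_min + y_max) / 2 / height
    box_w = (x_max - x_min) / width
    box_h = (y_max - y_min) / height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {box_w:.6f} {box_h:.6f}"


# Made-up example: class 2, a 200x100 pixel box in a 640x480 image.
print(to_yolo_label(2, (100, 50, 300, 150), (640, 480)))
```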

Post

Yolo on the Tegra Jetson Nano with CUDNN

Binaries for Yolov3 for the Nvidia Jetson Nano (Tegra), built on the Ubuntu Linux provided in the Jetson Nano Linux image, are now available at the repository

https://github.com/prabindh/yolo-bins

Post

Making (and reverse engineering music) with Tensorflow

Magenta and its applications (music transcribing - https://piano-scribe.glitch.me/) seem interesting for the way the onset events in the music are calculated with an LSTM, and for how the metrics seem much better than the previous state of the art.

https://magenta.tensorflow.org/

Post

C++ Port of Darknet (of YOLO fame) - CUDA and OpenCL

Failures with OpenCV3 when working with C-based DL frameworks like Darknet (made famous by YOLO - http://pjreddie.com/darknet/yolo/) are a common issue.
Here is the latest version of Darknet, ported to C++, fixing many coding bugs along the way. The work primarily involved encapsulating APIs with C linkage, including undefined headers, bug fixes, typecasting various allocations to their actual types, and using the correct error detection types for CUBLAS. It also includes a port to OpenCL by myestro.
For training with own dataset, and detection, refer to the updated README at,

Post

GFX2017 Graphics Workshop completed

GFX2017 Graphics workshop was completed on Apr 29th, 2017. Report at the IEEE Site,
link

Post

Introducing Euclid and Euclidaug, a labeller and augment tool for image-datasets

Euclid is a tool for manual labelling of data-sets, such as those used in Deep learning systems that employ Caffe, Tensorflow, SqueezeDet, and YOLO. It is an object / class labelling tool for machine learning frameworks, with applications in road sign detection, animal detection, retail, and defense machinery. A typical usage is in the IEEE paper "On the Applicability of Deep Learning for Road Signal Recognition", by Vinicios R. Soares et al. - https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8627071
https://github.com/prabindh/euclid/
This tool runs on Linux and Windows, and is based on Python.

Label format support:

Integrating Darknet/Yolo and OpenCV3, with Qt5

Submitted by prabindh on Sun, 01/08/2017 - 19:05
Just added a shared-library port of the latest Darknet/Yolo framework, which enables easy integration into other frameworks like Qt5.
An example Qt5 application with OpenCV3 and Darknet is built in the repository below.
https://github.com/prabindh/qt5-opencv3-darknet

Post

Impact of Qualcomm-NXP-Freescale on the GPU Ecosystem

The proposed Qualcomm-NXP-Freescale merger brings a new dimension in terms of GPU variants in the new entity - we have (1) The Vivante GC2000, GC880 series (IMX5,6), (2) The Adreno (erstwhile) Z series, and (3) Qualcomm's Adreno 3 series, Adreno 4 series, and Adreno 5 series.

How do they compare, and who is going to win? Read more at this LinkedIn post

Post

Khronos Chapter Inaugurated

The Khronos chapter at Bangalore was inaugurated recently with participation from key companies - Samsung, Nvidia, AMD, TI, and many more startups and established companies. Read more at this Samsung page, and in this Khronos page. A panel discussion on how the Khronos chapter can proceed further in the coming years is at this Khronos YouTube link

Post

Origins

GPUPowered.Org was started as a WebGL experiment in 2010-11, when WebGL was still in its early stages. The tutorials setup has been used in various presentations every year. Ref http://ewh.ieee.org/r10/bangalore/ces/