Background

MobileDrive

YOLOv5

Backbone: Extracts feature maps at different resolutions
Head: Takes the feature maps as input and predicts bounding boxes and confidence scores

Model optimization pipeline

Perform NAS and Pruning on YOLOv5’s backbone architecture
Map optimized backbone architecture back to YOLOv5
Apply KD and Quantization on the whole network

NAS

Goal: Automatically search for the network architecture that gives the best performance

Architecture:

Blocks: bottleneck block, inception block, residual block, etc.
Layers: convs, pooling, fc, etc.
Hyperparameters: number of filters, size of kernel, stride, padding, etc.

Search space: The set containing all the possible architectures

One Shot NAS

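SuperNet: A single large network that contains every candidate in the search space
Pros: Much faster than training each candidate from scratch, since candidates share the SuperNet’s weights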

Pruning

Channel Pruning

Goal: Remove the less important channels while minimizing the accuracy loss
Ranking algorithm: Ranks the importance of every channel

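import numpy as np

# `model` is assumed to be an existing Keras model whose layer 1 is a Conv2D
# layer with a bias; the kernel shape is (kh, kw, in_channels, filters).
conv_weights, conv_bias = model.layers[1].get_weights()

# Calculate the L2 norm of each filter.
# np.linalg.norm only accepts an int or a 2-tuple for `axis`, so sum explicitly.
filter_l2_norms = np.sqrt(np.sum(conv_weights ** 2, axis=(0, 1, 2)))

# Determine a threshold to prune the least important filters
threshold = np.percentile(filter_l2_norms, 10)  # Prune the bottom 10%

# Create a mask to determine which filters to keep
filters_to_keep = filter_l2_norms > threshold

# Zero out the pruned filters. The layer's weight shapes are fixed, so the
# filters are masked rather than removed; physically shrinking the layer
# would require rebuilding the model with fewer filters.
pruned_weights = conv_weights * filters_to_keep
pruned_bias = conv_bias * filters_to_keep

# Replace the original weights with the masked weights
model.layers[1].set_weights([pruned_weights, pruned_bias])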

KD

Teacher: large, accurate model
Student: small, efficient model
Goal: Improve the student’s accuracy (better than training from scratch)
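
The notes do not say which distillation loss was used. As a sketch, the widely used soft-target formulation mixes a temperature-softened KL term between teacher and student with the usual cross-entropy on the true label. The `temperature` and `alpha` values and the toy logits below are assumptions chosen for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps)
             for pt, ps in zip(p_teacher, p_student) if pt > 0)
    hard = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * hard

# Toy logits for a 3-class problem (illustrative values only).
teacher = [5.0, 1.0, 0.5]
good_student = [4.0, 1.0, 0.0]
bad_student = [0.0, 3.0, 1.0]

loss_good = distillation_loss(good_student, teacher, label=0)
loss_bad = distillation_loss(bad_student, teacher, label=0)
print(loss_good, loss_bad)
```

A student whose logits resemble the teacher's gets a lower loss than one that disagrees with it, which is the signal that training from scratch lacks.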

Paper

Problem

Most existing methods lack attention to affective meaning in group dynamics and fail to account for the contextual relevance of faces and objects in group-level images.

Motivation

Using all MIPs produces worse results.
Using only the MIP is not enough (it does not tell the whole story).
Not all patches are needed. Complex backgrounds (crowded images) require removing uninformative tokens.

Proposal

First work to introduce MIP into the group affect task and to validate that MIP plays a crucial role in group affect recognition.
The MIP and global affective context information are integrated into the proposed dual-pathway ViT architecture.

Method

Dual-pathway learning

Both the global image and the MIP image are tokenized into patches.
A class token and a learnable position embedding are then added in both branches.
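
The two steps above can be sketched in plain Python. This is a simplified illustration, not the paper's implementation: images are single-channel, the linear projection of each patch is omitted, the patch sizes (4 and 2) are toy values, and the class token and position embedding are fixed placeholders rather than learned parameters.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size) for dx in range(patch_size)]
            patches.append(patch)
    return patches

def build_tokens(image, patch_size, class_token, position_embedding):
    """Prepend the class token, then add one position vector per token."""
    tokens = [class_token] + patchify(image, patch_size)
    assert len(tokens) == len(position_embedding)
    return [[t + p for t, p in zip(token, pos)]
            for token, pos in zip(tokens, position_embedding)]

# Toy inputs: an 8x8 "global" image and a 4x4 "MIP" crop.
global_image = [[float(8 * r + c) for c in range(8)] for r in range(8)]
mip_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Coarse patches for the global branch, finer patches for the MIP branch.
global_tokens = build_tokens(global_image, 4, class_token=[0.0] * 16,
                             position_embedding=[[0.5] * 16 for _ in range(5)])
mip_tokens = build_tokens(mip_image, 2, class_token=[0.0] * 4,
                          position_embedding=[[0.5] * 4 for _ in range(5)])
print(len(global_tokens), len(mip_tokens))
```

Each branch ends up with one class token followed by its patch tokens, and the token dimension differs between branches because the patch sizes differ.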

Transformer Encoder

The global image is the input of the large (primary) branch, which has a coarse patch size, a larger embedding size, and more transformer encoders.
The MIP image is the input of a small (complementary) branch with fine-grained patch size (i.e., 16), fewer encoders, and a smaller embedding size.

Token Ranking Module

Removes unimportant patches.
Token importance is defined as the similarity score between the global class token and each patch token.
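
A minimal sketch of this ranking step follows. The notes do not specify the similarity measure, so cosine similarity is used here as an assumption, and the `keep` count and toy vectors are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def rank_tokens(class_token, patch_tokens, keep):
    """Return the indices of the `keep` most important patches, in original order."""
    scores = [cosine_similarity(class_token, t) for t in patch_tokens]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:keep]
    return sorted(top), scores

# Toy 2D tokens (illustrative values only).
class_token = [1.0, 0.0]
patch_tokens = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]

kept, scores = rank_tokens(class_token, patch_tokens, keep=2)
print(kept, scores)
```

Patches that point in roughly the same direction as the class token survive, and the rest are dropped before further processing.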

CPA

Based on the importance score of each token, we then construct a newly selected query matrix by selecting the top query vectors.
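
The selection step can be sketched as follows. How the selected queries are used afterwards is an assumption here: they are fed through standard scaled dot-product attention over all keys and values, and every number below is a toy value.

```python
import math

def select_top_queries(queries, importance, k):
    """Keep the k query rows with the highest importance, preserving order."""
    top = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)[:k]
    keep = sorted(top)
    return [queries[i] for i in keep], keep

def attention(queries, keys, values):
    """Standard scaled dot-product attention over plain lists."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        outputs.append([sum(w / total * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
importance = [0.2, 0.9, 0.5]  # hypothetical per-token importance scores

selected, kept = select_top_queries(queries, importance, k=2)
out = attention(selected, keys, values)
print(kept, out)
```

Only the selected rows produce outputs, so the attention cost scales with the number of kept queries rather than with the full token count.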

Class Token and Positional Encoding

Intro to Class token and positional embedding
Why use class token?

Presentation slide

CT MRI Volume Rendering

Getting Started with Volume Rendering using OpenGL - CodeProject

Store multiple textures at different positions along the z axis.
Apply transparency: set alpha to 0 for points whose alpha is below the threshold.
Enable blending and disable the depth test.
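
The three steps above can be mimicked on the CPU to see what the blending computes. This sketch assumes the usual "over" blend (the arithmetic behind `glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA)`) and slices drawn from farthest to nearest; the threshold and sample values are illustrative.

```python
def apply_alpha_threshold(slices, threshold):
    """Make samples fully transparent when their alpha is below the threshold."""
    return [[(value, alpha if alpha >= threshold else 0.0) for value, alpha in s]
            for s in slices]

def composite_back_to_front(slices):
    """Blend slices from farthest to nearest: out = a * src + (1 - a) * dst."""
    num_pixels = len(slices[0])
    image = [0.0] * num_pixels
    for s in slices:  # slices[0] is the farthest from the viewer
        for i, (value, alpha) in enumerate(s):
            image[i] = alpha * value + (1.0 - alpha) * image[i]
    return image

# Three slices of a 2-pixel "image"; each sample is (intensity, alpha).
slices = [
    [(1.0, 1.0), (0.2, 0.05)],   # farthest slice
    [(0.5, 0.5), (0.8, 0.5)],
    [(0.0, 0.02), (1.0, 0.25)],  # nearest slice
]

rendered = composite_back_to_front(apply_alpha_threshold(slices, threshold=0.1))
print(rendered)
```

Because the result depends on the order in which slices are blended, the depth test is not what decides visibility here; the draw order is.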

Rotation issue

When the model rotates about the z axis, the image appears upside down.
When the model is rotated by 90 or 270 degrees, the image disappears because there are not enough textures.

Neural Radiance Field

Method

Represent a scene using a fully connected (non-convolutional) deep network.
The input is a single continuous 5D coordinate: spatial location (x, y, z) and camera viewing direction (θ, φ).
The output is the volume density and the view-dependent emitted radiance (r, g, b) at that spatial location.
Query 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image.
Volume density does not depend on the viewing direction, only on the coordinates, so the network runs in two passes. The first pass receives the coordinates and outputs the volume density and an intermediate value for RGB. The second pass receives the intermediate value and the viewing direction and produces the final RGB.
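
The volume rendering step can be sketched for a single ray. In the real method the densities and colors come from the network; here they are hard-coded toy values, and `deltas` (the distance between adjacent samples) is likewise an assumption. Each sample contributes its color weighted by its own opacity and by the transmittance, i.e. the fraction of light not yet absorbed by nearer samples.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered from near to far."""
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space first, then a dense red sample that hides the green one behind it.
densities = [0.0, 50.0, 50.0]
colors = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
deltas = [0.5, 0.5, 0.5]

pixel, remaining = render_ray(densities, colors, deltas)
print(pixel, remaining)
```

The zero-density sample adds nothing regardless of its color, and the first opaque sample dominates the pixel, which is the behavior expected from occlusion.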

Ways to improve synthesis of novel angles

Add noise during training (e.g., input coordinate perturbation, point sampling).
Utilize image features (embedded by CNN networks).
Model pretrained on similar categories.

Coach AI

YouTube

Model Selection: Random forests are versatile and robust classifiers that can handle complex relationships between features and labels. They are less prone to overfitting and can handle high-dimensional data well, which is common in pose prediction tasks with many key points. Additionally, random forests can provide feature importances, which can help understand which key points are most informative for predicting specific poses.