INTEGRATION OF SEGMENTATION, TRACKING AND CLASSIFICATION MODELS TO SOLVE VIDEO ANALYTICS PROBLEMS

  • A.E. Arkhipov, Russian State Scientific Center for Robotics and Technical Cybernetics (RTC)
  • I.S. Fomin, Russian State Scientific Center for Robotics and Technical Cybernetics (RTC)
  • V.D. Matveev, Russian State Scientific Center for Robotics and Technical Cybernetics (RTC)
Keywords: Neural networks, segmentation, tracking, classification, video analytics, computer vision

Abstract

The integration of several models into a single technical vision system makes it possible to solve more complex tasks. In particular, for mobile robotics and unmanned aerial vehicles (UAVs), the lack of datasets covering diverse operating conditions is a pressing problem. This work proposes the integration of three models as a solution: segmentation, tracking, and classification. The segmentation model extracts arbitrary objects from frames, which makes it applicable in non-deterministic and dynamic environments. The classification model identifies the objects needed for navigation or other purposes, which are then followed by the tracking model. The paper describes an algorithm for aggregating these models. In addition to the models themselves, a key element is the correction of model predictions, which enables sufficiently reliable segmentation and tracking of various objects. The prediction-correction procedure solves the following tasks: adding new objects to track, validating segmented object masks, and refining the tracked masks. The versatility of the solution is confirmed by its operation in difficult conditions, for example, underwater footage or images from UAVs. An experimental study of each model was carried out outdoors and indoors. The datasets used make it possible to assess the applicability of the models to mobile robotics tasks, that is, to detecting possible obstacles in the robot's path, such as a curb, as well as moving objects such as a person or a car. The models demonstrated sufficiently high quality: for most classes, scores exceeded 80% on various metrics. The main errors are related to object size. The experiments clearly demonstrate the versatility of the solution without additional training of the models. Additionally, a study of performance on a personal computer with various input parameters and resolutions was conducted. Increasing the number of models significantly increases the computational load, and the system does not reach real-time performance. Therefore, one direction of further research is increasing the speed of the system.
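
The aggregation algorithm described above can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the segment, classify, and tracker interfaces, the class list, the thresholds, and the segmentation period are hypothetical placeholders, not the authors' implementation. The sketch shows the three roles the abstract describes: the segmentation model proposes arbitrary masks, the classification model validates them against the classes of interest, and the tracking model propagates accepted masks between frames; the prediction-correction step either refines an overlapping tracked mask or adds a new object to track.

```python
# Minimal sketch of the described model-aggregation loop.
# All interfaces are hypothetical placeholders, not the authors' code:
#   segment(frame)        -> list of candidate object masks (SAM-like model)
#   classify(frame, mask) -> (label, confidence) for the masked region
#   tracker               -> propagates accepted masks between frames
import numpy as np

RELEVANT = {"person", "car", "curb"}   # classes of interest named in the abstract
CONF_THRESHOLD = 0.5                   # assumed acceptance threshold
SEG_PERIOD = 10                        # assumed: run full segmentation every N frames

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def process_stream(frames, segment, classify, tracker):
    for i, frame in enumerate(frames):
        # 1. Propagate already-tracked objects to the current frame.
        tracked = tracker.propagate(frame)          # {obj_id: mask}

        if i % SEG_PERIOD == 0:
            # 2. Segment arbitrary objects and classify each candidate mask.
            for mask in segment(frame):
                label, conf = classify(frame, mask)
                if label not in RELEVANT or conf < CONF_THRESHOLD:
                    continue                        # mask-validation step
                # 3. Prediction correction: refine an existing track if the
                #    new mask overlaps it strongly, otherwise add a new object.
                match = max(tracked.items(),
                            key=lambda kv: iou(kv[1], mask),
                            default=None)
                if match and iou(match[1], mask) > 0.5:
                    tracker.update_mask(match[0], mask)   # refine tracked mask
                else:
                    tracker.add_object(mask, label)       # add new object to track

        yield tracker.current_objects()
```

Running segmentation only every SEG_PERIOD frames, with the tracker carrying objects in between, is one plausible way to reconcile the heavy per-frame cost of three models with the real-time constraint noted in the abstract.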

Published: 2024-04-16
Section: SECTION IV. TECHNICAL VISION