
Computer Vision Landscape : Present and Future

Millions of people all around the world learn with Chegg. Education at Chegg is powered by the depth and diversity of our content, a large part of which is in the form of images. These images may be uploaded by students or by content creators. The text they contain is extracted using a transcription service. Uploaded images are very often noisy, which leads to irrelevant characters or words in the transcribed text. Using object detection techniques, we develop a service that extracts the relevant parts of the image and uses a transcription service to get clean text. In the first part of the presentation, I will talk about building an object detection model using YOLO for cropping and masking images to obtain cleaner text from transcription. YOLO is a deep learning object detection and recognition framework that produces highly accurate results with low latency. In the next part of the presentation, I will talk about building the Computer Vision landscape at Chegg. Starting from images of academic materials that are composed of elements such as text, equations, and diagrams, we create a pipeline for extracting these image elements. Using state-of-the-art deep learning techniques, we create embeddings for these elements to enhance downstream machine learning models such as content quality and similarity.


  1. Computer Vision Landscape: Present and Future. Sanghamitra Deb, Staff Data Scientist, Chegg Inc. Data Day Texas, 2023.
  2. Outline • Images • Enhanced Transcription o Data Story o Computer Vision Model o Metrics o Deployment • Computer Vision Landscape • Image Embeddings
  3. Images. Disclaimer: Images are replicas representing real scenarios.
  4. Enhanced Transcription. The image goes through the Computer Vision Model and then the Transcription Service, which returns: {"text": "Resonant ocean thicknesses at different forcing frequencies. (a) Location of Europa's first three largest resonant rotational-gravity modes as a function of forcing frequency and ocean thickness, for both zonal (m = 0) and sectoral (m = 2) degree-2 modes…"} Reference paper: https://agupubs.onlinelibrary.wiley.com/doi/10.1029/2020GL088317
  5. Data Story. Version 1: • Collect data on cropped images • Build object detection model • Measure performance. (Version 2 and Version 3 are labeled but not yet described.)
  6. Version 1
  7. Data Story. Version 1: • Collect data on cropped images • Build object detection model • Measure performance. Performance: not good enough. Lessons learned: CV models cannot read; unless objects are well defined and distinct, detection has a lot of errors. (Version 2 and Version 3 are labeled but not yet described.)
  8. Data Story. Version 1: • Collect data on cropped images • Build object detection model • Measure performance. Performance: not good enough. Lessons learned: CV models cannot read; unless objects are well defined and distinct, detection has a lot of errors. Version 2: Redefine the problem: detect bounding boxes for • Text • Equations • Diagrams and Charts • UI Elements • Tables. Performance was good. This model is currently in production. (Version 3 is labeled but not yet described.)
  9. Data Story. Version 1: • Collect data on cropped images • Build object detection model • Measure performance. Performance: not good enough. Lessons learned: CV models cannot read; unless objects are well defined and distinct, detection has a lot of errors. Version 2: Redefine the problem: detect bounding boxes for • Text • Equations • Diagrams and Charts • UI Elements • Tables. Performance was good. This model is currently in production. Version 3: Redefine the problem: downstream applications need the text that was getting cropped out. • Header Region • Side Region • Footer Region • Question Region • UI Elements
  10. Data Story. Version 1: • Collect data on cropped images • Build object detection model • Measure performance. Performance: not good enough. Lessons learned: CV models cannot read; unless objects are well defined and distinct, detection has a lot of errors. Version 2: Redefine the problem: detect bounding boxes for • Text • Equations • Diagrams and Charts • UI Elements • Tables. Performance was good. This model is currently in production. Version 3: Redefine the problem: downstream applications need the text that was getting cropped out. • Header Region • Side Region • Footer Region • Question Region • UI Elements • Text • Equations • Diagrams & Charts • Tables
  11. Enhanced Transcription: Version 2. We are extracting bounding boxes for: • Text • Equations • Diagrams and Charts • UI Elements • Tables. Examples shown: Tables, Text.
  12. Enhanced Transcription: Version 2. Examples shown: Equations, UI Elements, Diagrams and Charts.
  13. Building Object Detection Model: Training Pipeline
  14. What is Object Detection?
  15. Metrics: Intersection over Union (IoU). Predictions: bounding boxes (BB) and classification labels. IoU is computed for each bounding box.
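To make the IoU metric on this slide concrete, here is a minimal sketch in plain Python. The box format `(x_min, y_min, x_max, y_max)` and the function name `iou` are assumptions chosen for illustration; the deck does not say which box convention was used.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union = area of A + area of B - overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - intersection
    return intersection / union if union > 0 else 0.0
```

A predicted box that coincides with the ground truth scores 1.0, and a box with no overlap scores 0.0.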
  16. Metrics: mAP@IoU=0.5. Metrics are computed for a given IoU threshold. Changing the IoU threshold can change whether a given prediction counts as a true positive or a false positive. Average precision is computed for each class at a threshold of 0.5; mAP is the mean across all classes. mAP@IoU=0.5 >= 0.8
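The following sketch shows one common way to compute average precision at a fixed IoU threshold and then average it across classes. It is a simplified illustration, not the evaluation code used in the talk: it assumes one image per class, boxes as `(x_min, y_min, x_max, y_max)`, greedy matching of each prediction to the best unmatched ground-truth box, and all-point interpolation of the precision-recall curve. All function names are hypothetical.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(predictions, ground_truths, iou_threshold=0.5):
    """AP for one class. predictions: list of (score, box); ground_truths: list of boxes."""
    if not ground_truths:
        return 0.0
    matched = set()
    hits = []  # True for a true positive, False for a false positive
    for _, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx not in matched and iou(box, gt) > best_iou:
                best_iou, best_idx = iou(box, gt), idx
        is_hit = best_idx is not None and best_iou >= iou_threshold
        if is_hit:
            matched.add(best_idx)
        hits.append(is_hit)

    # Precision and recall after each ranked prediction.
    precisions, recalls, true_positives = [], [], 0
    for rank, is_hit in enumerate(hits, start=1):
        true_positives += int(is_hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths))

    # Make precision non-increasing, then integrate over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(per_class, iou_threshold=0.5):
    """per_class maps class name -> (predictions, ground_truths)."""
    aps = [average_precision(p, g, iou_threshold) for p, g in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0
```

Raising `iou_threshold` turns borderline matches into false positives, which is why the metric is always quoted together with its threshold.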
  17. Collecting Training Data: Labelbox. Retrieve archival images. Create an annotation project. Write an annotation guide. Make sure 5-10% of the data is reviewed for quality checks. Check inter-annotator agreement on a small dataset. Collect labelled data. Do spot checks for annotation quality.
  18. Object Detection Models
  19. Region-based Convolutional Neural Networks (R-CNN). Cons: very slow; propagating thousands of region proposals (RPs) through the CNN and classifier takes a very long time.
  20. Vanishing/Exploding Gradients. Operation: multiplying n small or large numbers to compute gradients of the "front" layers in an n-layer network. When the network is deep, the product of n small numbers approaches zero (the gradient vanishes), and the product of n large numbers becomes too large (the gradient explodes).
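A small numeric illustration of the effect described on this slide. It assumes, purely for illustration, that every layer contributes the same scalar factor to the gradient; real networks multiply Jacobian matrices, so this is only a toy model.

```python
def gradient_scale(per_layer_factor, n_layers):
    """Product of n identical per-layer factors (a toy stand-in for backpropagation)."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale


for factor in (0.5, 1.0, 1.5):
    # Factors below 1 shrink toward zero; factors above 1 grow without bound.
    print(f"factor={factor}: scale after 50 layers = {gradient_scale(factor, 50):.3e}")
```

With a factor of 0.5 the gradient reaching the first layer is effectively zero after 50 layers, while a factor of 1.5 grows by many orders of magnitude.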
  21. ResNet (2015). Right: regular CNN. Left: fit a residual instead of the desired function H(x) directly. A skip (shortcut) connection adds the input x to the output of a few weight layers. Layers can be stacked to be about 150 layers deep.
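The residual idea on this slide can be written out as follows, using the slide's symbol H(x) for the desired mapping. The symbol F for the residual function follows the usual ResNet notation and is an assumption here, since the slide does not name it.

```latex
\begin{align}
  F(x) &:= H(x) - x
    && \text{(the stacked layers learn the residual)} \\
  y &= F(x) + x
    && \text{(the skip connection adds the input back)} \\
  \frac{\partial y}{\partial x} &= \frac{\partial F}{\partial x} + I
    && \text{(the identity term gives gradients a direct path)}
\end{align}
```

The identity term in the last line is why stacking many residual blocks is less prone to the vanishing-gradient problem than stacking plain layers.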
  22. Plain Network vs ResNet
  23. YOLO (You Only Look Once). Unified detection: • Uses features from the entire image for prediction • Predicts bounding boxes across all classes simultaneously • Bounding boxes and classes are predicted in one shot, i.e. by the same network. Figure labels: divide input into grids; class probability map; final detections.
  24. YOLOv5 network
  25. Why YOLO? o Speed: YOLO algorithms run faster than other detection algorithms; the smaller model is able to process 155 frames per second. o Accuracy: state-of-the-art performance on several object detection datasets, including COCO. o Open-source code is available in multiple deep learning frameworks. o Code is well developed and easy to use. Limitations: small objects that are grouped together do not have good recall.
  26. YOLOv5 PyTorch codebase: https://github.com/ultralytics/yolov5. Let's look into the repo. Training: python train.py --img 640 --batch 16 --epochs 3 --data coco128.yaml --weights yolov5s.pt (slide callouts: model size, batch size). Inference: python detect.py --weights yolov5s.pt --source image.jpg
  27. Deployment. Load the PyTorch model and predict bounding boxes. Crop the image with the bounding box output. Send the cropped image to the transcription service. API output: {Transcribed text, Bounding box}. (Slide labels: Version 2, Version 3.)
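The deployment flow on this slide can be sketched as below. This is a hedged illustration only: the image is modeled as a nested list of pixel rows, and `detect` and `transcribe` are hypothetical callables standing in for the PyTorch model and the external transcription service, neither of which is specified in the deck. The output keys `text` and `bounding_box` are likewise assumed names.

```python
def crop(image, box):
    """Return the sub-image inside box = (x_min, y_min, x_max, y_max).

    The image is a list of rows; x_max and y_max are exclusive.
    """
    x_min, y_min, x_max, y_max = box
    return [row[x_min:x_max] for row in image[y_min:y_max]]


def enhanced_transcription(image, detect, transcribe):
    """Detect boxes, crop each region, and transcribe the crops.

    detect(image) -> list of boxes (hypothetical stand-in for the detection model).
    transcribe(patch) -> str (hypothetical stand-in for the transcription service).
    """
    results = []
    for box in detect(image):
        patch = crop(image, box)
        results.append({"text": transcribe(patch), "bounding_box": box})
    return results
```

Passing the model and the service in as callables keeps the sketch testable with stubs, which also matches the load-testing and I/O-contract concerns raised elsewhere in the deck.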
  28. Measuring the Effectiveness of Enhanced Transcription. Annotation task (Labelbox): Which transcription is better?
  29. Improves Coverage. When the entire image was sent to the transcription service, more than 5% of the images returned "no content found". Cropping the image using object detection removes low-quality surrounding elements, which recovers the transcription for 2.7% of images.
  30. Computer Vision Landscape
  31. Diagram Embeddings. Example labels: Pulley diagram; Newton's second law; Friction acceleration; Moment of Inertia. Extract diagram embeddings from pre-trained models such as ResNet. Use cases: • Similarity-based applications such as recommendation systems • Converting general predictive models into multimodal models with text, image, and structured data features • Categorizing diagrams and creating a diagram ontology to create rich metadata.
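As an illustration of the similarity use case, the sketch below ranks stored diagram embeddings by cosine similarity to a query embedding. The choice of cosine similarity, the function names, and the toy vectors are all assumptions; in practice the vectors would come from a pre-trained network such as ResNet and have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query, catalog):
    """Return catalog names sorted from most to least similar to the query embedding."""
    return sorted(catalog, key=lambda name: cosine_similarity(query, catalog[name]), reverse=True)


# Toy 3-dimensional embeddings, made up for illustration only.
catalog = {
    "pulley_diagram": [0.9, 0.1, 0.0],
    "friction_diagram": [0.7, 0.3, 0.1],
    "circuit_diagram": [0.0, 0.2, 0.9],
}
print(most_similar([1.0, 0.0, 0.0], catalog))
```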
  32. Takeaways o Computer vision models can see, but they cannot read. o Doing a deep dive on metrics before building the model is good practice. o YOLO performs well out of the box. It's open source and readily available, with very low latency. o Building a service that combines outputs from external vendors requires careful load testing. o Having a vision beyond immediate deliverables creates avenues for overall enrichment of ML products.
  33. Thank You. @sangha_deb, sdeb@chegg.com
  34. References • Computer Vision Models: https://medium.com/augmented-startups/top-6-object-detection-algorithms-b8e5c41b952f and https://www.v7labs.com/blog/yolo-object-detection#h1 • mAP: https://towardsdatascience.com/map-mean-average-precision-might-confuse-you-5956f1bfa9e2 • R-FCN: https://arxiv.org/pdf/1605.06409.pdf • YOLOv5: https://arxiv.org/pdf/2108.11539.pdf

Editor's Notes

  • Classification and localization: done using regression.
  • Selective search: extract several thousand region proposals.
    Each of these region proposals (RPs) is labeled with a class and a ground-truth bounding box.
    A pre-trained CNN is used to extract features for the region proposals through forward propagation.
    These features are used to predict the class and bounding box of each region proposal using SVMs and linear regression.

    ROI pooling is followed by fully connected (FC) layers for classification and bounding box regression. The FC layers after ROI pooling are not shared among different ROIs and take time. This makes R-CNN approaches slow, and the fully connected layers have a large number of parameters.

    Fast R-CNN performs the CNN forward propagation once on the entire image.
    Faster R-CNN reduces the total number of region proposals by using a region proposal network (RPN) instead of selective search to further improve the speed.

  • YOLO reasons globally about the full image …

    YOLO models treat object detection as a regression problem. The image is divided into an S × S grid, and for each grid cell the model predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
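The tensor shape in this note can be checked with a few lines of arithmetic. The factor 5 is the four box coordinates plus one confidence score per box. The helper names below are hypothetical, and the example values S = 7, B = 2, C = 20 are the ones used in the original YOLO paper for PASCAL VOC, not values from this talk.

```python
def yolo_output_shape(s, b, c):
    """Shape of the YOLO prediction tensor: S x S x (B * 5 + C).

    Each of the B boxes per cell carries 5 numbers: x, y, w, h, confidence.
    """
    return (s, s, b * 5 + c)


def responsible_cell(x_center, y_center, s):
    """Grid cell (row, col) containing a box center given in normalized [0, 1] coordinates."""
    col = min(s - 1, int(x_center * s))
    row = min(s - 1, int(y_center * s))
    return (row, col)


print(yolo_output_shape(7, 2, 20))      # (7, 7, 30)
print(responsible_cell(0.5, 0.5, 7))    # (3, 3)
```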
  • The details of the architecture are beyond the scope of this presentation. YOLOv5 has improvements in data augmentation compared to previous models. ResNet is one of the backbones used in the architecture for extracting features. Transformers are used in the prediction head. Predictions from multiple heads are combined using techniques such as non-max suppression to predict the bounding boxes.
    Additionally, a ResNet model is trained using image patches cropped from the training data as a classification training set.
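Non-max suppression, mentioned in the note above, can be sketched as a greedy filter over scored boxes. This is a generic, class-agnostic version written for illustration; it is not the YOLOv5 implementation, and the box format `(x_min, y_min, x_max, y_max)` and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """IoU of two (x_min, y_min, x_max, y_max) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the highest-scoring box in each cluster of overlapping boxes.

    detections: list of (score, box). Returns the kept (score, box) pairs,
    ordered by descending score.
    """
    kept = []
    for score, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept
```

Two heavily overlapping boxes around the same region collapse to the one with the higher score, while a distant box is left untouched.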
  • Test for I/ contract
    Send images that have no text and check for the output.

    Make sure there is logging
