Skill Level: Beginner


  1. Why Deep Learning in AI?

    ImageNet challenge: It is the Olympics of computer vision. Every year, researchers attempt to classify images into one of 1,000 possible classes, given a training dataset of approximately 1.2 million images.
    The goal of the competition is to push the state of the art in computer vision to rival the accuracy of human vision itself (approximately 95–96%).
    In 2012, Alex Krizhevsky at the University of Toronto did the unthinkable. Pioneering a deep learning architecture known as a convolutional neural network for the first time on a challenge of this size and complexity, he blew the competition out of the water. The runner-up in the competition scored a commendable 26.1% error rate. But AlexNet, over the course of just a few months of work, completely crushed 50 years of traditional computer vision research with an error rate of approximately 16%.

  2. What are the Deep Learning Tools?

    PyTorch is a machine learning and deep learning framework developed by Facebook’s artificial intelligence division. It is widely used for large-scale image analysis, including object detection, segmentation, and classification.
    Other available tools are TensorFlow (developed by Google), Theano (by the University of Montreal), Caffe, Neon, and Keras.

    When Google announced TensorFlow 2.0 in June 2019, it declared that Keras is now the official high-level API of TensorFlow for quick and easy model design and training.

    In 2019, most research used PyTorch, while most production systems used TensorFlow.

  3. TensorFlow vs PyTorch

    • In TensorFlow 1.x, we have to define the tensors, initialize a session, and keep placeholders for the tensor objects; we do not have to do any of this in PyTorch.

    • Consider sentiment analysis as an example, where input sentences are tagged as positive or negative. If the sentences differ in length, then in TensorFlow we set a maximum sentence length and pad the shorter sentences with zeros so that the recurrent neural network can process them. PyTorch builds its graph dynamically and ships utilities for padding and packing variable-length sequences, so we do not have to fix the sentence length up front.

    • Debugging is much easier and simpler in PyTorch, but it is a difficult task in TensorFlow.
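
    The zero-padding step described in the sentiment analysis bullet can be sketched in plain Python. This is only an illustration of the idea, not the code of either framework; the helper name pad_sentences and the token ids are made up for the example.

```python
def pad_sentences(sentences, pad_value=0):
    """Pad every list of token ids with pad_value up to the longest length."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_value] * (max_len - len(s)) for s in sentences]


# Hypothetical token ids for three sentences of different lengths.
batch = [[4, 9, 2], [7, 1], [3, 8, 5, 6]]
padded = pad_sentences(batch)

for row in padded:
    print(row)
```

    After padding, every row has the same length, so the batch can be stored as one rectangular array and fed to a recurrent network.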


    As of 2019, PyTorch leads in research and is trying to extend this success to industry. TensorFlow is trying to stem its losses in the research community without sacrificing too much of its production capabilities.


    Why does industry use TensorFlow instead of PyTorch?

    No Python. Some companies will run servers for which the overhead of the Python runtime is too much to take.
    Mobile. You can’t embed a Python interpreter in your mobile binary.
    Serving. A catch-all for features like no-downtime updates of models, switching between models seamlessly, batching at prediction time, etc.

    TensorFlow was built specifically around industry requirements, and has solutions for all these issues: the graph format and execution engine natively have no need for Python, and TensorFlow Lite and TensorFlow Serving address mobile and serving considerations respectively.


    How do TensorFlow and PyTorch address their weaknesses?

    PyTorch introduced a JIT compiler that supports deploying PyTorch models in C++ without a Python dependency, and it also announced support for both quantization and mobile.

    TensorFlow moved to eager mode in v2.0. At the API level, TensorFlow eager mode is essentially identical to PyTorch’s eager mode.
    This gives TensorFlow most of the advantages of PyTorch’s eager mode (ease of use, debuggability, etc.).
    However, this also gives TensorFlow the same disadvantages: TensorFlow eager models can’t be exported to a non-Python environment, they can’t be optimized, they can’t run on mobile, etc.
    This puts TensorFlow in the same position as PyTorch.
    In addition, TensorFlow eager mode still suffers heavily from performance and memory issues at the time of writing.


  4. How to install PyTorch?

    pip3 install torch torchvision


    pip3 install torch==1.3.0+cpu torchvision==0.4.1+cpu -f https://download.pytorch.org/whl/torch_stable.html



    To get the latest installation command, visit https://pytorch.org/

    In the selector on that page, choose your OS and Python version to get the installation command.


  5. Deep Learning core concepts

    Neuron General Function

    Let’s reformulate the inputs as a vector x = [x1 x2 … xn] and the weights of the neuron as w = [w1 w2 … wn].
    Then we can re-express the output of the neuron as

    y = f(x · w + b), where b is the bias term and f is the activation function.

    In order to learn complex relationships, we need to use neurons that employ some sort of nonlinearity. Three major types of neurons (activation functions) are used in practice to introduce nonlinearities in their computations: sigmoid, tanh, and ReLU neurons.

    Activation function (or nonlinearity)
    1) Sigmoid neurons: S-shaped nonlinearity that takes a real-valued input and produces an output in the range 0 to 1

    σ(x) = 1 / (1 + exp(−x))

    2) Tanh neurons: S-shaped nonlinearity that takes a real-valued input and produces an output in the range −1 to 1

    tanh(x) = 2σ(2x) − 1

    3) ReLU (Rectified Linear Unit) neurons: take a real-valued input and threshold it at zero (replacing negative values with zero)

    f(x) = max(0, x)
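
    The neuron formula and the three activation functions above can be tried out with a few lines of plain Python. This is a minimal sketch using only the standard library; the input, weight, and bias values are arbitrary numbers chosen for the example.

```python
import math


def sigmoid(z):
    # S-shaped curve with outputs between 0 and 1
    return 1.0 / (1.0 + math.exp(-z))


def tanh(z):
    # Uses the identity tanh(z) = 2 * sigmoid(2z) - 1 from the text
    return 2.0 * sigmoid(2.0 * z) - 1.0


def relu(z):
    # Thresholds the input at zero
    return max(0.0, z)


def neuron(x, w, b, f):
    # y = f(x . w + b): weighted sum of the inputs plus bias, then activation
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return f(z)


x = [1.0, -2.0, 0.5]   # example inputs
w = [0.4, 0.3, -0.2]   # example weights
b = 0.1                # example bias

for f in (sigmoid, tanh, relu):
    print(f.__name__, neuron(x, w, b, f))
```

    Here the weighted sum is slightly negative, so the ReLU neuron outputs 0 while the sigmoid and tanh neurons output values just below 0.5 and 0 respectively.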

  6. Softmax Output Layers

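    A softmax output layer produces a probability for every output label, showing how confident we are in each prediction. The output of each neuron depends on the outputs of all the other neurons in the layer, because the probabilities must sum to 1.
    So, if the question is whether the input image contains a dog or a cat,
    the softmax layer at the end may answer with 0.9 cat, 0.1 dog.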
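
    As a rough illustration of the cat/dog example, the sketch below computes a softmax over two raw scores. The scores 2.2 and 0.0 are invented so that the result lands near 0.9 and 0.1; they do not come from a real model.

```python
import math


def softmax(scores):
    # Subtract the maximum first for numerical stability; the result is unchanged.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cat", "dog"]
probs = softmax([2.2, 0.0])   # hypothetical raw scores from the last layer

for label, p in zip(labels, probs):
    print(label, round(p, 2))
```

    Raising one score lowers the probability of every other label, which is why each output depends on all the others.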

  7. Gradient Descent

    In a neural network, how exactly do we figure out the weights for each node?
    This is accomplished by training.


    t(i) is the true answer for the (i)th training example
    y(i) is the value computed by the neural network

    We want to minimize the value of the error function E.



    E is zero when our model makes a perfectly correct prediction on every training example. Moreover, the closer E is to 0, the better our model is.
    As a result, our goal will be to select our parameter vector θ (the values for all the weights in our model) such that E is as close to 0 as possible.

    We use the gradient descent algorithm to minimize the squared error over all of the training examples.
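
    A minimal sketch of gradient descent follows. It assumes the common squared-error form E = 1/2 Σ (t(i) − y(i))², a single linear neuron y = w·x + b, and a tiny made-up dataset; the learning rate and the number of steps are arbitrary choices for the example.

```python
# Toy training set generated from t = 2x + 1 (made up for this example).
xs = [0.0, 1.0, 2.0, 3.0]
ts = [1.0, 3.0, 5.0, 7.0]


def error(w, b):
    # Assumed error function: E = 1/2 * sum over examples of (t - y)^2
    return 0.5 * sum((t - (w * x + b)) ** 2 for x, t in zip(xs, ts))


w, b = 0.0, 0.0        # parameter vector theta = (w, b)
learning_rate = 0.05

for step in range(2000):
    # Partial derivatives of E with respect to w and b
    grad_w = sum(-(t - (w * x + b)) * x for x, t in zip(xs, ts))
    grad_b = sum(-(t - (w * x + b)) for x, t in zip(xs, ts))
    # Move a small step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w, b, error(w, b))
```

    Each step moves the parameters a little in the direction that lowers E, so w and b approach the values that generated the data and E approaches 0.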

    However, the gradient descent algorithm may not find the best solution if the error surface has many local minima.

    Do we have to overcome local minima and find the true global minimum?
    No, in most cases there is no need to overcome the local minimum problem.
    However, if your network is stuck in a bad local minimum, then you need to tune your hyperparameters. You could try some of the following methods:

    1) Increasing the learning rate: If the learning rate of your algorithm is too small, it is more likely to get stuck in a local minimum.
    2) Increasing hidden layers/units: It may help approximate the function better.
    3) Trying different activation functions: Make sure that the combination of activation functions is apt for your model and dataset.
    4) Trying different optimization algorithms: Instead of conventional gradient descent, try algorithms such as Adam, RMSProp, Adagrad, Adadelta, or SGD.

  8. Difference between Back-propagation and Feed-forward Neural Network

    – Feed-forward is the algorithm that calculates the output vector from the input vector.
    – Back-propagation is the algorithm that adjusts the weights of the neural network.

    During training, all types of neural networks use both the feed-forward and back-propagation algorithms.
    In production, only the feed-forward pass is required; back-propagation is optional and is needed only if the model keeps learning after deployment.
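
    The split between the two algorithms can be sketched on a deliberately tiny network: one input, one sigmoid hidden neuron, and one linear output. This is an illustrative toy, assuming a squared error E = 1/2 (t − y)² on a single made-up training example; real frameworks compute the backward pass automatically.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    # Feed-forward: compute the output from the input
    h = sigmoid(w1 * x)   # hidden activation
    y = w2 * h            # linear output
    return h, y


def backward(x, t, w1, w2):
    # Back-propagation: apply the chain rule from the error back to each weight
    h, y = forward(x, w1, w2)
    dE_dy = y - t                               # derivative of 1/2 * (t - y)^2
    grad_w2 = dE_dy * h
    grad_w1 = dE_dy * w2 * h * (1.0 - h) * x
    return grad_w1, grad_w2


x, t = 1.5, 0.8        # one made-up training example
w1, w2 = 0.3, -0.4     # arbitrary starting weights

for _ in range(500):   # training uses both passes
    g1, g2 = backward(x, t, w1, w2)
    w1 -= 0.5 * g1
    w2 -= 0.5 * g2

_, y = forward(x, w1, w2)   # prediction uses the forward pass only
print(y)
```

    Once training is finished, only forward is called to make predictions; backward is needed again only if the weights should keep changing.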

    Feed Forward Neural Networks
    Back-propagation is used during training time only.
    In these types of neural networks, information flows in only one direction, i.e. from the input layer to the output layer.
    Once the weights are decided, they are not usually changed.
    The nodes do their job without being aware of whether the results produced are accurate or not (i.e. they don’t re-adjust according to the result produced).
    There is no communication back from the layers ahead.
    Feed Forward Limitations:
    – Can’t handle sequential data
    – Considers only the current input
    – Can’t memorize previous inputs

    Recurrent Neural Networks (Back-Propagating)
    Like feed-forward networks, they are trained with back-propagation: information passes from the input layer to the output layer to produce a result, and the error in the result is then communicated back to previous layers.
    Nodes get to know how much they contributed to the answer being wrong. Weights are re-adjusted. The neural network is improved. It learns.
    Unlike feed-forward networks, they also feed their own previous output back in as input at every step, both during training and in production, so there is a bi-directional flow of information.

  9. Popular Neural Networks

    • Multilayer Perceptrons (MLPs) OR Feed Forward Neural Network: Used in general Regression and Classification problems
    • Convolutional Neural Networks (CNNs) : Used for Image Recognition
    • Recurrent Neural Networks (RNNs) : Used for Speech Recognition
    • Deep Belief Network: Used for Cancer Detection
  10. Multilayer Perceptrons (MLPs)

    An MLP is a class of feedforward artificial neural network (ANN).

    It is the classical type of neural network. MLPs are composed of one or more layers of neurons. Data is fed to the input layer, there may be one or more hidden layers providing levels of abstraction, and predictions are made on the output layer, also called the visible layer.

    They are very flexible and can be used generally to learn a mapping from inputs to outputs.

    This flexibility allows them to be applied to other types of data. For example, the pixels of an image can be reduced to one long row of data and fed into an MLP. The words of a document can also be reduced to one long row of data and fed to an MLP. Even the lag observations for a time series prediction problem can be reduced to a long row of data and fed to an MLP.




    Model of a Simple Network

    Use MLPs For:

    • Tabular datasets
    • Classification prediction problems
    • Regression prediction problems


    For example, given a dataset of grayscale images with a standardized size of 32×32 pixels each, a traditional feedforward neural network would require 1,024 input weights (plus one bias) for each neuron in its first layer.
    This is fair enough, but flattening the image matrix of pixels into a long vector of pixel values loses all of the spatial structure in the image. Unless all of the images are perfectly resized, the neural network will have great difficulty with the problem.
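
    The flattening step and the resulting weight count can be checked with a short sketch. The image here is synthetic (its pixel values are just a counting pattern), since only the shapes matter for the argument.

```python
height, width = 32, 32

# Synthetic 32x32 grayscale image stored as a list of rows.
image = [[(r * width + c) % 256 for c in range(width)] for r in range(height)]

# Flatten the 2-D pixel matrix into one long row of values.
flat = [pixel for row in image for pixel in row]

n_weights = len(flat)
print(n_weights, "input weights plus 1 bias per neuron")

# Vertically adjacent pixels (0, 0) and (1, 0) end up `width` positions apart.
print(flat.index(image[0][0]), "and", width)
```

    Pixels that were direct neighbours in the image are far apart in the flat vector, which is the loss of spatial structure mentioned above.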


  11. RNN

    An RNN is a neural network with memory (it can memorize previous inputs to help predict the next one).

    So, an RNN works on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of the layer.
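
    The feedback principle can be sketched with a single recurrent neuron whose previous output is fed back in at the next step. The tanh activation and the weight values are assumptions chosen for the illustration; real RNN layers use weight matrices rather than single numbers.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    # New state depends on the current input AND the previous state
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=0.5, w_h=0.8, b=0.0):
    h = 0.0               # initial state: nothing remembered yet
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


with_signal = run_rnn([1.0, 0.0, 0.0])
without_signal = run_rnn([0.0, 0.0, 0.0])

print(with_signal)
print(without_signal)
```

    The last two inputs are identical in both runs, yet the final states differ, because the first run still remembers the 1.0 it saw at the first step.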

    RNNs can be used in NLP, time series prediction, machine translation, etc.

    RNN types:

    – One-to-One: Known as a vanilla NN. One observation as input is mapped to one output.
    – One-to-Many: One observation as input is mapped to a sequence with multiple steps as output. For example, an image can be converted to the caption “dog catches a ball”.
    – Many-to-One: A sequence of multiple steps as input is mapped to a class or quantity prediction. For example, in sentiment analysis many words are fed in and classified as positive or negative.
    – Many-to-Many: A sequence of multiple steps as input is mapped to a sequence with multiple steps as output. For example, in machine translation many words in the input are mapped to many words in the output.


    However, we may face gradient problems (vanishing or exploding).

    While training an RNN, the slope (the gradient that carries error information back through time) can become either too small or too large, which makes training difficult.

    Exploding Gradient Problem: when the slope is too high
    Vanishing Gradient Problem: when the slope is too small
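
    A rough way to see both problems is to multiply a gradient by the same factor once per time step, which is a simplified stand-in for what happens when an error signal is passed back through many steps of an RNN. The factors 0.5, 1.0, and 1.5 are arbitrary values for the illustration.

```python
def gradient_after(steps, recurrent_factor):
    # Simplified model: the gradient is scaled by the same factor at every step
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_factor
    return grad


for factor in (0.5, 1.0, 1.5):
    print(factor, gradient_after(50, factor))
```

    With a factor below 1 the gradient shrinks towards zero after 50 steps (vanishing); with a factor above 1 it grows to a huge number (exploding).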

    Issues caused by the gradient problem:
    – Long training time
    – Poor performance
    – Bad accuracy


    Solutions for the Exploding Gradient Problem:
    1) Identity Initialization
    2) Truncated Backpropagation
    3) Gradient Clipping
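
    Of the three remedies listed, gradient clipping is the easiest to show in a few lines. The sketch below clips by the overall norm of the gradient vector, which is one common variant; the function name clip_by_norm and the example values are made up for the illustration.

```python
import math


def clip_by_norm(grads, max_norm):
    # Rescale the whole gradient vector if its length exceeds max_norm
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


exploded = [30.0, -40.0]                    # norm is 50
clipped = clip_by_norm(exploded, max_norm=5.0)
print(clipped)
```

    The direction of the gradient is kept and only its length is capped, so a single exploding step cannot throw the weights far away.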


    Solutions for the Vanishing Gradient Problem:
    1) Weight Initialization
    2) Choosing the right activation function
    3) Long Short-Term Memory Network (LSTMs)


    The Long Short-Term Memory (LSTM) network is perhaps the most successful RNN because it overcomes the problems of training a recurrent network and in turn has been used on a wide range of applications.



    A Recurrent Neural Network looks something like this:

  12. CNN (Convolutional Neural Networks, also called ConvNet)

    A CNN is a feed-forward neural network that is generally used for image recognition and object classification.

    It is used for object recognition tasks such as handwritten digit recognition.

    A CNN considers only the current input.

    A CNN has four types of layers, namely the convolution layer, ReLU layer, pooling layer, and fully connected layer. Every layer has its own functionality; together they perform feature extraction and find hidden patterns.
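
    The first three layer types can be sketched in plain Python on a tiny 5×5 image. This is a simplified illustration (one channel, one hand-picked 2×2 filter, no padding or stride options), and the fully connected layer is left out; in a real CNN the filter values are learned.

```python
def conv2d(image, kernel):
    # Slide the kernel over the image and take a weighted sum at each position
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def relu(fmap):
    # Replace negative responses with zero
    return [[max(0, v) for v in row] for row in fmap]


def max_pool2x2(fmap):
    # Keep the strongest response in each 2x2 block
    return [[max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
             for c in range(0, len(fmap[0]) - 1, 2)]
            for r in range(0, len(fmap) - 1, 2)]


# Dark left half, bright right half: a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
kernel = [[-1, 1], [-1, 1]]   # responds to dark-to-bright vertical edges

features = max_pool2x2(relu(conv2d(image, kernel)))
print(features)
```

    The pooled feature map is non-zero only where the edge is, which is the kind of pattern a fully connected layer would then use for classification.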


    Below is an example of what a CNN looks like:



  13. CNN vs RNN Summary

    A CNN is a feed-forward neural network that is generally used for image recognition and object classification,

    while an RNN works on the principle of saving the output of a layer and feeding it back to the input in order to predict the output of the layer.

    A CNN considers only the current input,
    while an RNN considers the current input and also the previously received inputs. It can memorize previous inputs due to its internal memory.

    An RNN can handle sequential data, while a CNN cannot.

    In an RNN, the previous state is fed as input to the current state of the network. RNNs can be used in NLP, time series prediction, machine translation, etc.

    A CNN has four types of layers, namely the convolution layer, ReLU layer, pooling layer, and fully connected layer. Every layer has its own functionality; together they perform feature extraction and find hidden patterns.

    There are 4 types of RNN, namely One-to-One, One-to-Many, Many-to-One, and Many-to-Many
    (because RNNs are designed to work with sequence prediction problems).


    Use CNNs For:

    • Image data
    • Classification prediction problems (document classification / sentiment analysis)
    • Regression prediction problems
    • Text data
    • Time series data
    • Sequence input data



    Use RNNs For:

    • Text data
    • Speech data
    • Classification prediction problems
    • Regression prediction problems
    • Generative models


    Don’t Use RNNs For:

    • Tabular data
    • Image data
  14. COCO Dataset (Common Objects In Context Dataset)

    To train a deep learning network to detect an object, we need lots of pictures of the kinds of objects that we want to detect.
    To save time, we can use one of several public image datasets that already exist.
    There’s a popular dataset called COCO (short for Common Objects In Context) that has images annotated with object masks.
    This dataset contains more than 200,000 labeled images.


    A model trained on COCO knows how to detect 80 different common objects. Here is the full list of them (the first entry, 'BG', stands for the background class):

    class_names = ['BG', 'person', 'bicycle', 'car', 'motorcycle', 'airplane',
    'bus', 'train', 'truck', 'boat', 'traffic light',
    'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird',
    'cat', 'dog', 'horse', 'sheep', 'cow', 'elephant', 'bear',
    'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie',
    'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
    'kite', 'baseball bat', 'baseball glove', 'skateboard',
    'surfboard', 'tennis racket', 'bottle', 'wine glass', 'cup',
    'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
    'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
    'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed',
    'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote',
    'keyboard', 'cell phone', 'microwave', 'oven', 'toaster',
    'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors',
    'teddy bear', 'hair drier', 'toothbrush']
  15. YOLO3 for Image processing

    CNN Limitation

    A plain CNN (2012) goes pixel by pixel to detect an object and has to scan the same image multiple times to detect all objects, which consumes a lot of time.

    CNN-based detection has been improved over the years: R-CNN (2013), Fast R-CNN (2015), Faster R-CNN (2015), and Mask R-CNN (2017)
    [Mask R-CNN extends the Faster R-CNN techniques and aims to locate the exact pixels of each object instead of just bounding boxes]

    While the R-CNN family tends to be very accurate, its biggest problem is speed:
    these networks were incredibly slow, obtaining only 5 FPS on a GPU.


    To help increase the speed of deep learning-based object detectors, YOLO (2015) uses a one-stage detector strategy.
    The YOLO algorithm treats object detection as a regression problem, taking a given input image and simultaneously learning bounding box coordinates and corresponding class label probabilities.

    YOLO2 is capable of detecting over 9,000 object classes and can obtain 45 FPS on a GPU.

    YOLO2 is able to detect such a large number of object classes by performing joint training for both object detection and classification. Using joint training, the authors trained the model simultaneously on both the ImageNet classification dataset and the COCO detection dataset. The result is a YOLO model, called YOLO9000, that can predict detections for object classes that don’t have labeled detection data.

    While YOLO2 can detect 9,000 separate classes, the accuracy is not quite what we would desire.

    YOLO3 (2018):

    A newer deep learning approach that combines the accuracy of CNNs with clever design and efficiency tricks that greatly speed up the detection process. It runs relatively fast (on a GPU), as long as we have a lot of training data to train the model.

    YOLO object detection algorithm
    YOLO takes an image and splits it into an SxS grid; within each grid cell we take m bounding boxes. For each bounding box, the network outputs a class probability and offset values for the bounding box. The bounding boxes whose class probability is above a threshold value are selected and used to locate the object within the image.




    Limitation and drawback of the YOLO object detector :

    1) It does not always handle small objects well
    2) It especially does not handle objects grouped close together

    This limitation is due to the YOLO algorithm itself:

    The YOLO object detector divides an input image into an SxS grid where each cell in the grid predicts only a single object.
    If there exist multiple, small objects in a single cell then YOLO will be unable to detect them, ultimately leading to missed object detections.
    Therefore, if you know your dataset consists of many small objects grouped close together then you should not use the YOLO object detector.
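
    The grid limitation can be made concrete with a small sketch that maps object centers to grid cells. It follows the simplified one-object-per-cell description given here; the grid size S = 7, the 416×416 image size, and the two object centers are assumptions picked for the illustration.

```python
def grid_cell(center_x, center_y, image_w, image_h, s):
    # Map an object's center point to the (row, col) of the SxS grid cell it falls in
    col = min(int(center_x / image_w * s), s - 1)
    row = min(int(center_y / image_h * s), s - 1)
    return row, col


S = 7
# Hypothetical centers (in pixels) of two small objects standing close together.
objects = {"wine glass A": (200, 300), "wine glass B": (215, 305)}

cells = {name: grid_cell(x, y, 416, 416, S) for name, (x, y) in objects.items()}
for name, cell in cells.items():
    print(name, "-> cell", cell)
```

    Both centers land in the same cell, so under the one-object-per-cell rule only one of the two objects can be reported.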

    In terms of small objects, Faster R-CNN tends to work the best; however, it’s also the slowest.


    Example of YOLO limitation: YOLO can detect only one of the two wine glasses


    Guidelines when picking an object detector for a given problem:

    – If you need to detect small objects and speed is not a concern, use Faster R-CNN.
    – If speed is absolutely paramount, use YOLO.
    – If you need a balance between YOLO and Faster R-CNN, use SSDs or RetinaNet.


    YOLO implementations 

    Currently there are 3 main implementations of YOLO, each with its own advantages and disadvantages:
    1) Darknet (https://pjreddie.com/darknet/).
    This is the “official” implementation, created by the same people behind the algorithm. It is written in C with CUDA, hence it supports GPU computation. It is actually a complete neural network framework, so it can be used for other objectives besides YOLO detection. The disadvantage is that, since it is written from the ground up (not based on an established neural network framework), it may be more difficult to find answers for errors.

    2) AlexeyAB/darknet (https://github.com/AlexeyAB/darknet).
    This is a fork of Darknet that supports Windows and Linux. It is an excellent source of tips and recommendations about YOLO in general: how to prepare your training set, how to train the network, how to improve object detection, etc.

    3) Darkflow (https://github.com/thtrieu/darkflow/).
    This is a port of Darknet that works on top of TensorFlow. This is the system I have used the most, mainly because I started this project without having a GPU to train the network, and apparently CPU-only Darkflow is several times faster than the original Darknet. AFAIK the main disadvantage is that it has not been updated to YOLOv3.


    All these implementations come “ready to use”, which means you only need to download and install them to start detecting objects in images or videos right away, using already trained weights that are available to download. Naturally, this detection will be limited to the classes contained in the datasets used to obtain these weights.

  16. Sample YOLO script in Python to detect Objects

    Before running the code below, you need to install 64-bit Python from https://www.python.org/downloads/
    Then open a command line as an Administrator and run the following two commands:

    pip install numpy
    pip install opencv-python

    Python code to detect objects using the OpenCV library

    # USAGE
    # python yolo.py --image images/baggage_claim.jpg

    # import the necessary packages
    import numpy as np
    import argparse
    import time
    import cv2
    import os

    # construct the argument parser and parse the arguments
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--image", required=True, help="path to input image")
    ap.add_argument("-c", "--confidence", type=float, default=0.5, help="minimum probability to filter weak detections")
    ap.add_argument("-t", "--threshold", type=float, default=0.3, help="non-maxima suppression overlap threshold, normally between 0.3 and 0.5")
    args = vars(ap.parse_args())


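    # derive the paths to the YOLO weights and model configuration
    # (the model files are expected in a "yolo-coco" folder next to this script)
    YoloConfigDirectory = "yolo-coco"
    weightsPath = os.path.sep.join([YoloConfigDirectory, "yolov3.weights"])
    configPath = os.path.sep.join([YoloConfigDirectory, "yolov3.cfg"])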
    # load the COCO class labels our YOLO model was trained on
    labelsPath = os.path.sep.join([YoloConfigDirectory, "coco.names"])
    LABELS = open(labelsPath).read().strip().split("\n")

    # initialize a list of colors to represent each possible class label
    COLORS = np.random.randint(0, 255, size=(len(LABELS), 3), dtype="uint8")

    # load our YOLO object detector trained on COCO dataset (80 classes)
    print("[INFO] loading YOLO from disk...")
    net = cv2.dnn.readNetFromDarknet(configPath, weightsPath)

    # load our input image and grab its spatial dimensions
    image = cv2.imread(args["image"])
    (H, W) = image.shape[:2]

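    # determine only the *output* layer names that we need from YOLO
    # (flatten() copes with both the nested and the flat index arrays
    # that different OpenCV versions return)
    ln = net.getLayerNames()
    ln = [ln[i - 1] for i in np.array(net.getUnconnectedOutLayers()).flatten()]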

    # construct a blob from the input image and then perform a forward
    # pass of the YOLO object detector, giving us our bounding boxes and
    # associated probabilities
    blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416), swapRB=True, crop=False)
    # hand the blob to the network before running the forward pass
    net.setInput(blob)
    start = time.time()
    layerOutputs = net.forward(ln)
    end = time.time()

    # show timing information on YOLO
    print("[INFO] YOLO took {:.6f} seconds".format(end - start))

    # initialize our lists of detected bounding boxes, confidences, and
    # class IDs, respectively
    boxes = []
    confidences = []
    classIDs = []

    # loop over each of the layer outputs
    for output in layerOutputs:
      # loop over each of the detections
      for detection in output:
        # extract the class ID and confidence (i.e., probability) of the current object detection
        scores = detection[5:]
        classID = np.argmax(scores)
        confidence = scores[classID]

        # filter out weak predictions by ensuring the detected
        # probability is greater than the minimum probability
        if confidence > args["confidence"]:
          # scale the bounding box coordinates back relative to the
          # size of the image, keeping in mind that YOLO actually
          # returns the center (x, y)-coordinates of the bounding
          # box followed by the boxes' width and height
          box = detection[0:4] * np.array([W, H, W, H])
          (centerX, centerY, width, height) = box.astype("int")

          # use the center (x, y)-coordinates to derive the top
          # and left corner of the bounding box
          x = int(centerX - (width / 2))
          y = int(centerY - (height / 2))

          # update our list of bounding box coordinates, confidences,
          # and class IDs
          boxes.append([x, y, int(width), int(height)])
          confidences.append(float(confidence))
          classIDs.append(classID)

    # apply non-maxima suppression to eliminate overlapping duplicate
    # boxes and keep the most confident one
    idxs = cv2.dnn.NMSBoxes(boxes, confidences, args["confidence"], args["threshold"])

    # ensure at least one detection exists
    if len(idxs) > 0:
      # loop over the indexes we are keeping
      for i in np.array(idxs).flatten():
        # extract the bounding box coordinates
        (x, y) = (boxes[i][0], boxes[i][1])
        (w, h) = (boxes[i][2], boxes[i][3])

        # draw a bounding box rectangle and label on the image
        color = [int(c) for c in COLORS[classIDs[i]]]
        cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
        text = "{}: {:.4f}".format(LABELS[classIDs[i]], confidences[i])
        cv2.putText(image, text, (x, y - 5), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

    # show the output image and wait for a key press before closing the window
    cv2.imshow("Image", image)
    cv2.waitKey(0)


    This code assumes that there is a folder named “yolo-coco” next to the Python script containing 3 files (yolov3.weights, yolov3.cfg, and coco.names). These are the model files of an object detector pre-trained on the COCO dataset.




    The Python script takes 3 parameters:
    --image : The path to the input image. We’ll detect objects in this image using YOLO.
    --confidence : Minimum probability to filter weak detections. The default value is 0.5.
    --threshold : This is our non-maxima suppression threshold, with a default value of 0.3.


    Tune the threshold to avoid detecting the same object multiple times.
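
    What the --threshold parameter controls can be sketched with a small pure-Python version of non-maxima suppression. This is not OpenCV's implementation, only an illustration of the idea; the helper names iou and nms and the three example boxes are made up.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x, y, w, h] (top-left corner)."""
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh, iou_thresh):
    # Visit boxes from most to least confident, skipping weak detections
    order = sorted((i for i, s in enumerate(scores) if s > score_thresh),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Two nearly identical boxes around one object, plus one separate object.
boxes = [[10, 10, 100, 100], [15, 12, 100, 100], [300, 50, 80, 120]]
scores = [0.9, 0.8, 0.7]

kept = nms(boxes, scores, score_thresh=0.5, iou_thresh=0.3)
print(kept)
```

    A lower overlap threshold removes more near-duplicate boxes, while a higher one keeps boxes that overlap more, which helps when real objects sit close together.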





    Here is the script result after running it on one image:








  17. What are the Best Free Image Datasets for Computer Vision?

    Google released its open-source image dataset “Open Images V5” in 2019. It is currently the largest free dataset available, containing ~9 million images that have been annotated with labels spanning over 6,000 categories. For more information about the dataset and how to get it, please visit https://storage.googleapis.com/openimages/web/index.html


    Other Image Datasets for Computer Vision Training

    1) Visual Genome: (convert image to words) Visual Genome is a dataset and knowledge base created in an effort to connect structured image concepts to language. The database features a detailed visual knowledge base with captions for 108,077 images.


    2) VisualQA: VQA is a dataset containing open-ended questions about 265,016 images. These questions require an understanding of vision and language. So, we can ask “How many children are in the bed?” or “Where is the child sitting?”


    3) CelebFaces: Face dataset with more than 200,000 celebrity images, each with 40 attribute annotations (such as wavy hair, smile, mustache…)


    4) CompCars: Contains 163 car makes with 1,716 car models, with each car model labeled with five attributes, including maximum speed, displacement, number of doors, number of seats, and type of car.


    5) Indoor Scene Recognition: A very specific dataset. It contains 67 indoor categories (such as store, workplace, home, public space) and a total of 15,620 images.


    6) Labelled Faces in the Wild: 13,000 labeled images of human faces, for use in developing applications that involve facial recognition, such as getting a person’s name from a captured image.


    7) Stanford Dogs Dataset: Contains 20,580 images and 120 different dog breed categories, with about 150 images per class.


    8) Places: Scene-centric database with 205 scene categories and 2.5 million images with a category label. It can be used to detect indoor, outdoor, open area, natural light, clouds, sunny, …


    9) Flowers: Dataset of images of flowers commonly found in the UK consisting of 102 different categories. Each flower class consists of between 40 and 258 images with different pose and light variations.


    10) Plant Image Analysis: A collection of datasets spanning over 1 million images of plants. You can choose from 11 species of plants.


    11) Home Objects: A dataset that contains random objects from home, mostly from kitchen, bathroom and living room split into training and test datasets.

