We are aware that Netflix is using AI to generate personalised thumbnail pics for its TV and movie content, but we think this article is the first example of using AI to create cover pics in a creator platform. We have also open-sourced our implementation on GitHub.
Creating video thumbnails can be time consuming for creators, particularly if they are shooting content live and then immediately moving it to an on-demand catalog. Peloton is an example of this, with over 10,000 classes that were created live, all with similar-looking (but different) cover pics showing the instructor smiling at the camera. Peloton has the resources to do this manually, but for many local creators recording Zoom classes, finding and uploading cover pics has to be done after the event and it’s onerous. Our eureka moment was asking, “surely AI could do a good enough job of this?” This blog is about our experiences using AI to select thumbnail cover pics from video recordings.
To fill this gap, we needed an AI service that could pick 4-5 frames from a stored video (either a recorded live stream or a manually uploaded file) in which the person is looking at the camera, their eyes are open, and they are smiling.
To break it down:
- Once the staff has uploaded the video or the Zoom recording is finished, the AI/ML service should begin searching for the facial attributes mentioned above.
- After analysis, it should return the response along with the grabbed frames.
- The returned frames need to be cropped into a variety of sizes and shapes (circular, and rectangular in 16:9, 4:3, and 1:1 ratios).
After doing some investigation, we learned that there isn't a single service that can perform all of these tasks. Both the Amazon Rekognition service from AWS and the Video Intelligence API from Google can detect faces and other attributes. Amazon’s Rekognition service returns the detected metadata attributes in the response but does not return the grabbed frames, whereas Google’s Video Intelligence API does return the grabbed frames, but they are so small that they were not helpful to us. See for yourself below -
We looked at both the Rekognition service and the Video Intelligence API from an implementation point of view, and here are our observations. After that, we’ll show the actual side-by-side results we got from the two AI services for the videos we tested, and which one was better for our needs.
AWS’s Amazon Rekognition
Amazon Rekognition is a service that analyzes images/videos, finds and compares faces, and returns the detected attributes in its response. It only works with videos stored on S3 or streamed via Kinesis; the video must be encoded with the H.264 codec, and the supported file formats are MPEG-4 and MOV. It checks whether the input contains a face. If so, Amazon Rekognition locates the face in the image/video, examines its facial landmarks, such as the position of the eyes, and detects any emotions (such as appearing happy or sad). It then returns a percent confidence score for the face and for each facial attribute detected. In effect, it looks for the requested metadata in every frame and returns the result in the response, so the more frames there are in the video, the longer the analysis takes.
Response attributes:
- A Confidence score - a value between 0 and 100 indicating how confident the Rekognition service is that the detected entity is a face.
- BoundingBox - the identified face's width, height, and x, y coordinates.
- AgeRange - the estimated lowest and highest age, each a value between 0 and 100.
- Smile - a true/false value indicating whether the person is smiling, along with a confidence score (0-100) indicating how confident AWS is in that determination.
- Eyeglasses - a true/false value indicating whether or not the person is wearing eyeglasses, and a confidence score (0-100) as described above.
- Sunglasses - a true/false value indicating whether or not the person is wearing sunglasses, and the same confidence score (0-100) as above.
- Gender - gender type and level of confidence.
- Beard - true/false status and level of confidence.
- Mustache - true/false status and confidence score.
- Emotion - a list of emotions detected on the face, each with a confidence score.
- Image Quality - brightness and sharpness values.
There are a few other attributes it returns in response, but the ones mentioned above were useful to us, so we have excluded the remaining ones. For a complete response, see the Rekognition documentation.
The implementation steps are -
1. The first step is to instantiate RekognitionClient as shown below -
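A minimal sketch of the client setup, assuming the AWS SDK for JavaScript v3; the region is illustrative and credentials come from the environment:

```typescript
import { RekognitionClient } from "@aws-sdk/client-rekognition";

// Region is illustrative; credentials are resolved from the environment
// (env vars, shared config, or an attached IAM role).
const rekognition = new RekognitionClient({ region: "us-east-1" });
```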
2. The next step is to start face detection. The request takes the S3 bucket and the video to analyze. An SNS topic can also be passed in the request to chain the job asynchronously: listen for the SNS notification and process the analysis result in a separate AWS Lambda function or by other means -
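Something along these lines, assuming the same SDK; the bucket, key, topic ARN, and role ARN are placeholders:

```typescript
import { StartFaceDetectionCommand } from "@aws-sdk/client-rekognition";

const startResponse = await rekognition.send(
  new StartFaceDetectionCommand({
    // Placeholder bucket/key for the stored recording.
    Video: { S3Object: { Bucket: "my-video-bucket", Name: "classes/recording.mp4" } },
    FaceAttributes: "ALL", // return all facial attributes, not just the default subset
    NotificationChannel: {
      // Placeholder ARNs; Rekognition publishes here when the job finishes.
      SNSTopicArn: "arn:aws:sns:us-east-1:123456789012:face-detection-complete",
      RoleArn: "arn:aws:iam::123456789012:role/RekognitionSNSPublishRole",
    },
  })
);

if (!startResponse.JobId) throw new Error("Rekognition did not return a JobId");
const jobId = startResponse.JobId; // needed later to fetch the analysis result
```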
3. Once face detection is complete, a second request is made to the Rekognition service to read the returned response. Because Rekognition is essentially a prediction machine and the scores are prediction confidence values, we store the timestamp and the confidence scores of the detected face, eyes-open, and smiling attributes in an array -
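A sketch of reading the result, assuming the SNS notification (or polling) has already told us the job is complete; `jobId` comes from the previous step:

```typescript
import { GetFaceDetectionCommand } from "@aws-sdk/client-rekognition";

// One entry per detected face: the frame timestamp (ms) plus the confidence
// scores we care about (face, eyes open, smile).
type Candidate = { timestampMs: number; face: number; eyesOpen: number; smile: number };
const candidates: Candidate[] = [];

let nextToken: string | undefined;
do {
  const result = await rekognition.send(
    new GetFaceDetectionCommand({ JobId: jobId, NextToken: nextToken })
  );
  for (const detection of result.Faces ?? []) {
    const face = detection.Face;
    // Keep only frames where Rekognition believes the eyes are open and the person is smiling.
    if (face?.EyesOpen?.Value && face?.Smile?.Value) {
      candidates.push({
        timestampMs: detection.Timestamp ?? 0,
        face: face.Confidence ?? 0,
        eyesOpen: face.EyesOpen.Confidence ?? 0,
        smile: face.Smile.Confidence ?? 0,
      });
    }
  }
  nextToken = result.NextToken;
} while (nextToken);
```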
4. The array created in the previous step is then sorted in descending order based on the confidence scores of the face, eyes-open, and smiling attributes to obtain the top three frames.
5. Then we instruct ffmpeg to grab frames from the video at the timestamps of the first three items in the sorted array -
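A sketch of the sort and the frame grab; we read the sort in step 4 as "highest combined confidence first", and the input path and output names are placeholders:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Sort by the combined face / eyes-open / smile confidence, best first, and keep three.
const topThree = [...candidates]
  .sort((a, b) => (b.face + b.eyesOpen + b.smile) - (a.face + a.eyesOpen + a.smile))
  .slice(0, 3);

// Grab one still per timestamp; ffmpeg accepts the seek position in seconds.
for (const [index, candidate] of topThree.entries()) {
  const seconds = (candidate.timestampMs / 1000).toFixed(3);
  await run("ffmpeg", [
    "-ss", seconds,              // seek to the detected frame
    "-i", "classes/recording.mp4",
    "-frames:v", "1",            // grab a single frame
    "-q:v", "2",                 // high JPEG quality
    `thumbnail-${index}.jpg`,
  ]);
}
```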
6. Rekognition returns bounding box coordinates for each detected face, which is essentially a box around the detected face. The idea is that one might want to label faces in an image/video, similar to what we see in several surveillance systems. A BoundingBox has the following properties:
- Height – The height of the bounding box as a ratio of the overall image height.
- Left – The left coordinate of the bounding box as a ratio of overall image width.
- Top – The top coordinate of the bounding box as a ratio of overall image height.
- Width – The width of the bounding box as a ratio of the overall image width.
We add 100px padding around the box coordinates before taking full-height stills and cropping the edges off -
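A sketch of that cropping step; `frameWidth`/`frameHeight` (the still's pixel dimensions, e.g. taken from Rekognition's VideoMetadata or probed with ffprobe) and the file paths are assumptions:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Crop the grabbed still around the detected face. `box` is Rekognition's
// BoundingBox for that frame, expressed as ratios of the frame dimensions.
async function cropAroundFace(
  inputImage: string,
  outputImage: string,
  box: { Left?: number; Top?: number; Width?: number; Height?: number },
  frameWidth: number,
  frameHeight: number,
  paddingPx = 100
): Promise<void> {
  // Convert the ratio-based box to pixels, add padding, and clamp to the frame.
  const left = Math.max(0, Math.round((box.Left ?? 0) * frameWidth) - paddingPx);
  const top = Math.max(0, Math.round((box.Top ?? 0) * frameHeight) - paddingPx);
  const width = Math.min(frameWidth - left, Math.round((box.Width ?? 1) * frameWidth) + 2 * paddingPx);
  const height = Math.min(frameHeight - top, Math.round((box.Height ?? 1) * frameHeight) + 2 * paddingPx);

  await run("ffmpeg", [
    "-i", inputImage,
    "-vf", `crop=${width}:${height}:${left}:${top}`, // crop=w:h:x:y
    "-y", outputImage,
  ]);
}
```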
Google’s Video Intelligence API
Conceptually, and in terms of features and how it works, it is very similar to the Rekognition service, and it works with both stored and local images/videos. Standard live-streaming protocols such as RTSP, RTMP, and HLS are also supported.
Response attributes:
- A Confidence score - a value between 0 and 1 indicating how confident the Video Intelligence API is that the detected entity is a face.
- NormalizedBoundingBox - the identified face's top, left, right, and bottom coordinates.
- Smiling - attribute name and a confidence score between 0 and 1.
- Glasses - whether the detected face is wearing glasses, and a confidence score between 0 and 1.
- Eyes visible - attribute name and a confidence score between 0 and 1.
- Looking at camera - attribute name and a confidence score between 0 and 1.
- Mouth open - attribute name and a confidence score between 0 and 1.
- Headwear - attribute name and a confidence score between 0 and 1.
- timeOffset - Time in seconds and nanoseconds indicating where that frame is in the video.
- Thumbnail - detected face thumbnail.
There are a few other attributes it returns in response, which aren’t mentioned here.
The implementation steps are -
1. The first step is to set up the VideoIntelligenceServiceClient and then start face detection by passing either the URL of the stored video or the local file encoded as a base64 string in the request -
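A sketch, assuming the @google-cloud/video-intelligence Node client and a video stored in Cloud Storage (the bucket/object are placeholders; a local file would instead be passed base64-encoded via `inputContent`):

```typescript
import { v1, protos } from "@google-cloud/video-intelligence";

const client = new v1.VideoIntelligenceServiceClient();
const Feature = protos.google.cloud.videointelligence.v1.Feature;

// annotateVideo returns a long-running operation; wait for it to finish.
const [operation] = await client.annotateVideo({
  inputUri: "gs://my-video-bucket/classes/recording.mp4", // placeholder
  features: [Feature.FACE_DETECTION],
  videoContext: {
    faceDetectionConfig: {
      includeBoundingBoxes: true,
      includeAttributes: true, // needed for the smiling / eyes-visible / looking-at-camera attributes
    },
  },
});
const [response] = await operation.promise();
```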
2. After face detection is complete, the next step is to read the returned response. Video Intelligence divides the period during which faces are visible into small segments. Then, similar to Rekognition, the Video Intelligence API returns the segment start/end times as well as the detected facial attributes, each with a prediction score (a value between 0 and 1). For frame grabbing, we use the midpoint of the segment's start/end times and store it in an array along with the prediction scores of the eyes-visible, looking-at-camera, and smiling attributes -
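A sketch of collecting the candidates, assuming the `response` from the previous step; the snake_case attribute names (`eyes_visible`, `looking_at_camera`, `smiling`) correspond to the attributes listed above but are an assumption about exact naming:

```typescript
// One entry per face track: segment midpoint (ms) plus the attribute scores we care about.
type Candidate = { timestampMs: number; eyesVisible: number; lookingAtCamera: number; smiling: number };
const candidates: Candidate[] = [];

// Convert a protobuf Duration ({ seconds, nanos }) into milliseconds.
const toMs = (d?: { seconds?: unknown; nanos?: unknown } | null): number =>
  Number(d?.seconds ?? 0) * 1000 + Number(d?.nanos ?? 0) / 1e6;

for (const annotation of response.annotationResults?.[0]?.faceDetectionAnnotations ?? []) {
  for (const track of annotation.tracks ?? []) {
    // Midpoint of the segment in which this face is visible.
    const timestampMs =
      (toMs(track.segment?.startTimeOffset) + toMs(track.segment?.endTimeOffset)) / 2;

    // Per-frame attributes live on the timestamped objects; read them from the first one.
    const attributes = track.timestampedObjects?.[0]?.attributes ?? [];
    const score = (name: string) => attributes.find((a) => a.name === name)?.confidence ?? 0;

    candidates.push({
      timestampMs,
      eyesVisible: score("eyes_visible"),
      lookingAtCamera: score("looking_at_camera"),
      smiling: score("smiling"),
    });
  }
}
```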
3. The array created in the previous step is then sorted in descending order based on the confidence scores of the looking-at-camera, eyes-visible, and smiling attributes to obtain the top three frames.
4. Then we instruct ffmpeg to grab frames from the video at the timestamps of the first three items in the sorted array, just as we did with Rekognition -
5. Video Intelligence, like Rekognition, returns bounding box coordinates for each detected face; it calls them "normalizedBoundingBox". These were incompatible with the crop function we had written, so we ended up using Rekognition again, this time just to get the bounding box from the grabbed frame, and then applied the same cropping mechanism we used for Rekognition. (Note - if you're going with Video Intelligence, avoid the extra Rekognition call and write your crop function so that it works with normalizedBoundingBox, as sketched below.)
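If you do keep everything on Video Intelligence, a crop helper that works directly with normalizedBoundingBox might look like this sketch (the padding, paths, and frame dimensions are assumptions; normalizedBoundingBox values are 0-1 fractions of the frame):

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Crop a grabbed still using Video Intelligence's normalizedBoundingBox directly,
// so no extra Rekognition call is needed just for the box.
async function cropFromNormalizedBox(
  inputImage: string,
  outputImage: string,
  box: { left?: number | null; top?: number | null; right?: number | null; bottom?: number | null },
  frameWidth: number,
  frameHeight: number,
  paddingPx = 100
): Promise<void> {
  // Scale the 0-1 box to pixels, pad it, and clamp to the frame edges.
  const left = Math.max(0, Math.round((box.left ?? 0) * frameWidth) - paddingPx);
  const top = Math.max(0, Math.round((box.top ?? 0) * frameHeight) - paddingPx);
  const right = Math.min(frameWidth, Math.round((box.right ?? 1) * frameWidth) + paddingPx);
  const bottom = Math.min(frameHeight, Math.round((box.bottom ?? 1) * frameHeight) + paddingPx);

  await run("ffmpeg", [
    "-i", inputImage,
    "-vf", `crop=${right - left}:${bottom - top}:${left}:${top}`, // crop=w:h:x:y
    "-y", outputImage,
  ]);
}
```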
Observations
- We noticed that it can sometimes take 10-12 minutes to analyze a 5-minute video for faces with both services, which can be too long. So it's best to make a short clip of the original video and then run face detection on it (see the sketch after this list).
- In their responses, both services should return the frame at which the face was detected (the Video Intelligence API does return a frame, but it is too tiny) so that another service such as AWS MediaConvert isn't needed; running the video analysis and MediaConvert separately could be too time consuming.
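A sketch of the clipping idea from the first observation; the paths and the two-minute cut are placeholders:

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Take roughly the first two minutes of the recording without re-encoding,
// then run face detection on the much smaller clip.
await run("ffmpeg", [
  "-i", "classes/recording.mp4",
  "-t", "120",   // keep the first 120 seconds
  "-c", "copy",  // stream copy: fast, no quality loss
  "classes/recording-clip.mp4",
]);
```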
Conclusion
Our observations show that AWS's Rekognition service performs better than the Google Video Intelligence API for the use cases stated at the beginning - it's fairly obvious which one is best when you look at the thumbnails generated from the same videos.
Results
Video1 ( recorded using Zoom )
Video2 ( recorded using Zoom )
Video3 ( recorded using Zoom )
Video4 ( recorded using IVS )
Video5 ( recorded using IVS )
Video6 ( recorded using IVS )
Note - To obtain frames from the video, one must use ffmpeg or another tool. However, ffmpeg expects the frame's time in seconds or as an "HH:MM:SS.mmm"-style timestamp, and converting milliseconds into this format can occasionally introduce a few milliseconds/nanoseconds of inaccuracy. This is a good reason why Rekognition and the Video Intelligence API should themselves return the detected frame at the original resolution.
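For reference, a small sketch of the millisecond-to-timestamp conversion we're referring to (passing plain fractional seconds to ffmpeg also works and avoids one conversion step):

```typescript
// Convert a millisecond offset into ffmpeg's "HH:MM:SS.mmm" form.
const toFfmpegTimestamp = (ms: number): string => {
  const totalSeconds = Math.floor(ms / 1000);
  const hh = String(Math.floor(totalSeconds / 3600)).padStart(2, "0");
  const mm = String(Math.floor((totalSeconds % 3600) / 60)).padStart(2, "0");
  const ss = String(totalSeconds % 60).padStart(2, "0");
  const mmm = String(Math.round(ms % 1000)).padStart(3, "0");
  return `${hh}:${mm}:${ss}.${mmm}`;
};

// Example: toFfmpegTimestamp(83450) === "00:01:23.450"
```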