Facebook AI’s #Ego4D: Training Next-Generation AI

Hugo | ML
4 min read · Oct 18, 2021


Facebook recently unveiled #Ego4D, a research project that aims to create AI-powered technologies that can interact with the real world from an egocentric, or first-person, perspective, the way humans do instinctively. The outcomes of this project could unlock a new era of immersive AI experiences that bring physical, augmented, and virtual reality together.

Use cases for #Ego4D’s research are boundless. Augmented reality (AR) glasses and virtual reality (VR) headsets will now have the capabilities needed to integrate into and enhance our everyday lives. Imagine a world where you can learn new activities from your AR glasses: how to play the drums, cook a five-course meal, or even understand a different language in real time. “Inception” will no longer be a fictitious concept from a movie, but a capability you can use to replay your favorite memories at will. This is the future that #Ego4D is creating.

“Next-generation AI systems will need to learn from an entirely different kind of data — videos that show the world from the center of the action, rather than the sidelines”

- Kristen Grauman, lead research scientist at Facebook.

So how do researchers train their AI models to see the world as humans do? The first step was to source higher-quality data from a wider breadth of sources than had been used in past egocentric perception projects. Historically, most research teams have trained their machine learning models on millions of photos and videos captured from the third-person perspective, but in the #Ego4D project, researchers used images and videos captured from the first-person perspective. That data came from over 700 participants across nine countries recording their everyday lives, providing a diversity of perspectives and a richness that has given rise to what many researchers are calling “embodied AI”: robots and machines that can engage with the world as we do.
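As a purely illustrative aside (none of this is from the Ego4D codebase), the short Python sketch below shows one common way first-person clips like these might be sampled into frames before being fed to a model; the file name and sampling interval are hypothetical.

```python
import cv2  # OpenCV, a standard choice for decoding video frames


def sample_frames(video_path: str, every_n_seconds: float = 1.0):
    """Yield (timestamp_in_seconds, frame) pairs from a wearable-camera clip."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back to 30 fps if metadata is missing
    step = max(1, int(round(fps * every_n_seconds)))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of clip
            break
        if index % step == 0:
            yield index / fps, frame
        index += 1
    cap.release()


# Hypothetical usage: sample one frame every two seconds from a participant's clip.
# frames = [f for _, f in sample_frames("participant_0421_kitchen.mp4", every_n_seconds=2.0)]
```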

Beyond capturing this rich dataset, the next step in model training was to define the right benchmarks, paving the way for a shared understanding of what it means to have intelligent egocentric perception. These benchmarks focus on how a person relates to their environment, to objects, and to other people. They take into account daily events of the past, present, and future, and aim to understand what you would like to remember, what you are doing right now, and what you might want to do next. (A toy sketch of how one of these queries could be represented in code follows the list below.)

These benchmark tasks are:

  • Episodic memory: What happened when? (e.g., “Where did I leave my keys?”)
  • Forecasting: What am I likely to do next? (e.g., “I just ate pizza, should I have a soda with it next?”)
  • Hand and object manipulation: What am I doing and how? (e.g., “Teach me how to play the drums.”)
  • Audio-visual diarization (AVD): Who said what when? (e.g., “What was the main topic during class?”)
  • Social interaction: How are we interacting? (e.g., “Help me better hear the person talking to me at this noisy restaurant.”)
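To make the episodic memory task a little more concrete, here is a toy, hypothetical Python sketch of how such a query and its answer (a window of time in a first-person video) might be represented; the class names and values are illustrative and not part of the actual Ego4D release.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TemporalWindow:
    """A span of a specific video, in seconds, that answers a query."""
    video_id: str
    start_s: float
    end_s: float


@dataclass
class EpisodicMemoryQuery:
    """A natural-language question about the wearer's past, answered by pointing back into the video."""
    question: str
    answer: Optional[TemporalWindow] = None


# Toy example of the kind of query/answer pair an episodic-memory model would produce.
query = EpisodicMemoryQuery(question="Where did I leave my keys?")
query.answer = TemporalWindow(video_id="clip_0042", start_s=312.5, end_s=318.0)
```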

Building these benchmarks required rigorous and complex annotation carried out by third-party annotators. What made annotation in this project so complex is that every object in every frame of thousands of hours of video had to be labeled according to the benchmarks described above. For example, to support the hand and object manipulation benchmark, annotators had to determine when each action took place, down to the millisecond. Each frame then needed to be marked in four different ways before the objects in it could be labeled. This level of annotation is what makes #Ego4D such a rich project, capable of creating smarter, more interactive, and more flexible AI.
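For a rough sense of what one such label might contain, here is a hypothetical sketch of a single annotation record for the hand and object manipulation benchmark; the field names and values are illustrative assumptions, not the real Ego4D annotation schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ObjectLabel:
    """One labeled object in a single video frame."""
    frame_index: int
    name: str                          # e.g. "drumstick"
    bbox: Tuple[int, int, int, int]    # (x, y, width, height) in pixels


@dataclass
class ActionAnnotation:
    """A temporally localized action, marked by an annotator down to the millisecond."""
    video_id: str
    action: str
    start_ms: int
    end_ms: int
    objects: List[ObjectLabel] = field(default_factory=list)


# Illustrative example of one annotated action in a drumming clip.
example = ActionAnnotation(
    video_id="clip_0042",
    action="strikes snare drum",
    start_ms=15_320,
    end_ms=15_910,
    objects=[ObjectLabel(frame_index=459, name="drumstick", bbox=(212, 148, 40, 180))],
)
```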

Perhaps the most exciting part of #Ego4D is not just the wide range of scenes, people, and activities it includes, but also the diversity of the annotation teams, which helps reduce the bias inherent in most AI-powered products on the market today. Thanks to these diverse teams, AI applications will recognize, for example, the different types of curry, as well as the fact that curry can mean different things to different people in different places.

The #Ego4D project is redefining what it means to exist in a world with AI and, ultimately, what it means to build inclusive, representative AI applications.

References:

https://ai.facebook.com/blog/teaching-ai-to-perceive-the-world-through-your-eyes

Project Page: https://ego4d-data.org/

---

As a mission-driven company, Hugo exists to help make technology work for more communities. We are built to tackle complex and nuanced annotation projects, like #Ego4D, that help researchers and technologists develop better, less-biased AI-powered applications. We are the world’s largest Black-owned data annotation provider, with over 1,200 university-educated annotators working for FANGMA and other tech giants. Contact us to learn how we can help you launch into the future of AI.


Hugo | ML

We build high-performance support teams for some of the world’s biggest tech companies. Driven by African millennials. #ImpactSourcing