PhD Project - Embodied Cognition in Virtual Environments with Diachronic Analysis of Linguistic and Visual Inputs
Starting December 2020
Historical analysis of cognitive structures has been conducted through the study of spatial relationships in historical texts, images, and maps. Recent advances in machine learning present an opportunity to extend this analysis to embodied cognition by placing agents in virtual environments. Consider an agent that must navigate a virtual rendering of a historical location using linguistic instructions and visual cues derived from contemporaneous guidebooks and maps. Such an agent must process linguistic, visual, and spatial information related to the source documents under study, and through performing actions it will develop representations and reasoning grounded in the environment and its inputs. In this research, agents will be placed in virtual environments to perform two tasks. First, agents will navigate to a destination with the assistance of natural language instructions drawn from texts. On reaching the destination, agents will conduct place recognition using linguistic and visual cues derived from texts and images. The proposed research will enable diachronic analysis of cognitive structures in relation to the locations and entities in the artefacts under investigation.
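The two-stage episode structure described above can be sketched as a minimal driver loop. All class and method names here (`Observation`, `agent.act`, `agent.recognise`, `env.reset`, `env.step`) are illustrative assumptions, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class Observation:
    """A single agent observation (fields are illustrative)."""
    instruction: str   # natural language instruction from a guidebook
    view: list         # placeholder for visual features at the current pose
    position: tuple    # (x, y) coordinate in the virtual environment

def run_episode(agent, env, max_steps=100):
    """Run one two-stage episode: navigate, then recognise the place.

    `agent` is assumed to expose `act(obs)` returning a move or "STOP",
    and `recognise(obs)` returning a place label; `env` is assumed to
    expose `reset()` and `step(action)`.
    """
    obs = env.reset()
    # Stage 1: follow the linguistic instructions toward a destination.
    for _ in range(max_steps):
        action = agent.act(obs)
        if action == "STOP":
            break
        obs = env.step(action)
    # Stage 2: identify the place from linguistic and visual cues
    # available at the final pose.
    return agent.recognise(obs)
```

The sketch only fixes the interface between the two tasks; the navigation policy and the place-recognition model behind `act` and `recognise` are the actual subjects of the research.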
This project will develop and evaluate machine learning methods for embodied vision and language tasks. The research plan combines visual, linguistic, and geospatial information in cross-modal and multimodal formulations. In the Vision and Language Navigation task, agents must learn from multimodal inputs, generate visual location summaries, and perform egocentric spatial localisation. The use of natural language instructions in this task requires systems that handle linguistic subtasks including coreference resolution, semantic change over time, and rare words. Place recognition will employ techniques from both computational linguistics and computer vision. A multitask framework will be optimised to conduct Vision and Language Navigation and place recognition over the linguistic, visual, and geospatial inputs. Constituent models in this framework will employ brain-inspired architectures and learning methods, with a focus on minimising the number of samples required to train a system on the second task. Statistical methods to be evaluated include testing log-normal and Laplace distributions as alternatives to Gaussian distributions when combining data.
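One way to carry out the distributional comparison mentioned above is to fit each candidate distribution by maximum likelihood and compare information criteria such as AIC. The following standard-library sketch (function names are illustrative, not from the project) computes the maximised log-likelihood of Gaussian, Laplace, and log-normal fits to a one-dimensional sample:

```python
import math

def gaussian_ll(xs):
    """Maximised Gaussian log-likelihood (MLE mean and variance)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def laplace_ll(xs):
    """Maximised Laplace log-likelihood.

    The MLE of the location parameter is a sample median; the MLE of
    the scale is the mean absolute deviation about it.
    """
    n = len(xs)
    mu = sorted(xs)[n // 2]
    b = sum(abs(x - mu) for x in xs) / n
    return -n * (math.log(2 * b) + 1)

def lognormal_ll(xs):
    """Maximised log-normal log-likelihood (requires positive data).

    The log-normal density of x equals the normal density of ln(x)
    divided by x, so the log-likelihood is the Gaussian fit on the
    logs minus the sum of the logs.
    """
    logs = [math.log(x) for x in xs]
    return gaussian_ll(logs) - sum(logs)

def aic(log_likelihood, k=2):
    """Akaike information criterion; all three models have k=2 parameters."""
    return 2 * k - 2 * log_likelihood
```

Fitting all three families to the same sample and selecting the lowest AIC gives a simple, defensible test of whether heavy-tailed or skewed alternatives fit the combined data better than a Gaussian.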