CodeNav: Beyond tool-use to using real-world codebases with LLM agents

PRIOR @ Allen Institute for Artificial Intelligence

CodeNav solves user queries by directly searching, importing, and using code from any target codebase, without requiring manual tool registration.


We present CodeNav, an LLM agent that navigates and leverages previously unseen code repositories to solve user queries. In contrast to tool-use LLM agents that require "registration" of all relevant tools via manual descriptions within the LLM context, CodeNav automatically indexes and searches over code blocks in the target codebase, finds relevant code snippets, imports them, and uses them to iteratively generate a solution with execution feedback.
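
The index, search, and execute steps described above can be illustrated with a toy, in-memory sketch. This is not CodeNav's actual implementation: the helper names (`index_code_blocks`, `search`, `execute`), the keyword-overlap scoring, and the sample source are all assumptions made for illustration, and the LLM that would write the search queries and code is left out.

```python
import ast
import contextlib
import io
import traceback


def index_code_blocks(source, path):
    """Split one Python source string into function/class blocks (hypothetical helper)."""
    blocks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            blocks.append({
                "path": path,
                "name": node.name,
                "doc": ast.get_docstring(node) or "",
                "code": ast.get_source_segment(source, node) or "",
            })
    return blocks


def search(blocks, query, top_k=3):
    """Rank blocks by how many query terms appear in their name or docstring."""
    terms = set(query.lower().split())

    def score(block):
        text = f"{block['name']} {block['doc']}".lower().replace("_", " ")
        return sum(1 for term in terms if term in text)

    ranked = sorted(blocks, key=score, reverse=True)
    return [block for block in ranked if score(block) > 0][:top_k]


def execute(code, namespace):
    """Run generated code and return its stdout, plus the traceback on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


# Made-up stand-in for a target codebase file.
SAMPLE = '''
def caesar_decode(message, shift):
    """Decode a message using the Caesar cipher."""
    return message

def count_nucleotides(dna_sequence):
    """Count the occurrences of each nucleotide in a DNA sequence."""
    return {}
'''

blocks = index_code_blocks(SAMPLE, "toy_module.py")
hits = search(blocks, "decode caesar cipher")
print([hit["name"] for hit in hits])
print(execute("print(1 + 1)", {}))
```

In an agent loop, the text returned by `execute` (output or traceback) would be appended to the LLM context so that the next step can correct the code or issue a new search.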

On this project page, we showcase the core capabilities of CodeNav through case studies on three different codebases, visualize CodeNav's predictions on tool-use benchmarks, highlight key experimental insights, and provide a side-by-side comparison of library and tool descriptions.

Case Studies

To highlight the core capabilities of CodeNav, we present three case studies in which CodeNav solves complex user queries using three diverse codebases.

Case Study 1: Multimodal Reasoning & Editing

Using CodeNav for multimodal applications simply means specifying a codebase that has multimodal functionality, such as functions or classes for detection, segmentation, visual question answering, image generation and editing, speech recognition, etc. In this case study, we use CodeNav to solve a multimodal task using the tool implementations from the m&m's benchmark as the target codebase.

Interactions Visualization bbt.html

Case Study 2: Research Assistant

Here we demonstrate the versatility of code-use by asking CodeNav to act as a research assistant. To solve this query, the agent needs to search Wikipedia, the internet, and arXiv to find relevant definitions, news articles, and research papers. Finally, the agent needs to present this information to the user in an accessible format. We use the phidata codebase for this task.

Interactions Visualization alphafold_study.html

Case Study 3: CodeNav using the CodeNav codebase

This example demonstrates a truly real-world use case in which you might want a code-use agent to help run experiments using your own codebase. Here, we instruct CodeNav to use the CodeNav codebase itself to run an experiment. The experiment involves setting up the CodeNav agent and various environments and then running an interaction to solve a query based on the Hugging Face transformers codebase. In other words, not only does the top-level CodeNav agent need to understand the codebase written by the CodeNav authors in order to create and set up the agent and the environments, but the agent it creates must in turn figure out how to use the transformers repository.

Interactions Visualization detections.jpg

Qualitative Results on Tool-use Benchmarks


The m&m's benchmark evaluates the ability to use computer vision, NLP, audio processing, and knowledge querying tools and APIs. Here we visualize 200 randomly selected user queries along with ground truth code, CodeNav generated code, per-sample metrics, and interaction trajectories.



M3ToolEval assesses tool-use capabilities on a range of tasks such as travel planning, DNA sequencing, web browsing (with templated web pages), and encoding/decoding messages. Here, we visualize all user queries with ground-truth (GT) and predicted answers along with CodeNav's interaction trajectories.


Key Insights

On tool-use benchmarks where tool registration is possible due to a limited number of tools, code-use is competitive with tool-use

Tool-use forms an upper bound on the performance of code-use agents. This is because manual registration of tools provides strictly more information to the agent than code-use, where the agent must first search for or discover the tools in the codebase. However, we find that with powerful LLMs like GPT-4, code-use agents are fairly competitive with tool-use agents, and this gap is likely to shrink further with even better LLMs.

Library descriptions succinctly provide necessary information to the code-use agent while being less verbose than tool descriptions

To use a codebase successfully, the more the agent knows about the codebase the better. In the first 4 rows, we progressively provide more information (increasing description length) to the code-use agent and find that providing tool names and descriptions leads to significant improvements in performance.

However, codebases are not just a random collection of tools, but are structured and organized to help developers navigate and use them. This structure can be exploited by providing a succinct library description (last row) to the agent. This description does not need to provide exact function or class names but acts as a README, giving the agent just enough information to search the codebase effectively and find the relevant implementation details. We provide a few examples of tool and library descriptions below.

An area of improvement for open source LLMs

To be successful in code-use, an LLM must not only be good at natural language and code understanding, but must also have a sufficient context length to ingest long interaction sequences that include retrieved and generated source code as well as execution results. In this regard, proprietary LLMs like GPT-4 significantly exceed the capabilities of current open-source alternatives.

Library vs Tool Descriptions

Here, we provide a side-by-side comparison of library and tool descriptions for the m&m's and M3ToolEval codebases. Note that the library descriptions are more concise and provide just enough information for the agent to search the codebase effectively without directly providing implementation details like the exact function/class names or arguments.

Codebase: M3ToolEval

Library Description

The codebase you will use is called m3eval. It has the following directory structure:

  • m3eval/ - Function for planning travel including finding flights, making hotel reservations, and budget calculations.
  • m3eval/ - Contains the WebBrowser class for navigating web pages.
  • m3eval/ - Various functions related to DNA sequencing.
  • m3eval/ - Functions for encoding and decoding messages, including converting hex string to ASCII and decoding Caesar cipher.
  • m3eval/ - Functions for currency conversion, calculating tariffs, etc.

Tool Description

The codebase you will use is called m3eval. It has the following files and functions implemented in those files:

  • file: m3eval/
    • convert_hex_to_ascii(hex_string: str) -> str: Converts a hexadecimal string to ASCII. Arguments: hex_string (str)
    • reverse_string(string: str) -> str: Reverses a string. Arguments: string (str)
    • caesar_decode(message: str, shift: int) -> str: Decodes a string using the Caesar cipher. Arguments: message (str), shift (int)
    • string_length(string: str) -> int: Finds the length of a string. Arguments: string (str)
    • minimum_value(*args) -> int/float: Finds the minimum value from given arguments. Arguments: *args (variable number of arguments)
    • maximum_value(*args) -> int/float: Finds the maximum value from given arguments. Arguments: *args (variable number of arguments)
  • file: m3eval/
    • count_nucleotides(dna_sequence: str) -> dict: Counts the occurrences of each nucleotide in a DNA sequence. Arguments: dna_sequence (str)
    • transcribe_dna_to_mrna(dna_sequence: str) -> str: Transcribes DNA sequence to mRNA. Arguments: dna_sequence (str)
    • translate_mrna_to_amino_acid(mrna_sequence: str) -> str: Translates mRNA sequence to a chain of amino acids. Arguments: mrna_sequence (str)
    • find_max_nucleotide(*args) -> (str, int): Returns the nucleotide (str) with the maximum count (int). Arguments: nucleotide_counts in the form of (k1, v1, k2, v2, ..., kn, vn)
    • is_valid_dna_sequence(dna_sequence: str) -> bool: Checks if the DNA sequence is valid. Arguments: dna_sequence (str)
    • reverse_transcribe_mrna_to_dna(mrna_sequence: str) -> str: Reverse transcribes mRNA sequence to DNA. Arguments: mrna_sequence (str)
  • file: m3eval/
    • convert_currency(base_price: float, conversion_rate: float) -> float: Converts the commodity price to local currency. Arguments: base_price (float), conversion_rate (float)
    • calculate_tariff(price: float, tariff_rate: float) -> float: Calculates the trade tariff based on the converted price. Arguments: price (float), tariff_rate (float, in %)
    • estimate_final_value(price: float, tariff: float) -> float: Estimates the final trade value including the tariff. Arguments: price (float), tariff (float)
    • calculator(expression: str) -> float: Evaluates the given expression and returns the result. Accepts a calculation expression as input. For example, "2 + (3 * 4)" will return 14.
    • find_minimum(*args: float) -> float: Finds the minimum value among the given arguments. Accepts a variable number of float arguments.
    • find_maximum(*args: float) -> float: Finds the maximum value among the given arguments. Accepts a variable number of float arguments.
  • file: m3eval/
    • find_flights(from_location: str, to_location: str, date: str) -> List[Dict]: Finds flights based on source, destination, and date. Arguments: from_location (str), to_location (str), date (str) in YYYY-MM-DD format. Returns a list of flights, each represented as a dictionary with keys "from_location", "to_location" (destination), "date", and "price". Example: [{"from_location": "A", "to_location": "B", "date": "2023-12-25", "price": 450}]
    • book_hotel(location: str, *preferences: str) -> List[Dict]: Books a hotel based on location and preferences. Arguments: location (str), *preferences (variable number of str arguments). Returns a list of hotels, each represented as a dictionary with keys "location", "preferences", "price_per_night", and "rating". Example: [{"location": "A", "preferences": ["wifi", "pool"], "price_per_night": 120, "rating": 4}]
    • budget_calculator(flight_price: float, hotel_price_per_night: float, num_nights: int) -> float: Calculates the total budget for a trip. Arguments: flight_price (float), hotel_price_per_night (float), num_nights (int). Returns the total budget (float).
  • file: m3eval/

    Note: To use the browser functions, first create a browser instance using browser = WebBrowser()

    • browser.click_url(url: str) -> str: Clicks on a URL. A clickable URL looks like [Clickable '<url_argument>'] in the webpage. Arguments: url (str). Returns the rendered content of the webpage after clicking the URL showing on the current rendered page.
    • browser.go_to_previous_page() -> str: Goes back to the previous page. It has no arguments. After going back to the previous page, returns the rendered content of the webpage.
    • browser.scroll_down() -> str: Scrolls down the view. It has no arguments. Returns the rendered content of the webpage after scrolling down.
    • browser.scroll_up() -> str: Scrolls up the view. It has no arguments. Returns the rendered content of the webpage after scrolling up.
    • browser.view() -> str: Returns the current view in string format of the rendered webpage. It has no arguments. You should call this when you want to see the rendered content of the current webpage.
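
To make the signatures in the tool description concrete, here is a minimal sketch of four of the listed functions, re-implemented from their one-line descriptions alone. The bodies are assumptions rather than the actual m3eval code; in particular, the direction of the Caesar shift (decoding subtracts `shift`) and the restriction of counts to the letters A, C, G, and T are guesses.

```python
def convert_hex_to_ascii(hex_string: str) -> str:
    """Convert a hexadecimal string to ASCII text."""
    return bytes.fromhex(hex_string).decode("ascii")


def caesar_decode(message: str, shift: int) -> str:
    """Decode a Caesar cipher by shifting letters back by `shift` (assumed direction)."""
    decoded = []
    for ch in message:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            decoded.append(chr((ord(ch) - base - shift) % 26 + base))
        else:
            decoded.append(ch)  # leave digits, spaces, and punctuation untouched
    return "".join(decoded)


def count_nucleotides(dna_sequence: str) -> dict:
    """Count occurrences of each nucleotide (assumed alphabet: A, C, G, T)."""
    return {nucleotide: dna_sequence.count(nucleotide) for nucleotide in "ACGT"}


def find_max_nucleotide(*args):
    """Take (k1, v1, k2, v2, ...) and return the (nucleotide, count) pair with the largest count."""
    pairs = list(zip(args[0::2], args[1::2]))
    return max(pairs, key=lambda pair: pair[1])


# Chain the message tools: hex string -> ASCII text -> Caesar-decoded text.
cipher_text = convert_hex_to_ascii("4b686f6f72")
plain_text = caesar_decode(cipher_text, 3)
print(cipher_text, plain_text)

# Chain the DNA tools: the dict of counts is flattened into the
# (k1, v1, k2, v2, ...) argument form that find_max_nucleotide expects.
counts = count_nucleotides("AGCTTTTCA")
flat_args = [item for pair in counts.items() for item in pair]
print(find_max_nucleotide(*flat_args))
```

The second chain shows why the argument convention in the description matters: the dictionary returned by `count_nucleotides` cannot be passed directly and has to be flattened first, which is the kind of detail an agent must learn either from a tool description or from reading the source.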

Codebase: m&m's

Library Description

The codebase you'll be using to solve user tasks is called mnm. It has a single file which contains functions for various image, text, and audio related tasks. Specifically, here's a high-level summary of available functions:

  • Text understanding functions: for tasks like text generation, summarization, classification, answering questions based on a text context
  • Image understanding functions: for tasks like image classification (1000 IMAGENET categories only), image captioning, answering questions about an image (use this if you have a question that can't be answered with IMAGENET classification), detecting objects (producing bounding boxes and labels for COCO categories) and segmenting objects (producing segmentation masks and labels for COCO categories), transcribing alphanumeric characters in an image to text (also known as optical character recognition).
  • Image editing functions: for generating images given a text description, editing images given the original image and a description of how to modify the image (can handle queries that require replacing or removing certain objects in the scene without detecting the object first), image cropping, or achieving effects like color pop and background blur given the segmented objects which you want to highlight in the image.
  • Information retrieval functions: for retrieving factual information or interesting facts about dates, years, numbers, movies, weather, geographical coordinates of a city, Wikipedia articles, or fun trivia. Also includes a love calculator for checking compatibility given two names.
  • Object centric functions: these are functions that accept a list of detected or segmented objects for tasks like counting, selecting an object, tagging an image with objects (drawing bounding boxes and labels), or replacing objects with emojis. Note that these functions here do not detect the objects themselves.
  • Audio understanding functions: for tasks related to audio understanding like speech recognition

Tool Description

The codebase you'll be using to solve user tasks is called mnm, this codebase consists of a single file called which contains the following functions:

  • text_generation(text) -> text: It takes an input text prompt and outputs a text that is most likely to follow the input text.
  • text_summarization(text) -> text: It takes a paragraph of text and summarizes it into a few sentences.
  • text_classification(text) -> text: It takes a text and classifies it into a category in the model's vocabulary (e.g., positive or negative based on its sentiment).
  • question_answering(text, question) -> text: It takes a text and a question, and outputs an answer to that question based on the text.
  • image_generation(text) -> image: It takes a text prompt and generates an image that matches the text description.
  • image_captioning(image) -> text: It takes an image and generates a text caption of the image.
  • optical_character_recognition(image) -> text: It takes an image and outputs recognized texts in the image.
  • image_classification(image) -> text: It takes an image and classifies the subject in the image into a category such as cat or dog.
  • image_editing(image, prompt) -> image: It takes an image and a text prompt and outputs a new image based on the text.
  • object_detection(image) -> image, objects: It takes an image and outputs rectangular bounding boxes of objects detected in the image.
  • image_segmentation(image) -> image, objects: It takes an image, segments it into different parts, and outputs segmentation masks of any shape for the parts.
  • automatic_speech_recognition(audio) -> text: It takes an audio file and produces a transcription of the audio.
  • visual_question_answering(image, question) -> text: It takes an image and a question about the image, and generates an answer to the question.
  • image_crop(image, object) -> image: It takes an image and 4 numbers representing the coordinates of a bounding box and crops the image to the region within the box.
  • image_crop_left(image) -> image: It takes an image, crops and keeps the left part of the image.
  • image_crop_right(image) -> image: It takes an image, crops and keeps the right part of the image.
  • image_crop_top(image) -> image: It takes an image, crops and keeps the top part of the image.
  • image_crop_bottom(image) -> image: It takes an image, crops and keeps the bottom part of the image.
  • background_blur(image, object) -> image: It takes an image and one or multiple objects in the foreground, and returns an image where the background is blurred.
  • color_pop(image, object) -> image: It takes an image and one or multiple objects, and returns an image where only the object is colored and the rest is black and white.
  • count(objects) -> number: It takes a list of objects and returns the count of the objects.
  • tag(image, objects) -> image: It takes an image and a list of objects with their bounding boxes and classes, and tags all the objects.
  • select_object(objects, object_name) -> object: It takes a list of objects, and selects the object based on the input object name.
  • emoji(image, object, emoji) -> image: It takes an image and the bounding box coordinates of one or multiple objects, and replaces the object with an emoji (e.g., angry/flushed/crying/dizzy/sleepy/grimacing/kissing/smiling_face, alien, ghost, goblin etc).
  • get_date_fact(date) -> text: It provides interesting facts about dates.
  • get_year_fact(year) -> text: It provides interesting facts about years.
  • get_math_fact(number) -> text: It provides interesting math facts about numbers.
  • get_trivia_fact(number) -> text: It provides interesting trivia facts about numbers.
  • love_calculator(first_name, second_name) -> number: Enter your name and the name of your partner/lover/crush to find love compatibility & chances of a successful love relationship.
  • get_location(city) -> lon, lat: Convert a city name or address to geographical coordinates using OpenStreetMap's Nominatim API.
  • search_movie(movie_title, movie_year) -> text: Retrieve basic movie information, including title, year, genre, and director.
  • get_weather(lon, lat) -> objects: Provides weather forecast data based on specific geographical coordinates.
  • wikipedia_simple_search(text) -> text: Perform a basic search query on Wikipedia to retrieve a summary of the most relevant page.
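
The object-centric functions above operate on detections produced elsewhere. The sketch below shows that hand-off with local stand-ins for `count` and `select_object`. It does not import mnm, and the object format (a dict with "label" and "bbox" keys) and the hand-written detections are assumptions for illustration only.

```python
def count(objects):
    """Return the number of objects in the list."""
    return len(objects)


def select_object(objects, object_name):
    """Return the first object whose label equals object_name, or None if absent."""
    for obj in objects:
        if obj["label"] == object_name:
            return obj
    return None


# Hand-written detections standing in for the output of an object detector.
detections = [
    {"label": "dog", "bbox": [10, 20, 110, 220]},
    {"label": "cat", "bbox": [150, 40, 260, 200]},
    {"label": "dog", "bbox": [300, 30, 420, 240]},
]

dogs = [obj for obj in detections if obj["label"] == "dog"]
print(count(dogs))
print(select_object(detections, "cat"))
```

Because these functions do not detect anything themselves, a query such as "how many dogs are in the image" needs a detection step first and then a filter-and-count step over its output.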


  author    = {Gupta, Tanmay and Weihs, Luca and Kembhavi, Aniruddha},
  title     = {Codenav: Beyond tool-use to using real-world codebases with LLM agents},
  year      = {2024},