This guide demonstrates how to use Codegen to generate high-quality training data for large language models (LLMs) by extracting function implementations along with their dependencies and usages. The approach is analogous to word2vec or node2vec: given the context in which a function appears, learn to predict its implementation.
This example works with both Python and TypeScript repositories without modification.
Overview
The process involves three main steps:
- Finding all functions in the codebase
- Extracting their implementations, dependencies, and usages
- Generating structured training data
Let’s walk through each step using Codegen.
Step 1: Finding Functions and Their Context
First, we will do a "graph expansion" for each function: grab the function's source, as well as the full source of all of its dependencies and usages.
Let's start by importing the types we need from Codegen:
import json

import codegen
from codegen import Codebase
from codegen.sdk.core.external_module import ExternalModule
from codegen.sdk.core.import_resolution import Import
from codegen.sdk.core.symbol import Symbol
Here’s how we get the full context for each function:
def get_function_context(function) -> dict:
"""Get the implementation, dependencies, and usages of a function."""
context = {
"implementation": {"source": function.source, "filepath": function.filepath},
"dependencies": [],
"usages": [],
}
# Add dependencies
for dep in function.dependencies:
# Hop through imports to find the root symbol source
if isinstance(dep, Import):
dep = hop_through_imports(dep)
context["dependencies"].append({"source": dep.source, "filepath": dep.filepath})
# Add usages
for usage in function.usages:
context["usages"].append({
"source": usage.usage_symbol.source,
"filepath": usage.usage_symbol.filepath,
})
return context
Notice how we use hop_through_imports to resolve dependencies. When working with imports, symbols can be re-exported multiple times. For example, a helper function might be imported and re-exported through several files before being used. We need to follow this chain to find the actual implementation:
def hop_through_imports(imp: Import) -> Symbol | ExternalModule:
"""Finds the root symbol for an import."""
if isinstance(imp.imported_symbol, Import):
return hop_through_imports(imp.imported_symbol)
return imp.imported_symbol
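For instance, hop_through_imports would resolve a chain like the one below (hypothetical files) back to the original definition in utils.py:
# utils.py: the root definition
def helper(data: str) -> bool:
    return bool(data)

# helpers/__init__.py: re-exports helper
from utils import helper

# api.py: imports the re-export; following the import chain lands on utils.py
from helpers import helper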
This creates a structured representation of each function’s context:
{
"implementation": {
"source": "def process_data(input: str) -> dict: ...",
"filepath": "src/data_processor.py"
},
"dependencies": [
{
"source": "def validate_input(data: str) -> bool: ...",
"filepath": "src/validators.py"
}
],
"usages": [
{
"source": "result = process_data(user_input)",
"filepath": "src/api.py"
}
]
}
Step 2: Processing the Codebase
Next, we process all functions in the codebase to generate our training data:
def run(codebase: Codebase):
"""Generate training data using a node2vec-like approach for code embeddings."""
# Track all function contexts
training_data = {
"functions": [],
"metadata": {
"total_functions": len(codebase.functions),
"total_processed": 0,
"avg_dependencies": 0,
"avg_usages": 0,
},
}
# Process each function in the codebase
for function in codebase.functions:
# Skip if function is too small
if len(function.source.split("\n")) < 2:
continue
# Get function context
context = get_function_context(function)
# Only keep functions with enough context
if len(context["dependencies"]) + len(context["usages"]) > 0:
training_data["functions"].append(context)
# Update metadata
training_data["metadata"]["total_processed"] = len(training_data["functions"])
if training_data["functions"]:
training_data["metadata"]["avg_dependencies"] = sum(
len(f["dependencies"]) for f in training_data["functions"]
) / len(training_data["functions"])
training_data["metadata"]["avg_usages"] = sum(
len(f["usages"]) for f in training_data["functions"]
) / len(training_data["functions"])
return training_data
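Depending on your quality bar, you may want to filter more aggressively than the size check above. The heuristics below are our own suggestions, not part of Codegen's API; only the function.filepath and function.source attributes already used above are assumed:
def should_include(function) -> bool:
    """Heuristic quality filter for training examples (illustrative, tune to taste)."""
    # Skip test code, which tends to be repetitive and low-signal
    if "test" in function.filepath:
        return False
    # Skip very long functions that may exceed the model's context window
    if len(function.source.split("\n")) > 200:
        return False
    return True
Calling should_include(function) at the top of the loop in run() keeps the rest of the pipeline unchanged.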
Step 3: Running the Generator
Finally, we can run our training data generator on any codebase.
if __name__ == "__main__":
print("Initializing codebase...")
codebase = Codebase.from_repo("fastapi/fastapi")
print("Generating training data...")
training_data = run(codebase)
print("Saving training data...")
with open("training_data.json", "w") as f:
json.dump(training_data, f, indent=2)
print("Training data saved to training_data.json")
This will:
- Load the target codebase
- Process all functions
- Save the structured training data to a JSON file
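Codebase.from_repo clones and parses a repository from GitHub. If you already have a local checkout, you can point the constructor at a directory instead; this is a minimal sketch, assuming your SDK version accepts a local path:
# Parse a local checkout instead of cloning from GitHub
codebase = Codebase("path/to/local/repo")
training_data = run(codebase)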
Using the Training Data
The generated data can be used to train LLMs in several ways:
- Masked Function Prediction: Hide a function’s implementation and predict it from dependencies and usages
- Code Embeddings: Generate embeddings that capture semantic relationships between functions
- Dependency Prediction: Learn to predict which functions are likely to be dependencies
- Usage Pattern Learning: Train models to understand common usage patterns
For example, to create a masked prediction task:
def create_training_example(function_data):
"""Create a masked prediction example from function data."""
return {
"context": {
"dependencies": function_data["dependencies"],
"usages": function_data["usages"]
},
"target": function_data["implementation"]
}
# Create training examples
examples = [create_training_example(f) for f in training_data["functions"]]
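To feed these examples into a typical supervised fine-tuning pipeline, serialize each one as a prompt/completion pair in JSONL. The prompt template below is illustrative, not something Codegen prescribes:
def to_jsonl(examples, path="masked_prediction.jsonl"):
    """Write one prompt/completion pair per line (illustrative template)."""
    with open(path, "w") as f:
        for ex in examples:
            deps = "\n\n".join(d["source"] for d in ex["context"]["dependencies"])
            uses = "\n\n".join(u["source"] for u in ex["context"]["usages"])
            record = {
                "prompt": f"# Dependencies\n{deps}\n\n# Usages\n{uses}\n\n# Implement the function:\n",
                "completion": ex["target"]["source"],
            }
            f.write(json.dumps(record) + "\n")

to_jsonl(examples)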