Frequently Asked Questions

Installation & Setup

Do I need GPU acceleration to use this?

No. The extension runs on CPU alone, though GPU acceleration (CUDA or Metal) makes inference noticeably faster. For development tasks like autocomplete and refactoring, a modern CPU handles a 7B model such as Mistral 7B comfortably; a 13B model such as Llama 2 13B also works but is slower and needs more memory. We recommend at least 16GB of RAM for smooth performance.

Which local models are recommended?

Start with Mistral 7B or Llama 2 13B via Ollama. Mistral 7B runs on systems with 8GB of RAM, while Llama 2 13B needs about 16GB; with enough memory, both are fast enough for real-time IDE responses. For code understanding, Codestral 22B is excellent if you have 24GB+ RAM. You can switch models per workspace.

How do I install Ollama and download models?

Visit ollama.ai and install Ollama for your OS, then run ollama pull mistral to download the model. Our extension auto-detects Ollama at localhost:11434, Ollama's default address. For custom model paths or remote servers, check the Settings tab in VSCode.

Privacy & Security

Does this send my code to the cloud?

No. By default, all inference runs on your machine, and code leaves your device only if you explicitly configure a cloud model or a remote server. We log completions locally, solely to support undo/redo. There is no telemetry and no tracking.

What about secrets in my code?

Mask API keys and other credentials before they reach a prompt. We recommend storing secrets in environment variables and injecting them at runtime, rather than hard-coding them in text that is sent to the model. The model never sees your .env files unless you paste their contents into a prompt.
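As one way to mask credentials before text reaches a model, the sketch below redacts the values of assignments whose names look like secrets. It is not part of the extension: the function name mask_secrets, the placeholder, and the pattern are all illustrative assumptions, and the pattern deliberately errs toward over-masking (it would also redact something like token_count = 5).

```python
import re

# Illustrative pattern only (an assumption, not the extension's behavior):
# matches NAME = value or NAME: value where NAME looks like a credential.
SECRET_ASSIGNMENT = re.compile(
    r"\b([A-Z0-9_]*(?:API_?KEY|SECRET|TOKEN|PASSWORD)[A-Z0-9_]*)"  # 1: variable name
    r"(\s*[:=]\s*)"                                                # 2: separator
    r"([\"']?)"                                                    # 3: optional quote
    r"([^\s\"']+)"                                                 # 4: the secret value
    r"\3",                                                         # matching close quote
    re.IGNORECASE,
)


def mask_secrets(text: str, placeholder: str = "<REDACTED>") -> str:
    """Replace the value of every credential-like assignment with a placeholder."""

    def _redact(match: re.Match) -> str:
        name, sep, quote = match.group(1), match.group(2), match.group(3)
        return f"{name}{sep}{quote}{placeholder}{quote}"

    return SECRET_ASSIGNMENT.sub(_redact, text)
```

Calling mask_secrets on a snippet before pasting it into a prompt leaves ordinary code untouched and replaces only the matched values. A real deployment would likely need additional patterns (for example, for private key blocks).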

Is this safe for enterprise use?

Yes. Because everything runs locally, compliance teams can audit the exact model files in use and confirm that all inference stays on your network. There is no SaaS vendor involved and no data residency concern. Roll the extension out across your company by distributing model files through your usual internal channels.

Performance & Limitations

Why is the first completion slow?

The model is loaded into memory on first use. Mistral 7B typically takes 2-3 seconds to load on CPU and under 1 second on GPU. Subsequent completions skip this step and respond much faster. Restarting VSCode triggers the load again on the next completion.
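If the first-completion delay matters, one option is to warm the model up yourself through Ollama's HTTP API, independently of the extension. Ollama's /api/generate endpoint loads a model when it receives a request with no prompt, and its keep_alive parameter controls how long the model stays in memory afterwards (Ollama's default is 5 minutes). The sketch below assumes the default endpoint and a model named mistral; the helper names and the 30m value are illustrative choices.

```python
import json
import urllib.request


def build_preload_request(
    model: str,
    base_url: str = "http://localhost:11434",
    keep_alive: str = "30m",
) -> urllib.request.Request:
    """Build a POST to /api/generate that loads `model` without generating text."""
    # Omitting "prompt" asks Ollama to load the model and return immediately.
    body = json.dumps({"model": model, "keep_alive": keep_alive}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preload(model: str) -> bool:
    """Send the preload request; return False if Ollama cannot be reached."""
    try:
        with urllib.request.urlopen(build_preload_request(model), timeout=120) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, timeouts, and HTTP errors
        return False
```

Running preload("mistral") from a login script or a terminal before opening the editor should make the first completion behave like later ones, provided the model is already downloaded.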

Can I use this for very long files?

Every model has a fixed context window (Mistral 7B: 8K tokens, Llama 2 13B: 4K tokens). For files larger than the context window, we send only the relevant section around your cursor plus a summary of earlier code. This keeps responses relevant without exceeding the limit.
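To illustrate the idea of a cursor-centered window, here is a minimal sketch that grows a range of lines outward from the cursor until a token budget is used up. It is not the extension's actual algorithm: the function names are hypothetical, the 4-characters-per-token estimate is a rough assumption rather than a real tokenizer, and the summary of earlier code is left out.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate of about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def window_around_cursor(lines: list[str], cursor: int, budget: int) -> tuple[int, int]:
    """Return (start, end) such that lines[start:end] surrounds `cursor`
    and its estimated token count stays within `budget`.

    The cursor line is always included, even if it alone exceeds the budget.
    """
    start, end = cursor, cursor + 1
    used = estimate_tokens(lines[cursor])
    grew = True
    while grew:
        grew = False
        # Try one more line above the window.
        if start > 0:
            cost = estimate_tokens(lines[start - 1])
            if used + cost <= budget:
                start -= 1
                used += cost
                grew = True
        # Try one more line below the window.
        if end < len(lines):
            cost = estimate_tokens(lines[end])
            if used + cost <= budget:
                end += 1
                used += cost
                grew = True
    return start, end
```

Alternating between the line above and the line below keeps the window roughly centered on the cursor; a tool that cares more about preceding code could weight the upward direction instead.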

Does it work offline?

Yes, completely. Once your model is downloaded and Ollama is running, you never need an internet connection, which makes this a good fit for flights and air-gapped networks. The extension performs no online check-ins.

Usage & Customization

How do I customize the system prompt?

Open the extension settings in VSCode and search for "Local LLM Prompt". There you can add language-specific instructions, coding standards, or domain knowledge. Changes apply immediately to new completions.

Can I use this with multiple projects?

Yes. Use workspace settings to set different models or prompts per project folder. A TypeScript monorepo can use one model while a Python project uses another. Settings are stored in your workspace .vscode/settings.json.
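For teams that script their project setup, the sketch below merges per-project overrides into a workspace's .vscode/settings.json. The setting keys shown in the usage note (localLLM.model, localLLM.systemPrompt) are hypothetical placeholders, not the extension's real identifiers; check the extension's settings page for the actual names. The helper name is also an assumption.

```python
import json
from pathlib import Path


def write_workspace_settings(project_dir: Path, overrides: dict) -> Path:
    """Merge `overrides` into <project_dir>/.vscode/settings.json and return its path."""
    path = project_dir / ".vscode" / "settings.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # Caveat: json.loads rejects comments, which VSCode tolerates in settings files.
    current = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    current.update(overrides)  # new values win over existing ones
    path.write_text(json.dumps(current, indent=2) + "\n", encoding="utf-8")
    return path
```

For example, write_workspace_settings(Path("my-python-project"), {"localLLM.model": "mistral"}) would pin one project to a model while other projects keep their own values, again assuming that hypothetical key.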

Does this interfere with GitHub Copilot?

No. This extension registers as a separate completion provider, so the two don't conflict. You can disable Copilot's completions and rely on local models, or disable this extension and keep Copilot. To see which one is active, check the Completions provider in VSCode settings.

Troubleshooting

Extension doesn't find Ollama

Make sure Ollama is running: start it with ollama serve in a terminal. Then check that localhost:11434 is reachable. If you use a remote Ollama server, set its endpoint in Settings. Restart VSCode after making changes.
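To check reachability outside of VSCode, a small script can query Ollama's GET /api/tags endpoint, which lists the locally installed models. This is a minimal sketch assuming the default address; the function names are illustrative and not part of the extension.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return [model["name"] for model in payload.get("models", [])]


def probe_ollama(base_url: str = "http://localhost:11434", timeout: float = 3.0):
    """Return the installed model names, or None if Ollama is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(json.load(resp))
    except (OSError, ValueError):  # connection problems or a non-JSON reply
        return None
```

A result of None suggests Ollama is not running or the address is wrong; an empty list means the server is up but no model has been pulled yet. Pass a different base_url to test a remote server.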

Completions are poor quality

Try a larger model (for example, Mistral 7B instead of Phi-2 at 2.7B). Improve your system prompt with clear coding guidelines. For specialized languages, fine-tuning your local model on domain examples can produce better results than a generic model.

How do I report a bug?

Open an issue on our GitHub repository with: VSCode version, model name, the exact prompt that failed, and the response. Logs are in VSCode's Output panel under "Local LLM".