Using LLMs to Build A Code Generation Dataset


In this article, you'll learn how to build a code generation dataset using Large Language Models (LLMs). This tutorial will guide you through scraping code data, cleaning it from various artifacts, and refining it into a structured dataset. We'll navigate the challenges and demonstrate the power of LLMs in transforming raw code into a valuable resource for code generation.

The Journey of Scraping Code

Scraping code involves extracting code snippets from various online sources. However, this raw code often contains unwanted elements like line numbers, HTML tags, or commentary that aren't part of the actual code. Let's start by scraping some code using Python's requests library.

import requests
# URL of the webpage where the code is located
url = ''
# Sending a GET request to the URL
response = requests.get(url)
# Checking if the request was successful
if response.status_code == 200:
    # Extracting the text content from the response
    raw_html = response.text
    print("html fetched successfully:")
    print(f"Failed to retrieve page. Status code: {response.status_code}")

Extracting Code from HTML

After fetching the raw HTML content using Python's requests library, the next challenge is to convert this HTML into a more readable, text-only format. This is where the html2text library becomes useful. It converts HTML into Markdown, which is much closer to the plain text we want. Here's how you can use html2text to achieve this:

import html2text
# Initialize html2text converter
converter = html2text.HTML2Text()
converter.ignore_links = True  # Optionally, ignore converting links
converter.mark_code = True  # wrap code in [code]...[/code] tags
# Converting HTML to Markdown
markdown_text = converter.handle(raw_html)
print("Converted Markdown text:")
code_blocks = re.findall(r'\[code\](.*?)\[/code\]', markdown_text, re.DOTALL)
print("Code blocks extracted:")

Introducing LLMs in the Cleaning Process

The Initial Challenge

Consider the scraped code snippet in our example. It includes line numbers and annotations irrelevant to the actual code. Cleaning this manually is tedious and error-prone.


    curl \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "OpenAI-Organization: YOUR_ORG_ID"

Using LLMs for Initial Cleaning

We can use LLMs to remove these artifacts. Here's a simplified version of how we might use an LLM like GPT-3.5-turbo:

import openai
# The code snippet with artifacts
example = """[code] 1 2 3 4...[/code]"""
# Basic prompt for the LLM
prompt = f"Please clean the following code snippet by removing all non-code elements like line numbers and annotations:\n{example}"
# Send the request to the OpenAI API
response = openai.Completion.create(
cleaned_code = response.choices[0].text.strip()
print("Cleaned Code:")

This script sends our code snippet to the LLM, which returns a cleaner version.

    curl \
      -H "Authorization: Bearer $OPENAI_API_KEY" \
      -H "OpenAI-Organization: YOUR_ORG_ID"

Cost Considerations

Using GPT-3.5-turbo for tasks like code cleaning or parsing documentation webpages, the expected cost per generation with this model ranges from $0.30 to $0.75 per 1000 generations depending on the code snippet length. While this might seem minimal at a glance, the expense can accumulate significantly in large-scale projects.

If a project entails parsing 100,000 snippets, at the higher end ($0.75 per 1000 generations), this cost escalates to $75. While this isn't wildly expensive, it's important to assess whether the complexity of the task justifies the use of such a high-end model.


In this post, we've seen how LLMs can be used to clean code scraped from the web. We've also seen how the cost of using such models can add up quickly. In my next post, I'll show you how to use a cheaper model to achieve the same results.

Get notified when I publish new articles.

    Unsubscribe at any time