"LLMs know how to code better than you": true or false? | Episode 3
Published on 07/22/2024 by Kevin Séjourné, Senior R&D Engineer at Cloud Temple

Diving into AI: A series of 3 episodes

Hi, I'm Kevin Séjourné, PhD in computer science and senior R&D engineer at Cloud Temple. As you can imagine, I've been writing a lot of code over the last 20 years. As a passionate explorer of LLMs, I've realised that they can now write code for me. So much the better! But as I'm used to relying on scientific observations, I decided to test the quality of their work.


Watch the 3 episodes of my study:

Episode 3: Completing the initial programme, testing, testing harder

In this article, we'll explore how we completed the backend program generated with the LLM, tested it, and optimised its deployment with Docker. We'll also look at how to use GPT4o and other tools to generate code more efficiently.

CI/CD: Using Docker with GPT4o

For a backend programme, a Docker-based deployment is the standard before handing over to DevOps. Let's ask GPT4o to create a Dockerfile adapted to our programme.

It proposes an image based on alpine:latest. Why not? However, compiling native code for alpine requires musl as the standard C library, since glibc is absent on alpine.

musl does not pose any particular problem, but GPT4o omitted the option --target x86_64-unknown-linux-musl in the cargo build command. Without this option, the code would have been impossible to compile in the alpine image. If we hadn't caught this problem, we'd have been guaranteed many hours of head-scratching before figuring out why the Docker container doesn't work even though the programme runs.
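For reference, the fix boils down to targeting musl explicitly when building; a minimal sketch of the commands involved (inside or outside the Dockerfile) looks like this:

```
rustup target add x86_64-unknown-linux-musl
cargo build --release --target x86_64-unknown-linux-musl
```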

Why avoid full Linux distributions?

Using a complete Linux distribution for a standalone program is often excessive. An image built from scratch is lighter and better suited to our specific needs. However, it is essential to be precise about the term from scratch to avoid misunderstandings: GPT4o interprets this request as recreating a new Dockerfile from scratch, instead of creating an image starting with FROM scratch.

To get around this problem, we give GPT4o a minimal Dockerfile containing just the build stage and the run stage, and we ask it to complete it so that the Dockerfile compiles the programme with cargo and launches it.

File: Dockerfile
FROM clux/muslrust:stable as builder
FROM scratch

Prompt: Can you correct my Dockerfile so that it compiles my program with cargo and launches it correctly?

This operation can be carried out in a new discussion with GPT4o if required. There is no need to keep the whole context: the Dockerfile is standard enough that we can ask for it to be adapted with the real names in a second generation pass.

SSL certificate problems and solutions

A few docker build / docker run cycles and exchanges with GPT4o later, and here we are with a complete product. Note that a lot of time was wasted on a small detail concerning SSL certificates that GPT4o was unable to resolve. The following line helped us:

COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/

A Google search was needed to find it. This is probably because Rust is not so widespread in GPT4o's training data, and neither are from scratch images.

The LLM failed on one key detail.
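Pieced together, the completed Dockerfile looks roughly like the sketch below. The crate/binary name backend and the exposed port are assumptions for illustration; the certificate line is the one mentioned above.

```
# Build stage: the musl toolchain produces a fully static binary
FROM clux/muslrust:stable as builder
WORKDIR /app
COPY . .
RUN cargo build --release --target x86_64-unknown-linux-musl

# Run stage: empty base image, only the binary and the CA certificates
FROM scratch
# Hypothetical binary name "backend"; use the real crate name here
COPY --from=builder /app/target/x86_64-unknown-linux-musl/release/backend /backend
COPY --from=builder /etc/ssl/certs/ca-certificates.crt /etc/ssl/certs/
EXPOSE 8000
ENTRYPOINT ["/backend"]
```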

Functional, Unit, Integration

For the functional tests, we can ask the model to generate the JSON normally inserted in the body of the HTTP request. To do this, we build a prompt consisting of an example JSON file and tell it which variables to modify. This part cannot be entrusted to GPT4o for confidentiality reasons, but it is a simple enough job to be done by llama3 7b Q4.

We can then quickly have JSON examples. It is possible to vary the format of numbers and their encapsulation in single/double quotes; the only limit is your imagination.
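As an illustration (the field names here are hypothetical, not the real alert payload), two generated variants might differ only in how the numbers are encoded and quoted:

```
{"alert": {"value": 3.5, "id": 42}}
{"alert": {"value": "3.5", "id": "42"}}
```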

The only problem with functional testing is that very often we have to go back to the code-correction stage after a test. With this method, where the code is generated or semi-generated, we don't write unit tests for the functions in an integrated framework; we went straight to integration testing. There's no algorithm in our programme complicated enough to have twisted behaviour.

Performance

The easiest way to test the performance of a fast program is to call it several times and calculate its average running speed. The problem is that a loop of curl calls like this one:

while true; do time -p curl -i -X POST --data @testfile.json -H "User-Agent: while" -H "authorization: X-API-KEY tokensecret" -H "content-type: application/json" -H "Accept-Encoding: gzip, deflate, br" http://localhost:8000/grafana/alert ; done

launches a new process at each run, and the total execution time of our programme is shorter than that. The mockup has already moved part of the slowness problem outside our main programme: the mockup does not interact with a DB or with the network.

We therefore need to replace the while true / curl loop with a faster program. Let's ask GPT4o to generate it from the test command line we have, and let's tell it not to use the curl command. We obtain this program:

```
use reqwest::Client;
use std::time::Instant;

#[async_std::main]
async fn main() {
    // The test payload is compiled into the binary
    let data = include_str!("testfile.json");
    let json_data: serde_json::Value = serde_json::from_str(data).unwrap();
    let client = Client::new();
    loop {
        // `start` is not used yet; timing is added in the next version
        let start = Instant::now();
        let response = client.post("http://localhost:8000/grafana/alert")
            .header("User-Agent", "while")
            .header("authorization", "X-API-KEY tokensecret")
            .header("content-type", "application/json")
            .header("Accept-Encoding", "gzip, deflate, br")
            .body(json_data.to_string())
            .send()
            .await
            .expect("failed to get response");
        println!("Status: {} ", response.status());
    }
}
```

Once the program has been compiled and tested to ensure that it works properly, we ask GPT4o to add, step by step, a count of the number of iterations, a calculation of the execution time of each post, an average execution time and a rotation on the test files:

```
use reqwest::Client;
use serde_json::Value as JsonValue;
use std::fs::read_to_string;
use std::time::Instant;

// Load one of the test payloads from disk and parse it
async fn load_json_data(file_path: &str) -> JsonValue {
    let data = read_to_string(file_path).expect("Unable to read file");
    serde_json::from_str(&data).expect("Unable to parse JSON")
}

#[async_std::main]
async fn main() {
    let mut iteration = 0;
    let mut total_duration = 0;
    let client = Client::new();

    loop {
        iteration += 1;
        let json_index = (iteration - 1) % 9 + 1; // rotate over testfile1.json..testfile9.json
        let file_path = format!("testfile{}.json", json_index);
        let json_data = load_json_data(&file_path).await;

        let start = Instant::now();
        let response = client
            .post("http://localhost:8000/grafana/alert")
            .header("User-Agent", "while")
            .header("authorization", "X-API-KEY tokensecret")
            .header("content-type", "application/json")
            .header("Accept-Encoding", "gzip, deflate, br")
            .body(json_data.to_string())
            .send()
            .await
            .expect("failed to get response");
        let duration = Instant::now() - start;
        total_duration += duration.as_nanos();
        println!(
            "Iteration: {} Status: {} Duration: {:.2?} \tmed:{}",
            iteration,
            response.status(),
            duration.as_nanos(),
            total_duration / iteration
        );
    }
}
```

We asked GPT4o to add the rotation to the test files long after we had asked it to count the iterations. We're pleased to see that it didn't recreate a variable, but instead used the variable iteration to create the rotation by applying a modulo.

Code formatting is an artefact of copying and pasting into VSCode (_Rust-analyser_) and then into the markdown editor.

We can now carry out performance tests between the original version and our final programme. The performance tests were carried out on a laptop with an Intel i7 (12th gen). Each test was run over 10,000 iterations. The average execution time of our Python programme is 3,791,274 ns (nanoseconds), while our new Rust programme is about 3.9 times faster.

We suspect that a large part of the measured figures is limited by the mockup program, which cannot respond quickly enough for our new programme. Let's rewrite it in Rust too. We tried with llama3 70b on groq.com, but it was very difficult to obtain a simple working program. The same prompt given to GPT4o produced a correct program on the first try, with only a few unused imports:

Rewrite the following program in Rust
Uses "actix
Does not use TcpListener::bind
Does not create a new "struct
Replaces static ids 666 & 999 with randomly generated numbers
from fastapi import FastAPI, Request
app = FastAPI()
@app.post("/v1/monitoringServices/notifications")
async def monitoring_service():
    return {"data": {"id": "999"}}
[...]
@app.get("/v1/hosts/{host_id}/services")
async def host_service(host_id: int, request: Request):
    return {"data": [{"id": "666"}]}


This is an approach in which we leave the prompt in the original Python program (as a comment) to document what became of that program. The generated program is then placed in a folder next to it.

The details we give, "Uses actix" and "Does not use TcpListener::bind", come from our experience with llama3 70b.

The resulting programme meets the specification and is not really any longer than its Python counterpart.
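The full generated source is not reproduced in this article, but a minimal sketch of what such an actix-web mockup might look like is given below. The routes and the two replaced ids come from the Python mockup above; the listening port, the id range and the crate versions (actix-web 4, rand 0.8) are assumptions.

```
use actix_web::{get, post, web, App, HttpServer, Responder};
use rand::Rng;
use serde_json::json;

#[post("/v1/monitoringServices/notifications")]
async fn monitoring_service() -> impl Responder {
    // Random id instead of the static 999 of the Python version
    let id = rand::thread_rng().gen_range(1..100_000);
    web::Json(json!({ "data": { "id": id.to_string() } }))
}

#[get("/v1/hosts/{host_id}/services")]
async fn host_service(path: web::Path<u32>) -> impl Responder {
    let _host_id = path.into_inner();
    // Random id instead of the static 666 of the Python version
    let id = rand::thread_rng().gen_range(1..100_000);
    web::Json(json!({ "data": [ { "id": id.to_string() } ] }))
}

#[actix_web::main]
async fn main() -> std::io::Result<()> {
    HttpServer::new(|| App::new().service(monitoring_service).service(host_service))
        .bind(("127.0.0.1", 8001))? // port is an assumption
        .run()
        .await
}
```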

We can then run our tests again. The average execution time with our Python programme is 2,916,122 ns, while the average execution time of our new Rust programme is 250,992 ns, 11.6 times faster. The Docker image size drops from 195 MB for the Python version, based on a python:alpine image, to 13.4 MB for the standalone programme, based on a scratch image.

Lighter, faster, with a smaller attack surface. We were able to transform and test a programme more easily than we would have done by hand.

Conclusions

Not all LLMs are created equal. Between speed, performance and confidentiality, you always have to ask yourself a few questions: what are we doing, and why?

Speed: groq.com, 1,200 tokens/s with llama3 7b

Performance: GPT4o on the AI portal developed by Cloud Temple; answers require less post-processing.

Confidentiality: run locally, with llama3 or mistral/mixtral depending on your setup.

The transformation took us between 3 and 4 working days, depending on how you count the interruptions.

Estimated development time

Without LLM: 12 days

With LLM: 4 days

We would never have embarked on such a transformation in a language with so little expertise if we hadn't had powerful language models to help us. Recruiting a developer on the basis of the programming languages they have mastered no longer seems relevant at all.

Limitations and lessons learned

Although LLMs offer significant gains, they are not infallible. Manual reviews remain essential to guarantee the quality and security of the code produced.

We have good expertise in many programming languages, but not in Rust. We can say that if we had had to rewrite this program in Java or JavaScript, it would have taken at least 3 times longer.

Using LLMs triples our code productivity. Code production is not a developer's only activity, but it's still a good deal. And in terms of environmental footprint, the energy consumed by these LLMs writing code for us is much lower than the coffee that would have been consumed by tripling the working time.
