The Origin

The story starts like this: PKU offers a course titled “Music and Mathematics,” and its final project requires us to develop two segments of randomly generated music into complete pieces.

I had assumed this task would be easy, until we actually started working on it and realized that no one in our group could write sheet music, and no one owned a keyboard or a guitar to record with. So we began trying out various AI tools (honestly, we had been counting on AI from the very beginning).

Here was the problem: almost all current music AIs generate audio directly, and transcribing audio into sheet music is inherently difficult. I gave Suno AI a main melody and asked it to generate a piece from it, but even a casual listen revealed at least ten voice parts. After I separated the tracks and tried to convert them into a MIDI file, the resulting sheet music was bizarre and irregular, and the melody was completely wrong.

So this is a general problem with current music AIs: the audio they generate sounds great, but converting it back into sheet music is extremely difficult.

Later, I discovered something called AIVI, which was rumored to generate sheet music directly. However, its quality was disappointing: the music it generated from the main melody I fed it had nothing to do with that melody, so it was a total failure.

The Core Technology

Then, by accident, a turning point appeared: while I was painstakingly clicking in notes one by one in MuseScore to write the music by hand, I noticed a save format called MusicXML. After saving, I found that the file is actually a plain text file in XML format. In other words, a text model can read this file directly and, likewise, generate it.

Trusting in the powerful GPT-5.5, I opened Codex and began to experiment. Below is my very first prompt:

Please read this musicxml file. This is a demo of a main melody. Please use this melody as the primary theme and develop it into a complete piece of instrumental music. It needs to include multiple voice parts, accompanied by chords. You can expand or transform the main melody to some extent, but this main melody must be highlighted in the final piece. Please directly output a listenable version of the musicxml file.

Then Teacher Codex actually generated a playable MusicXML file, and it wasn’t half bad.

Everything after that was simple: I just needed to optimize the listening experience through continuous interaction with Codex.

Codex is actually very good at writing things like bass, pads, drums, and chords, which aligns perfectly with our intuition about large language models: after all, these elements are quite mechanical.

So the core technology is simply this: there is a text format called MusicXML, so text models can write music too.
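To make "a score is just text" concrete, here is a minimal sketch that builds a one-measure MusicXML score with Python's standard library. The `build_score` helper, the part name "Lead", and the four example notes are illustrative assumptions, not anything Codex produced in the experiment.

```python
import xml.etree.ElementTree as ET

def build_score(pitches):
    """Build a one-part, one-measure MusicXML score of quarter notes.

    `pitches` is a list of (step, octave) tuples such as ("C", 4).
    """
    score = ET.Element("score-partwise", version="4.0")
    part_list = ET.SubElement(score, "part-list")
    score_part = ET.SubElement(part_list, "score-part", id="P1")
    ET.SubElement(score_part, "part-name").text = "Lead"  # hypothetical name

    part = ET.SubElement(score, "part", id="P1")
    measure = ET.SubElement(part, "measure", number="1")

    attributes = ET.SubElement(measure, "attributes")
    ET.SubElement(attributes, "divisions").text = "1"  # 1 division per quarter note
    time_sig = ET.SubElement(attributes, "time")
    ET.SubElement(time_sig, "beats").text = str(len(pitches))
    ET.SubElement(time_sig, "beat-type").text = "4"

    for step, octave in pitches:
        note = ET.SubElement(measure, "note")
        pitch = ET.SubElement(note, "pitch")
        ET.SubElement(pitch, "step").text = step
        ET.SubElement(pitch, "octave").text = str(octave)
        ET.SubElement(note, "duration").text = "1"
        ET.SubElement(note, "type").text = "quarter"
    return score

score = build_score([("C", 4), ("E", 4), ("G", 4), ("C", 5)])
xml_text = ET.tostring(score, encoding="unicode")
print(xml_text)
```

Everything in the output is ordinary text, which is exactly why a language model can read and write it. A file meant for a notation program would normally also carry an XML declaration, a key signature, and a clef, all omitted here for brevity.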

Technical Summary

Below is some content I asked Codex to summarize based on our conversation, just for fun:

Recently, I conducted a rather interesting experiment: using a text model to assist in creating a piece of music.

The “text model” mentioned here is not dedicated AI arrangement software, nor is it the kind of tool where you input a prompt and it automatically spits out an MP3. Instead, it is a model like Codex, which primarily excels at understanding text, writing code, and modifying files. But precisely because musical notation can be represented as structured text, such as MusicXML, parsed MIDI events, or the rules in a generation script, such a model can actually take part in music creation.

This article aims to document the process of this experiment: how I picked a theme from a randomly generated melody and, together with the text model, developed it into a relatively complete short piece of instrumental music.

1. Music Can Also Be a Form of “Text”

When many people first think of AI-generated music, they think of generating audio directly. But this time, I took a different path:

Melody Material -> MIDI / MusicXML -> Text Model Analysis and Modification -> Sheet Music Generation -> Playback Audition -> Continued Feedback

In other words, the model does not directly “sing” the music out; rather, it creates by generating and modifying sheet music files.

A MusicXML file is essentially an XML document that records information such as notes, durations, voice parts, instruments, chords, and dynamics. A MIDI file is binary, but it can be parsed into note events. To a text model, these things are not natural language, but they are still structured text. With the help of scripts, they can be read, analyzed, and generated.
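As a rough illustration of "read and analyzed with scripts", the sketch below turns the notes of a small MusicXML fragment into (measure, MIDI pitch, length) tuples. The `SAMPLE` fragment and the `note_events` helper are made up for this example; the sketch assumes a single part and ignores chords, ties, and grace notes.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<score-partwise version="4.0">
  <part-list><score-part id="P1"><part-name>Lead</part-name></score-part></part-list>
  <part id="P1">
    <measure number="1">
      <attributes><divisions>2</divisions></attributes>
      <note><pitch><step>C</step><octave>4</octave></pitch><duration>2</duration></note>
      <note><rest/><duration>1</duration></note>
      <note><pitch><step>F</step><alter>1</alter><octave>4</octave></pitch><duration>1</duration></note>
    </measure>
  </part>
</score-partwise>"""

STEP_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def note_events(xml_text):
    """Return (measure_number, midi_pitch_or_None, length_in_quarter_notes) tuples."""
    root = ET.fromstring(xml_text)
    events = []
    divisions = 1  # duration units per quarter note, updated when the score declares it
    for measure in root.iter("measure"):
        declared = measure.find("attributes/divisions")
        if declared is not None:
            divisions = int(declared.text)
        for note in measure.findall("note"):
            length = int(note.findtext("duration", "0")) / divisions
            pitch = note.find("pitch")
            if pitch is None:  # a rest has no <pitch> element
                midi = None
            else:
                step = pitch.findtext("step")
                alter = int(pitch.findtext("alter", "0"))  # +1 sharp, -1 flat
                octave = int(pitch.findtext("octave"))
                midi = 12 * (octave + 1) + STEP_TO_SEMITONE[step] + alter
            events.append((measure.get("number"), midi, length))
    return events

print(note_events(SAMPLE))
# [('1', 60, 1.0), ('1', None, 0.5), ('1', 66, 0.5)]
```

A flat list like this is much easier to analyze, for a human or a model, than raw XML, and it can be turned back into notes once it has been modified.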

This gives us a very interesting way to create: humans are responsible for listening, judging, and proposing directions, while the model is responsible for quickly implementing the modifications on the score.

2. Picking a “Seed” from Random Music

The raw material for this experiment was a segment of randomly generated “pink” music, that is, music derived from pink (1/f) noise.
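The source does not say how the random material was generated. Assuming "pink" refers to pink (1/f) noise, one well-known recipe is Voss's dice algorithm, sketched below in a simplified form. The function names, the number of dice, and the mapping onto a C major scale are all illustrative choices, not the course's actual method.

```python
import random

def voss_pink_sequence(length, num_sources=4, sides=6, seed=0):
    """Simplified Voss dice algorithm.

    Source k is re-rolled every 2**k steps, so some sources change quickly
    and others slowly. The sum of all sources is the output value.
    """
    rng = random.Random(seed)
    dice = [rng.randrange(sides) for _ in range(num_sources)]
    values = []
    for i in range(length):
        for k in range(num_sources):
            if i % (2 ** k) == 0:
                dice[k] = rng.randrange(sides)
        values.append(sum(dice))
    return values

C_MAJOR = [0, 2, 4, 5, 7, 9, 11]  # semitone offsets of the scale degrees

def to_midi(values, base=48):
    """Map each integer onto successive degrees of C major, starting at MIDI note `base`."""
    return [base + 12 * (v // 7) + C_MAJOR[v % 7] for v in values]

sequence = voss_pink_sequence(16)
melody = to_midi(sequence)
print(melody)
```

Because the slowly changing sources hold their values for several steps, neighboring notes tend to stay related while the line still wanders, which is the usual argument for why such sequences sound more musical than fully independent random notes.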

Randomly generated music often has a problem: it might have certain local parts that sound very interesting, but the whole thing does not necessarily sound like a real piece. Therefore, the first step was not to let the model expand the entire song directly, but to first pick a potential snippet from the random results.

Together with the model, I analyzed the original melody and ultimately selected measures 12 through 16 (5 measures in total) as the theme.
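Cutting a snippet out of a score is a mechanical operation once the score is XML. The sketch below shows one way to copy a measure range from a part; the `extract_measures` helper and the empty 20-measure stand-in score are assumptions for illustration, not the real material.

```python
import copy
import xml.etree.ElementTree as ET

def extract_measures(root, first, last, part_index=0):
    """Return deep copies of the measures numbered first..last (inclusive) in one part."""
    part = root.findall("part")[part_index]
    selected = []
    for measure in part.findall("measure"):
        number = measure.get("number", "")
        # Measure numbers are text in MusicXML; skip non-numeric ones such as "X1".
        if number.isdigit() and first <= int(number) <= last:
            selected.append(copy.deepcopy(measure))
    return selected

# Stand-in score: 20 empty measures in a single part.
root = ET.Element("score-partwise", version="4.0")
part = ET.SubElement(root, "part", id="P1")
for n in range(1, 21):
    ET.SubElement(part, "measure", number=str(n))

theme = extract_measures(root, 12, 16)
print([m.get("number") for m in theme])
# ['12', '13', '14', '15', '16']
```

The copies are independent of the original tree, so they can be renumbered and pasted into a new score without touching the source material.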

This melody had a few characteristics:

  • The rhythm was very jumpy, unlike a smooth, lyrical melody;
  • There were quite a few sudden leaps;
  • The length was 5 measures, which made the phrase somewhat asymmetrical;
  • Although it came from random generation, it contained locally memorable motifs.

It wasn’t as regular as traditional classical melodies, but precisely because of its irregularity, it possessed a unique sense of randomness and agility.

So the question became: what style is suitable for developing such a melody?

3. Do Not Force It to Be Classical

At first, I had the model make demos in a few different styles, such as a piano piece, jazz/funk, and electronic groove.

After listening, it was very obvious: this melody was not well suited to a traditional, formal classical piece. Its charm lay in its rhythmic jumps and irregular phrasing, so forcing a steady classical accompaniment onto it made it sound awkward.

Later, the direction became clearer: it was more suitable for a modern electronic groove or video game soundtrack style.

This style offers several advantages:

  • A stable drum beat can anchor the jumping melody;
  • A bass groove can provide a sense of direction;
  • A synth lead can make the theme sound bright;
  • The irregularity of the theme will not be viewed as a defect, but will instead become its personality.

This is also a very important point when creating with text models: do not just ask, “Can you help me expand this?” Instead, constantly judge, “What is this material itself suitable for?”

4. Let the Model Generate Scores, Not Just Give Advice

Once the direction was determined, I had the model start generating MusicXML.

The initial versions usually weren’t particularly good. The model could quickly write a complete structure, such as:

Intro - A - Transition - B - Breakdown - Development - Climax - Coda

It could also generate multiple voice parts such as bass, pads, drums, chords, and the main melody. But the problem was that the first output was often “structurally complete, but flat to the ear.”

This is the typical state of text models creating music: they are great at building frameworks, but they might not judge on the first try where it drags, where it is monotonous, or where it lacks a climax.

So the truly effective way of working is not “generating the final piece in one go,” but rather:

Generate a version -> Listen -> Point out problems -> Modify -> Listen again -> Modify again

This is a bit like collaborating with an arrangement assistant who never gets tired.

5. Iteration Is More Important Than Prompts

The parts where this creation truly improved basically all came from iteration.

For example, an early version had a B section and a transition, which looked reasonable structurally but sounded very long and lacked variation. After I pointed this out, the model shortened the overall length, cut out redundant sections, and brought the development forward, resulting in a tighter structure.

Later, I noticed that the climax was not prominent enough. The model’s initial approach was simply to raise the theme by an octave and thicken the voice parts, but it didn’t sound like it truly “lifted.” So I asked it to rework the connection from the development to the climax: instead of putting a long breakdown before the climax, use a pre-climax bridge that takes a brief breath and then dives straight into the climax.

Still later, I found that the main melody sounded too similar every time it appeared, like a copy-paste job. I then asked the model to make slight variations: keep the original theme for section A, color the development with the key of G, and change the climax to a high-register variation rather than a simple octave repetition.
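The difference between "raise it an octave" and "move it to G" is easy to see once the theme is a list of numbers. The sketch below uses an invented five-note theme (not the real one) and assumes C as the home key, which the later coda's return to C suggests. A real variation would also reshape rhythm or contour, which this sketch does not attempt.

```python
def transpose(notes, semitones):
    """Shift every (midi_pitch, length) pair by a fixed interval; rests (None) are kept."""
    return [(None if pitch is None else pitch + semitones, length)
            for pitch, length in notes]

# Hypothetical theme as (MIDI pitch, length in quarter notes); None marks a rest.
theme = [(60, 0.5), (67, 0.5), (None, 0.5), (64, 1.0), (72, 0.5)]

octave_up = transpose(theme, 12)  # same pitch classes, one octave higher
in_g = transpose(theme, 7)        # up a perfect fifth: C-based material now sits on G

print(octave_up)
print(in_g)
```

An octave shift keeps every pitch class, which is why it can sound like a louder copy. The shift by a fifth changes the tonal center, which is one plausible reading of the "G color" described above.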

None of these modifications were achieved through a magical prompt; they were achieved by repeatedly listening and repeatedly pointing out problems.

6. Humans Cover Aesthetics, Models Cover Implementation

My biggest takeaway from this experiment is that text models are well-suited to be “implementers,” but they should not be entirely relied upon as “aesthetic judges.”

For instance, such a model is very good at:

  • Writing MusicXML;
  • Generating multi-part scores;
  • Modifying chords based on requirements;
  • Adjusting bass patterns;
  • Modifying drum beats;
  • Generating different versions in batches;
  • Keeping the file structure correct.

But it is not always good at:

  • Judging whether a section is too long;
  • Judging whether a climax truly has emotional progression;
  • Judging whether the main melody is drowned out by the accompaniment;
  • Judging whether a certain section sounds like a dull, uneventful run-through;
  • Judging whether the overall style is unified.

These require human listening.

In this creation, my role was more like a director: I didn’t necessarily have to handwrite every single voice part, but I could judge that “this place is too fragmented,” “this place is too empty,” “the main melody here is unclear,” “there aren’t enough drums here,” or “the harmony here is too monotonous.” The model was then responsible for putting these judgments into effect within the score.

7. Why MusicXML Is Great for This Collaboration

If you just let the model generate audio directly, making modifications becomes quite difficult. You might only be able to say “make this part a bit more intense” or “make this part gentler,” and exactly what changes is hard to control.

The benefit of MusicXML is that it is highly editable:

  • If you want to change a certain voice part, you can directly modify the corresponding part element;
  • If you want to adjust a specific measure, you can locate it via its measure element;
  • If you want to alter a chord, you can change the harmony element;
  • If you want to adjust the main melody, you can change the note elements;
  • If you want to add dynamics, you can write in a marking such as mf or f;
  • If you want to add articulations, you can write in accent, tenuto, or staccato.

In other words, the music is broken down into a structure that can be operated on directly.
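The kinds of edits listed above can be sketched with the standard library alone. The helper names (`find_measure`, `add_dynamic`, `add_articulation`) and the tiny one-note score are assumptions for illustration; the element names they write (direction, dynamics, notations, articulations) are standard MusicXML.

```python
import xml.etree.ElementTree as ET

SCORE = """<score-partwise version="4.0">
  <part-list><score-part id="P1"><part-name>Lead</part-name></score-part></part-list>
  <part id="P1">
    <measure number="1">
      <attributes><divisions>1</divisions></attributes>
      <note><pitch><step>C</step><octave>5</octave></pitch><duration>1</duration></note>
    </measure>
  </part>
</score-partwise>"""

def find_measure(root, part_id, number):
    """Locate one measure by part id and measure number."""
    for part in root.findall("part"):
        if part.get("id") == part_id:
            for measure in part.findall("measure"):
                if measure.get("number") == str(number):
                    return measure
    raise LookupError(f"no measure {number} in part {part_id}")

def add_dynamic(measure, mark):
    """Insert a dynamic marking such as 'mf' or 'f' at the start of a measure."""
    direction = ET.Element("direction", placement="below")
    direction_type = ET.SubElement(direction, "direction-type")
    dynamics = ET.SubElement(direction_type, "dynamics")
    ET.SubElement(dynamics, mark)
    children = list(measure)
    # Keep <attributes> first when the measure starts with it.
    index = 1 if children and children[0].tag == "attributes" else 0
    measure.insert(index, direction)

def add_articulation(note, name):
    """Attach an articulation such as 'accent', 'tenuto', or 'staccato' to a note."""
    notations = note.find("notations")
    if notations is None:
        notations = ET.SubElement(note, "notations")
    articulations = notations.find("articulations")
    if articulations is None:
        articulations = ET.SubElement(notations, "articulations")
    ET.SubElement(articulations, name)

root = ET.fromstring(SCORE)
measure = find_measure(root, "P1", 1)
add_dynamic(measure, "mf")
add_articulation(measure.find("note"), "staccato")
print(ET.tostring(measure, encoding="unicode"))
```

Each request of the form "make measure 1 mezzo-forte" or "make that note staccato" maps to one small, inspectable change, which is what makes the feedback loop controllable.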

This is extremely friendly to text models because they are inherently good at handling structured text and code. Many times, I wasn’t letting it “create music out of thin air,” but rather letting it write scripts to generate scores. This way, we could iterate quickly while maintaining controllability.
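As a sketch of what "a script that generates the score" can look like, the code below writes a bass part from a chord progression. The progression, the `ROOTS` table, and the root-and-octave eighth-note pattern are invented for illustration and are not the actual arrangement.

```python
import xml.etree.ElementTree as ET

# Hypothetical chord roots as (step, octave) for the lower bass note.
ROOTS = {"C": ("C", 2), "Am": ("A", 1), "F": ("F", 1), "G": ("G", 1)}

def bass_measure(number, chord, divisions=2):
    """One 4/4 measure of eighth notes alternating between the root and its octave.

    `divisions` is the number of duration units per quarter note and must be even.
    """
    step, octave = ROOTS[chord]
    measure = ET.Element("measure", number=str(number))
    if number == 1:
        attributes = ET.SubElement(measure, "attributes")
        ET.SubElement(attributes, "divisions").text = str(divisions)
    for i in range(8):
        note = ET.SubElement(measure, "note")
        pitch = ET.SubElement(note, "pitch")
        ET.SubElement(pitch, "step").text = step
        ET.SubElement(pitch, "octave").text = str(octave + (i % 2))
        ET.SubElement(note, "duration").text = str(divisions // 2)
        ET.SubElement(note, "type").text = "eighth"
    return measure

def bass_part(progression, part_id="P2"):
    """Build a <part> element with one generated measure per chord."""
    part = ET.Element("part", id=part_id)
    for number, chord in enumerate(progression, start=1):
        part.append(bass_measure(number, chord))
    return part

part = bass_part(["C", "Am", "F", "G"])
print(len(part.findall("measure")), "measures generated")
```

Changing the groove then means editing one function and regenerating, instead of rewriting every measure by hand, which is where the fast iteration comes from.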

8. Rough Structure of the Final Work

The final version ended up being around 40 measures long, with the structure:

Intro pulse
Introductory snippet
A - varied theme
Transition / gentle lift
Development - G color
Pre-climax bridge
Climax - varied theme in G
Coda - return to C

The instrumentation roughly included:

  • Main melody synth lead;
  • A supporting layer that slightly thickens the main melody;
  • Mallet ostinato;
  • Chord hits;
  • Warm pad;
  • Expressive bass;
  • Drum kit.

The overall style can be summarized as: a short video game soundtrack piece in a modern electronic groove style.

It preserved the jumpy feel of the original random melody, while organizing it into a relatively complete work through drums, bass, harmony, and structure.

9. Who Is This Method Suitable For?

I think this method is particularly suitable for a few types of people:

  1. People who have melodic ideas but do not know how to do a full arrangement;
  2. People who can judge the listening experience but are unfamiliar with MusicXML or orchestration;
  3. People who want to develop randomly generated material into a complete work;
  4. People who want to quickly test out many versions;
  5. People who know a bit of code and are willing to treat music as structured data.

It won’t necessarily generate professional-grade works directly, but it is excellent for drafts, experimentation, and rapid iteration.

10. Summary of Some Experiences

After this experiment, I feel that the more effective approach is:

  • Do not let the model write a complete song right from the start; pick a theme first;
  • Determine what style fits the theme first;
  • Let the model generate demos in multiple directions;
  • Make each piece of feedback as specific as possible, such as “the climax is not prominent enough,” “the bass is too mechanical,” or “the main melody is drowned out”;
  • Do not be afraid of iterating through multiple versions;
  • In the end, you must rely on the human ear to judge, rather than just looking at the score’s structure.

Having a text model generate musical scores is not really “one-click composing,” but rather a new way of collaborating.

It lowers many technical barriers: writing scores, expanding voice parts, generating MusicXML, and adjusting structures can all be completed very quickly. However, whether the music ultimately works still depends on human auditory judgment and aesthetic choices.

This is probably the most interesting part of current AI creation: it doesn’t do everything for you, but it lets you try, fail, and revise faster, gradually getting closer to what you want.