A small data cleaning experiment.

Data

I keep a file of my favorite quotes, amassing 7 pages of them over the years. My guess is that there’s something like a hundred in there.

I have to guess because, had I been more thoughtful when I first started keeping them, I might have put them into a proper database (or at least a numbered list). Instead, the quote text was copied & pasted into a flat file of “semi-structured but mixed format” 1.

You can peruse the raw data here. I can almost discern a kind of “geological strata” from different archaeological periods of collecting these data.

Method

The file is not that big but just big enough that I have little interest in cleaning it up myself, either by hand (wouldn’t take long but no fun) or by writing code to parse it (also yawn) 2.

Instead let’s evaluate how well the current generation of chatbot/copilot handles basic cleaning with only a modest amount of guidance/instruction.
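For a sense of what “writing code to parse it” would entail, here is a minimal sketch of the sort of script I was avoiding. It assumes (optimistically) that quotes are separated by blank lines and emits the fortune format, where entries are separated by lines containing a single %; the mixed formats in the real file are exactly what makes this approach tedious.

```python
# Minimal sketch of the parser I didn't want to write.
# Assumption: quotes are separated by blank lines (the real file is messier).
# Output: Linux fortune format, i.e. entries separated by lines of a single %.

def to_fortune(raw: str) -> str:
    """Split raw text on blank lines and join the blocks with % separators."""
    blocks = [b.strip() for b in raw.split("\n\n") if b.strip()]
    return "\n%\n".join(blocks) + "\n"

if __name__ == "__main__":
    sample = (
        "Doubt is not a pleasant condition, but certainty is absurd. - Voltaire\n"
        "\n"
        "The only normal people are the ones you don't know very well. - Joe Ancis\n"
    )
    print(to_fortune(sample))
```

Note this does nothing about multi-line quotes whose internal blank lines should be preserved, which is one of the cases I'd rather let the agent puzzle out.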

Results

ChatGPT

Started by giving ChatGPT 5.2 the following prompt:

I have collected my favorite quotes in a file. Most of the quotes were copy and pasted into this file with the quote text first followed by the author (or some other reference or attribution). Many of the quotes are a single line while others are multi-line with some attempt to preserve whitespace for readability. Some are just links (usually as a URL meant for a web browser). If I paste the file data here (or attach as a file, whatever you prefer) can you help me parse these data and output the cleaned up quotes (with proper attribution when available) to the Linux fortune file format?
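For context, the fortune file format mentioned in the prompt is about as simple as formats get: plain-text entries separated by lines consisting of a single % character (fortune then reads the text file alongside an index built by strfile). A two-entry file looks like:

```
Doubt is not a pleasant condition, but certainty is absurd. - Voltaire
%
The only normal people are the ones you don't know very well. - Joe Ancis
%
```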

As with most oneshot attempts, the immediate results seem promising but fall apart quickly upon scrutiny.

In other words, I was happy at first with the initial output, since it was not obvious that it was only partial. Then ChatGPT offered to provide “all the quotes” because it had done some “trimming”. Naturally I asked for the complete output along with a total quote count (the chatbot replied “153”, which seemed high).

Ended up being too messy to evaluate/iterate/fix in a chat context, so I did not waste time here and pivoted quickly to a copilot, which should work better for this task-oriented request.

GitHub Copilot/Claude Haiku

I took the above prompt and turned it into a half-page of AGENTS.md guidance for GitHub Copilot (Auto picked Claude Haiku 4.5). There was an “analyze & plan” button so I clicked that first 3, which created an agent plan at .github/copilot-instructions.md.

Told it “Plan seems fine, please proceed to implement”. Some notes:

  • QC report said it found 80 total quotes (seemed a bit low).
  • Spurred me to find at least one missing quote (Lombardi placeholder).
  • The agent also complained about not being able to run Python scripts in the terminal 4, so it modified the file “in place” or something.

Humm. o_o

GitHub Copilot/GPT-5 mini

Based on my above experience, where the copilot tried to script (twice) but complained that it could not execute the scripts, I decided to offer additional guidance to nudge it along the route it wanted to take instead of letting it fall back to something it thought was inferior.

There was no Codespaces GUI button this time so I pasted the saved prompt 5, which created a different-looking .github/copilot-instructions.md 6. More notes:

  • Agent was able to successfully run its scripts this time.
  • Output had problems but it only took a couple quick interactions to point out where they were 7.
  • The second set of fixes would have been faster by hand but I wanted to see what the agent would do 8.

Discussion

Overall, not too much additional work on my part to successfully nudge Copilot to the finish line.

While I could further discuss the above results and those of subsequent steps (e.g., asking copilot to check source attribution 9), I’m not trying to get published in a peer-reviewed journal here so I’ll simply leave the gentle reader with the below.

Self-guided Tour

Regarding the repo, I did try to tag commits such that fellow practitioners could visit these waypoints and perhaps more readily locate themselves in the space/time of git (and dig further if so inclined).

For example see v0.1 tag for inputs, intermediate working material, and outputs at that point in development.

Does It Work

While the file format is hard to get wrong, it is certainly possible that fortune can’t parse what Copilot produced. Let’s try installing and running fortune on the freshly reformatted file.
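For anyone following along, the setup is roughly the following (the package name is an assumption for Debian/Ubuntu; fortune wants a .dat index next to the text file, which strfile generates):

```shell
# Assumes Debian/Ubuntu; the package is named differently elsewhere
sudo apt-get install fortune-mod
# Build the random-access index (creates quotes.dat)
strfile quotes
# Draw a random entry from the local file
fortune quotes
```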

$ fortune quotes
It is always wise to look ahead, but difficult to look further than you can see.
- Winston Churchill
$ fortune quotes
Nothing is easier than self-deceit. For what each man wishes, that he also believes
 to be true. - Demosthenes
$ fortune quotes
"Doubt is not a pleasant condition, but certainty is absurd." - Voltaire
$ fortune quotes
"Don't worry about the world coming to an end today. It's already tomorrow in
 Australia." - Charles M. Schulz (1922-2000)
$ fortune quotes
The only normal people are the ones you don't know very well. - Joe Ancis
$

I do still enjoy many of these quotes.

Why fortune?

So why fortune instead of, say, a database or at least a more capable flat file format?

Well, I spent time in Dec hacking some boxes (part Advent of Cyber 2025 10, part knocking off some offsec/CTF rust) when I realized I hadn’t seen a MOTD 11 at login for a long while and got an unexpected wave of nostalgia about random fortunes.

Quirky messages of the day were a fun way for a sysadmin’s personality to shine through to his or her users.

– JW

Footnotes

  1. That’s what I told copilot anyways. :-D 

  2. While this could be a straightforward task for a student or intern, I don’t see myself asking another human being to clean up my quotes collection. 

  3. That button just generates the following prompt.

  4. Not unlike a real intern, who sometimes has issues getting things they build to run on their computer.

  5. Perhaps better for reproducibility, at least to the extent a non-deterministic process can be. Also I git rm’d the .github/ dir and output from the previous attempt to (hopefully?) provide a clean slate. 

  6. Auto had switched me to a different model (GPT-5 mini). I fixed an obvious problem in its “Actionable parsing guidance for agents” but the updated plan otherwise seemed fine. 

  7. There are 122 total quotes (supposedly). At some point I could go to the raw input and figure out how many quotes I actually have. 

  8. Agent updated the script again. Arguably not worthwhile for a one-off parser that probably won’t be run again. 

  9. A clumsy stab at it anyways. May revisit in the future. 

  10. Finally committed to doing one of these. While it turned out to be a decent way for me to survey what was going on in other areas of cyber, the back half of the month got pretty tedious. No fault of the organizers or authors of the challenges. Likely just missed my window. 

  11. Message of the day.