Commit 15b84cc

20250907_00-Release - CodExorcism Release - Not just for Codex
- Expanded quote normalization: map additional Unicode quote/prime/angle/fullwidth marks to ASCII ' and " for shell-safe output
- Refined VS Code filter handling: only apply newline compensation in filter mode; never in file-write modes; respect CI/CD env
- Normalize Unicode spaces: replace NBSP (U+00A0), NARROW NBSP (U+202F), EN/EM/THIN spaces (U+2000–U+200A), IDEOGRAPHIC SPACE (U+3000), etc., with ASCII space
- Remove bidi/zero-width controls: strip LRM/RLM, embeddings/overrides/isolates, ZWSP/ZWNJ/ZWJ, BOM
- Note: These artifacts were observed in content produced by Codex/VS Code extensions
- No breaking changes; behavior unchanged for already-clean inputs
- Ellipsis handling and normalization
1 parent d8b4602 commit 15b84cc

6 files changed, +185 -109 lines changed

CHANGELOG.md

Lines changed: 6 additions & 1 deletion
@@ -2,10 +2,15 @@
22

33
## 2025-08-12
44

5-
### Minor patch
5+
### **CodExorcism Release - Not just for Codex**
6+
67
- Expanded quote normalization: map additional Unicode quote/prime/angle/fullwidth marks to ASCII ' and " for shell-safe output
78
- Refined VS Code filter handling: only apply newline compensation in filter mode; never in file-write modes; respect CI/CD env
9+
- Normalize Unicode spaces: replace NBSP (U+00A0), NARROW NBSP (U+202F), EN/EM/THIN spaces (U+2000–U+200A), IDEOGRAPHIC SPACE (U+3000), etc., with ASCII space
10+
- Remove bidi/zero-width controls: strip LRM/RLM, embeddings/overrides/isolates, ZWSP/ZWNJ/ZWJ, BOM
11+
- Note: These artifacts were observed in content produced by Codex/VS Code extensions
812
- No breaking changes; behavior unchanged for already-clean inputs
13+
- Ellipsis handling and normalization
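
The space-normalization and control-stripping bullets above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function name `normalize_spaces_and_controls` and both pattern names are hypothetical, and the character sets are limited to the code points the changelog names (the real tool may cover more).

```python
import re

# Assumed sets, taken only from the changelog bullets; the real tool may differ.
# NBSP, NARROW NBSP, U+2000-U+200A (EN/EM/THIN and friends), IDEOGRAPHIC SPACE
UNICODE_SPACES = re.compile("[\u00a0\u202f\u2000-\u200a\u3000]")
# LRM/RLM, embeddings/overrides (U+202A-U+202E), isolates (U+2066-U+2069),
# ZWSP/ZWNJ/ZWJ (U+200B-U+200D), BOM (U+FEFF)
INVISIBLES = re.compile("[\u200e\u200f\u202a-\u202e\u2066-\u2069\u200b-\u200d\ufeff]")


def normalize_spaces_and_controls(text: str) -> str:
    """Replace exotic spaces with ASCII space, then drop invisible controls."""
    text = UNICODE_SPACES.sub(" ", text)
    return INVISIBLES.sub("", text)
```

Already-clean ASCII input passes through unchanged, which matches the "no breaking changes" note.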
914

1015
## 2025-07-28
1116

README.md

Lines changed: 67 additions & 22 deletions
@@ -1,21 +1,28 @@
1-
# UnicodeFix
1+
# UnicodeFix - *CodExorcism Edition*
22

33
![UnicodeFix Hero Image](docs/controlling-unicode.png)
44

5-
- [UnicodeFix](#unicodefix)
5+
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue)](#) [![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](LICENSE) [![Release](https://img.shields.io/github/v/tag/unixwzrd/UnicodeFix?label=release)](https://github.com/unixwzrd/UnicodeFix/releases)
6+
7+
- [UnicodeFix - *CodExorcism Edition*](#unicodefix---codexorcism-edition)
68
- [**Finally - a tool that blasts AI fingerprints, torches those infuriating smart quotes, and leaves your code \& docs squeaky clean for real humans.**](#finally---a-tool-that-blasts-ai-fingerprints-torches-those-infuriating-smart-quotes-and-leaves-your-code--docs-squeaky-clean-for-real-humans)
79
- [Why Is This Happening?](#why-is-this-happening)
810
- [Installation](#installation)
911
- [Usage](#usage)
12+
- [New options](#new-options)
13+
- [When to preserve invisible characters (`-i`)](#when-to-preserve-invisible-characters--i)
1014
- [Brief Examples](#brief-examples)
1115
- [Pipe / Filter (STDIN to STDOUT)](#pipe--filter-stdin-to-stdout)
1216
- [Batch Clean](#batch-clean)
1317
- [In-Place (Safe) Clean](#in-place-safe-clean)
1418
- [Preserve Temp File for Backup](#preserve-temp-file-for-backup)
1519
- [Using in vi/vim/macvim](#using-in-vivimmacvim)
1620
- [What's New / What's Cool](#whats-new--whats-cool)
21+
- [CodexExorcism Release (Sept 2025)](#codexexorcism-release-sept-2025)
22+
- [Previous Releases](#previous-releases)
23+
- [Keep It Fresh!](#keep-it-fresh)
1724
- [Shortcut for macOS](#shortcut-for-macos)
18-
- [To add the Shortcut:](#to-add-the-shortcut)
25+
- [To add the Shortcut](#to-add-the-shortcut)
1926
- [What's in This Repository](#whats-in-this-repository)
2027
- [Testing and CI/CD](#testing-and-cicd)
2128
- [Contributing](#contributing)
@@ -42,6 +49,8 @@ Nearly a thousand people have grabbed it. Nobody's bought me a coffee yet, but h
4249

4350
Some folks think all this Unicode cruft is a side-effect of generative AI's training data. Others believe it's a deliberate move - baked-in "watermarks" to ID machine-generated text. Either way: these artifacts leave a trail. UnicodeFix wipes it.
4451

52+
Be careful: professors and reviewers may even start planting Unicode honeypots in starter code or essays - UnicodeFix torches those too. In this "AI Arms Race," `diff` and `vimdiff` are your night-vision goggles.
53+
4554
---
4655

4756
## Installation
@@ -55,6 +64,7 @@ bash setup.sh
5564
```
5665

5766
The `setup.sh` script:
67+
5868
- Creates a Python virtual environment just for UnicodeFix
5969
- Installs dependencies
6070
- Adds handy startup config to your `.bashrc` for one-command usage
@@ -69,56 +79,79 @@ For serious environment nerds: [VenvUtil](https://github.com/unixwzrd/venvutil)
6979

7080
Once installed and activated:
7181

72-
```
73-
(python-3.10-PA-dev) [unixwzrd@xanax: UnicodeFix]$ cleanup-text --help
82+
```bash
83+
(LLaSA-speech) [unixwzrd@xanax: bin]$ cleanup-text --help
7484

75-
usage: cleanup-text [-h] [-i] [-o OUTPUT] [-t] [-p] [-n] [infile ...]
85+
usage: cleanup-text [-h] [-i] [-Q] [-D] [-n] [-o OUTPUT] [-t] [-p] [infile ...]
7686

77-
Clean Unicode quirks from text. If no input files are given, reads from STDIN and writes to STDOUT (filter mode). If input files are given, creates cleaned files with .clean before the extension (e.g., foo.txt -> foo.clean.txt). Use -o - to force output to STDOUT for all input files, or -o <file> to specify a single output file
78-
(only with one input file).
87+
Clean Unicode quirks from text. If no input files are given, reads from STDIN and writes to STDOUT (filter mode). If input files are given, creates cleaned files with .clean before the extension (e.g., foo.txt -> foo.clean.txt). Use -o - to force output to STDOUT for all input files, or -o <file> to specify a single output file (only with one
88+
input file).
7989

8090
positional arguments:
8191
infile Input file(s)
8292

8393
options:
8494
-h, --help show this help message and exit
85-
-i, --invisible Preserve invisible Unicode characters (zero-width, non-breaking, etc.)
95+
-i, --invisible Preserve invisible Unicode characters (zero-width, bidi controls, etc.)
96+
-Q, --keep-smart-quotes
97+
Preserve Unicode smart quotes (do not convert to ASCII)
98+
-D, --keep-dashes Preserve Unicode EN/EM dashes (do not convert to ASCII)
99+
-n, --no-newline Do not add a newline at the end of the output file (suppress final newline).
86100
-o OUTPUT, --output OUTPUT
87101
Output file name, or '-' for STDOUT. Only valid with one input file, or use '-' for STDOUT with multiple files.
88102
-t, --temp In-place cleaning: Move each input file to .tmp, clean it, write cleaned output to original name, and delete .tmp after success.
89103
-p, --preserve-tmp With -t, preserve the .tmp file after cleaning (do not delete it). Useful for backup or manual recovery.
90-
-n, --no-newline Do not add a newline at the end of the output file (suppress final newline).
91104
```
92105

106+
### New options
107+
108+
- `-Q`, `--keep-smart-quotes`: Preserve Unicode smart quotes (curly single/double quotes). Useful when preparing prose/blog posts where typographic quotes are intentional. Default behavior converts them to ASCII for shell/CI safety.
109+
- `-D`, `--keep-dashes`: Preserve EN/EM dashes. Useful when stylistic punctuation is desired in prose. Default behavior converts EM dash to ` - ` and EN dash to `-`.
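
A rough sketch of how the default conversions and the two keep-flags could interact. The function name `convert_punctuation` and its keyword arguments are hypothetical stand-ins for `-Q` and `-D`, and only the four basic curly quotes are mapped here; the release notes say the tool handles many more marks.

```python
# Assumed minimal mappings; the real tool covers additional quote/prime marks.
QUOTES = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
}
DASHES = {
    "\u2014": " - ",  # EM dash becomes spaced hyphen
    "\u2013": "-",    # EN dash becomes plain hyphen
}


def convert_punctuation(text, keep_smart_quotes=False, keep_dashes=False):
    """Apply the default ASCII conversions unless a keep-flag is set."""
    table = {}
    if not keep_smart_quotes:
        table.update(QUOTES)
    if not keep_dashes:
        table.update(DASHES)
    # str.maketrans accepts single-character keys mapped to replacement strings
    return text.translate(str.maketrans(table))
```

Building one translation table per call keeps the two flags independent: setting one leaves the other conversion untouched.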
110+
111+
#### When to preserve invisible characters (`-i`)
112+
113+
In most code/CI workflows, invisible/bidi controls are accidental and should be removed (default). Rare cases to preserve (`-i`):
114+
115+
- Linguistic text where ZWJ/ZWNJ influence shaping
116+
- Intentional watermarks/markers in text
117+
- Forensic/debug inspections before deciding what to strip
118+
93119
## Brief Examples
94120

95121
### Pipe / Filter (STDIN to STDOUT)
96-
```
122+
123+
```bash
97124
cat file.txt | cleanup-text > cleaned.txt
98125
```
99126

100127
### Batch Clean
101-
```
128+
129+
```bash
102130
cleanup-text *.txt
103131
```
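
The help text says batch mode writes `foo.txt` to `foo.clean.txt`. A minimal sketch of that naming rule follows; `clean_name` is a hypothetical helper, and how the real tool treats extensionless or multi-suffix names is an assumption here.

```python
from pathlib import Path


def clean_name(path):
    """Insert '.clean' before the final extension (assumed rule)."""
    p = Path(path)
    return str(p.with_name(f"{p.stem}.clean{p.suffix}"))
```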
104132

105133
### In-Place (Safe) Clean
106-
```
134+
135+
```bash
107136
cleanup-text -t myfile.txt
108137
```
109138

110139
### Preserve Temp File for Backup
111-
```
140+
141+
```bash
112142
cleanup-text -t -p myfile.txt
113143
```
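
The `-t` / `-p` flow described in the help text (move the original aside, write the cleaned text under the original name, then delete or keep the temp copy) might look something like this. It is a sketch under assumptions: `clean_in_place` and the `cleaner` callback are hypothetical, and the temp name is assumed to be the original path plus `.tmp`.

```python
import os


def clean_in_place(path, cleaner, preserve_tmp=False):
    """Move path to path + '.tmp', write cleaned text back, optionally keep the temp."""
    tmp_path = path + ".tmp"  # assumed naming; the help text only says ".tmp"
    os.replace(path, tmp_path)  # the original survives here if cleaning fails
    with open(tmp_path, encoding="utf-8", newline="") as src:
        cleaned = cleaner(src.read())
    with open(path, "w", encoding="utf-8", newline="") as dst:
        dst.write(cleaned)
    if not preserve_tmp:
        os.remove(tmp_path)
```

Because the original is moved rather than overwritten, a crash in the cleaning step leaves the untouched content in the `.tmp` file for manual recovery.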
114144

115145
### Using in vi/vim/macvim
116146

117-
```
147+
```vim
118148
:%!cleanup-text
119149
```
120150

121-
You can run it from Vim, VS Code in Vim mode, or as a pre-commit. Use it for email, blog posts, whatever. Ignore the naysayers - this is *real-world convenience.*
151+
Works great for vi/Vim purists, VS Code hipsters, or anyone who just wants their text to behave like text.
152+
Also handy if you’re trying to slip your AI-generated code past your CS prof without curly quotes giving you away.
153+
154+
You can run it from Vim, VS Code in Vim mode, or as a pre-commit. Use it for email, blog posts, whatever. Ignore the naysayers - this is _real-world convenience._
122155

123156
See [cleanup-text.md](docs/cleanup-text.md) for deeper dives and arcane options.
124157

@@ -129,15 +162,28 @@ See [cleanup-text.md](docs/cleanup-text.md) for deeper dives and arcane options.
129162

130163
## What's New / What's Cool
131164

132-
- **Vaporizes invisible Unicode (unless you tell it not to)**
165+
### CodexExorcism Release (Sept 2025)
166+
167+
Exorcise your code from VS Code/Codex’s funky Unicode artifacts (NBSPs, bidi controls, smart quotes).
168+
169+
- **Safer EOF handling in VS Code filter mode**
170+
- **Normalizes more sneaky Codex/AI fingerprints**
171+
- **Ellipsis Eradication**
172+
173+
### Previous Releases
174+
133175
- **Normalizes EM/EN dashes to true ASCII - no more AI " - " nonsense**
134176
- **Wipes AI "tells," watermarks, and digital fingerprints**
135177
- **Fixes trailing whitespace, normalizes newlines, burns the digital junk**
136178
- **Portable (Python 3.7+), cross-platform**
137179
- **Integrated macOS Shortcut for right-click cleaning in Finder**
138180
- **Can be used in CI/CD - but also by normal humans, not just pipeline freaks**
139181

140-
> *Fun fact*: Even Python will execute code with "curly quotes." Your IDE, email client, and browser all sneak these in. UnicodeFix hunts them down and torches them.
182+
> *Fun fact*: Even Python will execute code with "curly quotes." Your IDE, email client, and browser all sneak these in. UnicodeFix hunts them down and torches them, so your coding homework looks *lovingly hand-crafted* at 4:37 a.m. rather than like LLM spawn.
183+
184+
### Keep It Fresh!
185+
186+
Pull requests/issues always welcome - especially if your AI friend slipped a new weird Unicode gremlin past me. I found a few more while preparing this release too...🙄
141187

142188
---
143189

@@ -147,7 +193,7 @@ UnicodeFix ships with a macOS Shortcut for direct Finder integration.
147193

148194
Right-click files, pick a Quick Action, and - bam - no terminal required.
149195

150-
### To add the Shortcut:
196+
### To add the Shortcut
151197

152198
1. Open the **Shortcuts** app.
153199
2. Choose `File -> Import`.
@@ -203,14 +249,13 @@ Pull requests with attitude, creativity, and clean diffs appreciated.
203249

204250
## Support This and Other Projects
205251

206-
If UnicodeFix (or my other projects) saved your bacon or made you smile,
207-
please consider fueling my caffeine habit and indie dev obsession:
252+
If UnicodeFix (or my other projects) saved your bacon or made you smile, please consider fueling my caffeine habit and indie dev obsession...
208253

209254
- [Patreon](https://patreon.com/unixwzrd)
210255
- [Ko-Fi](https://ko-fi.com/unixwzrd)
211256
- [Buy Me a Coffee](https://buymeacoffee.com/unixwzrd)
212257

213-
One coffee = one more tool released to the wild.
258+
Quite a bit of effort goes into preparing these releases. *One coffee = one more tool released to the wild...*🤔
214259

215260
Thank you for keeping solo development alive!
216261
