Commit 549952f

Cleanup: Simplified Readme, improved logs, fixed NaN outputs and removed mailmap since it didn't affect GitHub's 'Insights' stats
1 parent c208927 commit 549952f

10 files changed: +57 -203 lines changed

.mailmap

Lines changed: 0 additions & 4 deletions
This file was deleted.

README.md

Lines changed: 25 additions & 130 deletions
@@ -1,6 +1,6 @@
 # Detect-File-Encoding-and-Language
 
-![npm](https://img.shields.io/npm/dw/detect-file-encoding-and-language)
+![npm](https://img.shields.io/npm/dm/detect-file-encoding-and-language)
 ![npm](https://img.shields.io/npm/v/detect-file-encoding-and-language)
 ![npm bundle size](https://img.shields.io/bundlephobia/min/detect-file-encoding-and-language)
 
@@ -18,182 +18,77 @@ Determine the encoding and language of text files!
 
 For reliable encoding and language detection, use files containing 500 words or more. Smaller inputs can work as well but the results might be less accurate and in some cases incorrect.
 
-Feel free to test the functionality of this NPM package [here](https://encoding-and-language-detector.netlify.app/). Upload your own files and see if the encoding and language are detected correctly!
-
-## Index
-
-- [Detect-File-Encoding-and-Language](#detect-file-encoding-and-language)
-- [Functionality](#functionality)
-- [Index](#index)
-- [Usage](#usage)
-- [In the browser](#in-the-browser)
-- [Using the script tag](#using-the-script-tag)
-- [Via CDN](#via-cdn)
-- [Via download](#via-download)
-- [Usage](#usage-1)
-- [Using a bundler](#using-a-bundler)
-- [Installation](#installation)
-- [Usage](#usage-2)
-- [In Node.js](#in-nodejs)
-- [Installation](#installation-1)
-- [Usage](#usage-3)
-- [In the terminal (CLI)](#in-the-terminal-cli)
-- [Installation](#installation-2)
-- [Usage](#usage-4)
-- [Supported Languages](#supported-languages)
-- [Used Encodings](#used-encodings)
-- [Confidence Score](#confidence-score)
-- [Known Issues](#known-issues)
-- [License](#license)
-
-## Usage
-
-There are several ways in which you can use this NPM package. You can use it as a [command-line interface](#in-the-terminal-cli), server-side [with Node.js](#in-nodejs) or client-side [in the browser](#in-the-browser).
+## Live Demo
 
-### In the browser
-
-In the body section of your html file, create an input element of type `file` and give it an id.
-
-```js
-// index.html
-<body>
-  <input type="file" id="my-input-field" />
-  <script src="app.js"></script>
-</body>
-```
+Feel free to test the functionality of this NPM package [here](https://encoding-and-language-detector.netlify.app/). Upload your own files and see if the encoding and language are detected correctly!
 
-Next, load the module either by [using the script tag](#using-the-script-tag) or by [using a bundler](#using-a-bundler)!
+## Installation
 
-#### Using the script tag
+`npm install --save detect-file-encoding-and-language`
 
-When loading it via the `<script>` tag, you can either use the CDN version or download the code itself and include it in your project. For a quickstart use the [CDN version](#via-cdn). If you want to be able to use it offline, [download and include it](#via-download)!
+## Usage
 
-##### Via CDN
+### Script Tag
 
 ```js
 // index.html
-
 <body>
   <input type="file" id="my-input-field" />
   <script src="https://unpkg.com/detect-file-encoding-and-language/umd/language-encoding.min.js"></script>
   <script src="app.js"></script>
 </body>
-```
 
-Now that you've loaded the module, you can [start using it](#usage-1).
-
-##### Via download
-
-1. Create a new folder called `lib` inside your root directory
-2. Inside `lib` create a new file and call it `language-encoding.min.js`
-3. Make sure the encoding of your newly created file is either `UTF-8` or `UTF-8 with BOM` before proceeding!
-4. Go to https://unpkg.com/detect-file-encoding-and-language/umd/language-encoding.min.js and copy the code
-5. Paste it into `language-encoding.min.js` and save it
-6. Use the code below to load `language-encoding.min.js` via the `<script>` tag.
-
-```js
-// index.html
-
-<body>
-  <input type="file" id="my-input-field" />
-  <script src="lib/language-encoding.min.js"></script>
-  <script src="app.js"></script>
-</body>
-```
-
-##### Usage
-
-The `<script>` tag exposes the `languageEncoding` function to everything in the DOM located beneath it. When you call it and pass in the file that you want to analyze, it'll return a Promise that you can use to retrieve the encoding, language and confidence score as shown in the example below.
-
-```js
 // app.js
-
-document
-  .getElementById("my-input-field")
-  .addEventListener("change", inputHandler);
-
+document.getElementById("my-input-field").addEventListener("change", inputHandler);
 function inputHandler(e) {
   const file = e.target.files[0];
-
   languageEncoding(file).then((fileInfo) => console.log(fileInfo));
   // Possible result: { language: english, encoding: UTF-8, confidence: { encoding: 1, language: 1 } }
 }
 ```
 
-#### Using a bundler
-
-##### Installation
-
-```bash
-$ npm install detect-file-encoding-and-language
-```
+If you don't want to use a CDN feel free to [download the source code](https://github.com/gignupg/Detect-File-Encoding-and-Language/wiki/Downloading-the-Source-Code)!
 
-##### Usage
+### React and other frameworks
 
 ```js
-// app.js
-
-const languageEncoding = require("detect-file-encoding-and-language");
-
-document
-  .getElementById("my-input-field")
-  .addEventListener("change", inputHandler);
+// index.html
+<body>
+  <input type="file" id="my-input-field" />
+  <script src="app.js"></script>
+</body>
 
+// app.js
+import languageEncoding from "detect-file-encoding-and-language";
+document.getElementById("my-input-field").addEventListener("change", inputHandler);
 function inputHandler(e) {
   const file = e.target.files[0];
-
   languageEncoding(file).then((fileInfo) => console.log(fileInfo));
   // Possible result: { language: french, encoding: CP1252, confidence: { encoding: 1, language: 0.97 } }
 }
 ```
 
-> Note: This works great with frameworks such as React because they are doing the bundling for you. However, if you're using pure vanilla Javascript you will have to bundle it yourself!
-
-### In Node.js
-
-#### Installation
-
-```bash
-$ npm install detect-file-encoding-and-language
-```
-
-#### Usage
+### Node
 
 ```js
-// index.js
-
+// server.js
 const languageEncoding = require("detect-file-encoding-and-language");
-
 const pathToFile = "/home/username/documents/my-text-file.txt";
-
 languageEncoding(pathToFile).then((fileInfo) => console.log(fileInfo));
 // Possible result: { language: japanese, encoding: Shift-JIS, confidence: { encoding: 0.94, language: 0.94 } }
 ```
 
-### In the terminal (CLI)
-
-#### Installation
+### CLI
 
 ```bash
-$ npm install -g detect-file-encoding-and-language
-```
-
-#### Usage
+# Installation
+npm install -g detect-file-encoding-and-language
 
-Once installed you'll be able to use the command `dfeal` to retrieve the encoding and language of your text files.
-
-```bash
-$ dfeal "/home/username/Documents/subtitle file.srt"
+# Usage
+dfeal "/home/username/Documents/subtitle file.srt"
 # Possible result: { language: french, encoding: CP1252, confidence: { encoding: 0.99, language: 0.99 } }
 ```
 
-or without quotation marks, using backslashes to escape spaces:
-
-```bash
-$ dfeal /home/username/Documents/subtitle\ file.srt
-# Possible result: { language: french, encoding: CP1252, confidence: { encoding: 0.97, language: 0.97 } }
-```
-
 ## Supported Languages
 
 - Polish

bin/cli.js

Lines changed: 5 additions & 12 deletions
@@ -1,20 +1,13 @@
 #!/usr/bin/env node
-
 const languageEncoding = require("../src/index-node.js");
 
 const path = process.argv[2];
-
+const notEnoughArguments = process.argv.length < 3;
 const tooManyArguments = process.argv[3];
 
-if (tooManyArguments)
-  console.log(
-    "Error! Too many arguments passed in. Only one argument can be passed in. If your path or file name contain spaces, try to surround the whole file path with quotes!"
-  );
+if (notEnoughArguments) console.error('Error: No argument passed in. Please pass in the file path as an argument! If the path contains spaces, surround it with quotes or use backslashes to escape spaces.');
+if (tooManyArguments) console.warn('Warning: Too many arguments passed in. Ignoring all extra arguments. Only one argument (the file path) can be passed in! If the path contains spaces, surround it with quotes or use backslashes to escape spaces.');
 
 languageEncoding(path)
-  .then((fileInfo) => {
-    console.log(JSON.stringify(fileInfo, null, 4));
-  })
-  .catch((error) => {
-    console.log(error);
-  });
+  .then((fileInfo) => console.info(JSON.stringify(fileInfo, null, 4)))
+  .catch((error) => console.error(error))
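
Review note: in the new `bin/cli.js`, the missing-argument branch logs an error, but the script still goes on to call `languageEncoding(path)` with `path` undefined. A minimal sketch of one way to separate the argument check from the detection call is shown here. The helper name `checkArguments` and its return shape are hypothetical and not part of the package, and the messages are shortened from the ones in the commit.

```javascript
// Hypothetical helper (not part of the package): classify the CLI arguments
// before any detection work starts.
function checkArguments(argv) {
  // argv mirrors process.argv: [nodeBinary, scriptPath, ...userArguments]
  const userArguments = argv.slice(2);

  if (userArguments.length === 0) {
    return {
      ok: false,
      level: "error",
      path: null,
      message: "Error: No argument passed in. Please pass in the file path as an argument!",
    };
  }

  if (userArguments.length > 1) {
    return {
      ok: true,
      level: "warn",
      path: userArguments[0],
      message: "Warning: Too many arguments passed in. Ignoring all extra arguments.",
    };
  }

  return { ok: true, level: null, path: userArguments[0], message: null };
}

// Usage sketch with a sample argv; a real CLI would pass process.argv and
// only call the detector when `ok` is true.
const sample = checkArguments(["node", "cli.js", "a.srt", "b.srt"]);
if (sample.level === "warn") console.warn(sample.message);
if (sample.level === "error") console.error(sample.message);
```

Returning a plain object keeps the check free of side effects, so the caller decides whether to log, set `process.exitCode`, or continue.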

package-lock.json

Lines changed: 2 additions & 2 deletions
Some generated files are not rendered by default.

package.json

Lines changed: 3 additions & 3 deletions
@@ -1,12 +1,12 @@
 {
   "name": "detect-file-encoding-and-language",
-  "version": "2.1.0",
+  "version": "2.2.0",
   "description": "Charset Detector - Detect the encoding and language of text files - Use it in the browser, with Node.js, or via CLI",
   "main": "src/index-node.js",
   "types": "src/index-node.d.ts",
   "scripts": {
-    "regextest": "node ./testing/regexTester.test.js",
-    "test": "node ./testing/language-encoding.test.js",
+    "file": "node ./bin/cli.js",
+    "test": "node ./testing/subtitle-database.test.js",
     "build": "browserify ./src/index-browser.js --standalone languageEncoding > ./umd/language-encoding.min.js",
     "minify": "uglifyjs ./umd/language-encoding.min.js --compress --output ./umd/language-encoding.min.js",
     "prepublishOnly": "npm test"

src/components/processContent.js

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,6 @@
 const countAllMatches = require("./processing-content/countAllMatches.js");
 const calculateConfidenceScore = require("./processing-content/calculateConfidenceScore.js");
+const byteOrderMarkObject = require("../config/byteOrderMarkObject.js");
 
 module.exports = (data, fileInfo) => {
   data.languageArr = countAllMatches(data, fileInfo.encoding);
@@ -31,6 +32,11 @@ module.exports = (data, fileInfo) => {
   if (!data.languageArr[data.pos].count) {
     fileInfo.language = null;
     fileInfo.confidence.language = null;
+
+    if (!byteOrderMarkObject.includes(fileInfo.encoding)) {
+      fileInfo.encoding = null;
+      fileInfo.confidence.encoding = null;
+    }
   }
 
   return fileInfo;
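
Review note: the added branch clears the encoding as well as the language when no language pattern matched, unless the encoding is one that `byteOrderMarkObject` lists. The following is a self-contained sketch of that rule. The contents of `byteOrderMarkObject.js` are not shown in this diff, so the `bomEncodings` array and the function name `clearUnmatchedResult` are assumptions made for illustration only.

```javascript
// Hypothetical stand-in for src/config/byteOrderMarkObject.js; the real
// module's contents are not visible in this commit.
const bomEncodings = ["UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE"];

// Hypothetical helper: apply the "nothing matched" rule from the hunk above.
function clearUnmatchedResult(fileInfo, topMatchCount) {
  // At least one language pattern matched: keep the result untouched.
  if (topMatchCount) return fileInfo;

  fileInfo.language = null;
  fileInfo.confidence.language = null;

  // Only keep the encoding when it is one of the listed BOM encodings;
  // otherwise there is no evidence left to support it.
  if (!bomEncodings.includes(fileInfo.encoding)) {
    fileInfo.encoding = null;
    fileInfo.confidence.encoding = null;
  }

  return fileInfo;
}
```

With this rule, a guessed `CP1252` with zero matches becomes `null`, while a `UTF-8` result keeps its encoding even though the language is unknown.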

src/index-browser.js

Lines changed: 1 addition & 3 deletions
@@ -18,9 +18,7 @@ module.exports = (file) => {
   const byteOrderMarkBuffer = new FileReader();
 
   byteOrderMarkBuffer.onload = () => {
-    const uInt8String = new Uint8Array(byteOrderMarkBuffer.result)
-      .slice(0, 4)
-      .join(" ");
+    const uInt8String = new Uint8Array(byteOrderMarkBuffer.result).slice(0, 4).join(" ");
     const byteOrderMark = checkByteOrderMark(uInt8String);
 
     if (byteOrderMark) {
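
Review note: the line reformatted here turns the first four bytes of the file into a space-separated decimal string before handing it to `checkByteOrderMark`, whose implementation is not part of this diff. A rough sketch of how such a string could be matched against the standard Unicode byte order marks follows. The table and the name `detectByteOrderMark` are illustrative assumptions, not the package's own code.

```javascript
// Standard Unicode byte order marks as space-separated decimal bytes.
// Longer marks come first so that UTF-32LE (255 254 0 0) is not mistaken
// for UTF-16LE (255 254).
const BYTE_ORDER_MARKS = [
  { encoding: "UTF-32LE", bytes: "255 254 0 0" },
  { encoding: "UTF-32BE", bytes: "0 0 254 255" },
  { encoding: "UTF-8", bytes: "239 187 191" },
  { encoding: "UTF-16LE", bytes: "255 254" },
  { encoding: "UTF-16BE", bytes: "254 255" },
];

// Hypothetical helper: returns the encoding named by the BOM, or null.
function detectByteOrderMark(arrayBuffer) {
  // Same transformation as in the hunk above: first four bytes, joined by spaces.
  const uInt8String = new Uint8Array(arrayBuffer).slice(0, 4).join(" ");

  // Match whole byte values only, hence the trailing space in the prefix test.
  const match = BYTE_ORDER_MARKS.find(
    ({ bytes }) => uInt8String === bytes || uInt8String.startsWith(bytes + " ")
  );

  return match ? match.encoding : null;
}
```

In the browser, the `arrayBuffer` argument would be the `result` of a `FileReader` after `readAsArrayBuffer`; the sketch itself only needs `Uint8Array`.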

src/index-node.d.ts

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 export interface FileInfo {
-  encoding: null | 'UTF-EBCDIC' | 'GB-18030' | 'GB18030' | 'UTF-32LE' | 'UTF-32BE' | 'UTF-8' | 'UTF-7' | 'UTF-1' | 'SCSU' | 'BOCU-1' | 'UTF-16BE' | 'UTF-16LE' | 'latin1' | 'ISO-8859-1' | 'CP1250' | 'CP1251' | 'CP1252' | 'CP1253' | 'CP1254' | 'CP1255' | 'CP1256' | 'CP1257' | 'BIG5' | 'Shift-JIS' | 'EUC-KR' | 'TIS-620';
+  encoding: null | 'UTF-EBCDIC' | 'GB-18030' | 'UTF-32LE' | 'UTF-32BE' | 'UTF-8' | 'UTF-7' | 'UTF-1' | 'SCSU' | 'BOCU-1' | 'UTF-16BE' | 'UTF-16LE' | 'latin1' | 'ISO-8859-1' | 'CP1250' | 'CP1251' | 'CP1252' | 'CP1253' | 'CP1254' | 'CP1255' | 'CP1256' | 'CP1257' | 'BIG5' | 'Shift-JIS' | 'EUC-KR' | 'TIS-620';
   language: null | 'polish' | 'czech' | 'hungarian' | 'romanian' | 'slovak' | 'slovenian' | 'albanian' | 'russian' | 'ukrainian' | 'bulgarian' | 'english' | 'french' | 'portuguese' | 'spanish' | 'german' | 'italian' | 'danish' | 'norwegian' | 'swedish' | 'dutch' | 'finnish' | 'serbo-croatian' | 'estonian' | 'icelandic' | 'malay-indonesian' | 'greek' | 'turkish' | 'hebrew' | 'arabic' | 'farsi-persian' | 'lithuanian' | 'chinese-simplified' | 'chinese-traditional' | 'japanese' | 'korean' | 'thai' | 'bengali' | 'hindi' | 'urdu' | 'vietnamese';
   confidence: {
     encoding: null | number;
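
Review note: `FileInfo` types `encoding`, `language`, and `confidence.encoding` as nullable, so callers should not assume a value is present. A small sketch of null-safe handling on the consumer side is given here; `describeFileInfo` is a hypothetical helper, not something the package exports, and it assumes `confidence.language` is nullable in the same way as `confidence.encoding`.

```javascript
// Hypothetical consumer-side helper: turn a FileInfo-shaped object into a
// short human-readable summary without assuming any field is non-null.
function describeFileInfo(fileInfo) {
  const { encoding, language, confidence } = fileInfo;

  if (encoding === null && language === null) {
    return "No encoding or language detected";
  }

  // Format one field, falling back to "unknown" when the value is null.
  const part = (label, value, score) =>
    value === null
      ? `${label}: unknown`
      : `${label}: ${value} (${Math.round(score * 100)}%)`;

  return [
    part("encoding", encoding, confidence.encoding),
    part("language", language, confidence.language),
  ].join(", ");
}
```

A caller could pass the object resolved by `languageEncoding(...)` straight into this function.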

testing/regexTester.test.js

Lines changed: 0 additions & 15 deletions
This file was deleted.
