Commit b926d24

[ANE-2616] Snippet Scan docs (#1615)
1 parent e2211ce commit b926d24

4 files changed

+144
-61
lines changed

Changelog.md

Lines changed: 1 addition & 0 deletions
@@ -2,6 +2,7 @@
22

33
## 3.13.1
44
- Add a summary of the snippet scan when the `--x-snippet-scan` flag is used ([#1613](https://github.com/fossas/fossa-cli/pull/1613))
5+
- Update snippet scanning documentation ([#1615](https://github.com/fossas/fossa-cli/pull/1615))
56

67
## 3.13.0
78
- Change how debug logs are generated. They are now generated in a file called fossa.debug.zip, which can contain multiple files. For the common case of `fossa analyze --debug`, it will now contain the debug bundle (fossa.debug.json) and the telemetry json (fossa.telemetry.json). It will also contain Ficus logs if Ficus is run via --x-snippet-scan ([#1610](https://github.com/fossas/fossa-cli/pull/1610))

docs/features/snippet-scanning.md

Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
1+
2+
# Snippet Scanning
3+
4+
Snippet scanning identifies potential open source code snippets within your first-party source code by comparing file fingerprints against FOSSA's knowledge base. This feature helps detect code that may have been copied from open source projects.
5+
6+
Snippet Scanning runs as part of `fossa analyze`. To enable it, add the `--x-snippet-scan` flag when you run `fossa analyze`:
7+
8+
```
9+
fossa analyze --x-snippet-scan
10+
```
11+
12+
Snippet Scanning must also be enabled for your organization, and is only available for enterprise customers. If you would like to enable it for your organization, please [contact us](https://support.fossa.com).
13+
14+
## How Snippet Scanning Works
15+
16+
When `--x-snippet-scan` is enabled, the CLI:
17+
18+
1. **Hashes Files First**: Creates CRC64 hashes of all source files to identify which files need fingerprinting
19+
2. **Checks Necessity of Fingerprinting**: Checks with FOSSA servers to determine which file hashes are already known
20+
3. **Fingerprints New or Changed Files**: Uses the Ficus fingerprinting engine to create cryptographic fingerprints only for files not previously seen
21+
4. **Filters Content**: By default, skips hidden directories such as `.git/`. It also skips any paths matched by the `vendoredDependencies.licenseScanPathFilters.exclude` setting in `.fossa.yml`, documented further below.
22+
5. **Uploads Fingerprints**: Sends only the fingerprints to FOSSA's servers
23+
6. **Receives Matches**: Gets back information about any matching open source components
24+
7. **Uploads Match Contents**: For files that have matches, uploads source code content temporarily to FOSSA servers (see below for details on what is sent and how long it is retained).
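
The hash-first part of this flow (steps 1 to 3) can be sketched roughly as follows. This is an illustration only, not the CLI's implementation: the docs do not say which CRC64 variant is used, so the sketch assumes CRC-64/XZ parameters, and `files_to_fingerprint` and the `known_hashes` set are hypothetical stand-ins for the check against FOSSA's servers.

```python
CRC64_XZ_POLY = 0xC96C5795D7870F42  # reflected polynomial (assumed variant)
MASK64 = 0xFFFFFFFFFFFFFFFF


def crc64(data: bytes) -> int:
    """Bitwise CRC-64 using CRC-64/XZ parameters (an assumption, see above)."""
    crc = MASK64
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ CRC64_XZ_POLY if crc & 1 else crc >> 1
    return crc ^ MASK64


def files_to_fingerprint(files, known_hashes):
    """Return the paths whose content hash is not in the known set."""
    return sorted(
        path for path, content in files.items()
        if crc64(content) not in known_hashes
    )


files = {
    "a.c": b"int main(void) { return 0; }\n",
    "b.c": b"/* new file */\n",
}
known = {crc64(files["a.c"])}  # pretend a.c was seen in an earlier scan
print(files_to_fingerprint(files, known))
```

Only the files returned here would go on to the (more expensive) fingerprinting step, which is why repeat scans of a mostly unchanged codebase are faster.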
25+
26+
## Data Sent to FOSSA
27+
28+
**For Performance Optimization:**
29+
- CRC64 hashes of all files, to avoid re-fingerprinting unchanged files.
30+
31+
**For Fingerprinting:**
32+
- Fingerprints of source code to identify matches.
33+
34+
**For Matched Files Only:**
35+
- The full content of all files that contain snippet matches.
36+
37+
## Data Retention
38+
39+
- **File Fingerprints**: Stored permanently for caching and performance optimization
40+
- **Source Code Content**: Stored temporarily for 30 days and then automatically deleted
41+
- **CRC64 Hashes**: The likelihood of a collision with CRC64 (2^64 possible values) is extremely low.
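
To put a rough number on "extremely low", the standard birthday approximation can be applied. The sketch below assumes that hashes behave like uniformly random 64-bit values, which is a modeling assumption (CRC64 is not designed to resist deliberately crafted collisions); the function name is illustrative.

```python
import math


def crc64_collision_probability(n_files: int) -> float:
    """Approximate chance that any two of n_files share a 64-bit hash."""
    pairs = n_files * (n_files - 1) / 2
    # 1 - exp(-pairs / 2^64), computed with expm1 to stay accurate near zero.
    return -math.expm1(-pairs / 2**64)


print(crc64_collision_probability(1_000_000))
```

Under that assumption, a codebase of one million distinct files has a collision probability on the order of a few in a hundred million.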
42+
43+
## Directory Filtering
44+
45+
By default, snippet scanning excludes common non-production directories and follows `.gitignore` patterns:
46+
47+
- Hidden directories.
48+
- Globs as directed by `.gitignore` files.
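
As a rough illustration of the first default, the sketch below walks a directory tree and prunes hidden directories. It is not how Ficus is implemented, `visible_files` is a hypothetical helper, and `.gitignore` handling is deliberately left out.

```python
import os
import tempfile


def visible_files(root):
    """Yield files under root, skipping directories whose name starts with a dot."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            yield os.path.relpath(os.path.join(dirpath, name), root)


with tempfile.TemporaryDirectory() as root:
    for rel in (".git/config", "src/main.c", "src/.cache/tmp.o"):
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    found = sorted(visible_files(root))

print(found)
```

Only `src/main.c` survives: `.git/` is hidden at the top level and `src/.cache/` is hidden one level down.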
49+
50+
#### Custom Exclude Filtering
51+
52+
You can customize which files and directories are excluded from snippet scanning by configuring exclude filters in your `.fossa.yml` file. Note that snippet scanning currently only supports exclude patterns, not `only` patterns.
53+
54+
For example:
55+
```yaml
56+
version: 3
57+
vendoredDependencies:
58+
licenseScanPathFilters:
59+
exclude:
60+
- "**/test/**"
61+
- "**/tests/**"
62+
- "**/spec/**"
63+
- "**/node_modules/**"
64+
- "**/dist/**"
65+
- "**/build/**"
66+
- "**/*.test.js"
67+
- "**/*.spec.ts"
68+
```
69+
70+
**Important Notes:**
71+
72+
- Snippet scanning only uses the `exclude` filters from `licenseScanPathFilters`; `only` filters are ignored.
73+
- Path filters use standard glob patterns (e.g., `**/*` for recursive matching, `*` for single-directory matching).
74+
- The configuration goes in the `vendoredDependencies.licenseScanPathFilters.exclude` section.
75+
- These exclude patterns are passed directly to the Ficus fingerprinting engine as `--exclude` arguments.
76+
- Default exclusions (hidden files, `.gitignore` patterns) are applied in addition to custom excludes.
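
To see how patterns like the ones above select paths, here is a small matcher. The semantics are an assumption modeled on gitignore-style globs (`**/` matches zero or more directories, `*` stays within one path segment); the exact matching rules Ficus applies may differ, and `glob_to_regex` and `is_excluded` are hypothetical helpers.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob into a regex under the assumed semantics."""
    i, out = 0, []
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more whole directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(r".*")           # anything, including "/"
            i += 2
        elif pattern[i] == "*":
            out.append(r"[^/]*")        # anything within one path segment
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def is_excluded(path, patterns):
    """True if the relative path matches any exclude pattern."""
    return any(glob_to_regex(p).fullmatch(path) for p in patterns)


excludes = ["**/test/**", "**/*.test.js"]
for path in ("src/test/helper.js", "src/app.test.js", "src/latest/app.js"):
    print(path, is_excluded(path, excludes))
```

Under these assumed rules the first two paths are excluded and the third is not, because `**/test/**` only matches a directory named exactly `test`.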
77+
78+
## A note on scan times
79+
80+
The first time you run a snippet scan on a codebase, it may take a long time to scan. For example, scanning [Linux](https://github.com/torvalds/linux) for the first time takes around 60 minutes. This is because most of the files in your codebase will not exist in FOSSA's knowledge base, and we will need to fingerprint and compare all of them to our snippet scan corpus.
81+
82+
However, the next time you scan that codebase we will only need to re-fingerprint and compare files that have changed since the previous scan, and the scan will be much faster. For example, if you snippet scan that same revision of Linux a second time, the scan will complete in less than a minute.
83+
84+
Because of this speed difference, we recommend running one manual scan of your project before enabling Snippet Scanning in CI. This avoids running multiple slow scans, because any scan started before the first scan completes will also be slow.
85+
86+
The time it takes to scan newer versions of your codebase will depend on how many files in the new version have not been previously scanned. A file has been previously scanned if the exact same file has ever been snippet scanned. FOSSA recommends snippet scanning your codebase on a regular basis to keep scan times low.
87+
88+
## The Snippet Scan Summary
89+
<!-- Note: this section is linked to from the snippet scan summary in src/App/Fossa/Ficus/Analyze.hs. So if you change this heading name or the path
90+
to this file, you will need to update the link there as well -->
91+
92+
When a Snippet Scan completes, the CLI will output a summary of the scan. It will look like this:
93+
94+
```
95+
============================================================
96+
Snippet scan summary:
97+
Analysis ID: 110054
98+
Bucket ID: 110551
99+
Files skipped: 6
100+
Total Files processed: 18
101+
Unique Files processed: 13
102+
Unique Files with matches found: 4
103+
Unique Files with no matches found: 9
104+
Unique Files already in our knowledge base: 11
105+
Unique Files new to our knowledge base: 2
106+
Processing time: 0.087s
107+
============================================================
108+
```
109+
110+
Here is a description of what each line means:
111+
112+
<dl>
113+
<dt>Analysis ID</dt>
114+
<dd>The ID of the Snippet Scan analysis stored in FOSSA's servers. This is used by FOSSA's support team.</dd>
115+
116+
<dt>Bucket ID</dt>
117+
<dd>The ID of the temporary storage bucket where we store Snippet Scan results before processing them. This is used by FOSSA's support team.</dd>
118+
119+
<dt>Files skipped</dt>
120+
<dd>The number of files skipped during the Snippet Scan.</dd>
121+
122+
<dt>Total Files processed</dt>
123+
<dd>The number of files processed during the Snippet Scan. This count includes every processed file, even if the same file contents are included multiple times.</dd>
124+
125+
<dt>Unique Files processed</dt>
126+
<dd>The number of unique files processed during the Snippet Scan. If we scan multiple files with the same contents, they will only be counted once.</dd>
127+
128+
<dt>Unique Files with matches found</dt>
129+
<dd>The number of unique files where we found a potential match to Open Source code.</dd>
130+
131+
<dt>Unique Files with no matches found</dt>
132+
<dd>The number of unique files where no potential matches to Open Source code were found.</dd>
133+
134+
<dt>Unique Files already in our knowledge base</dt>
135+
<dd>The number of unique files that already exist in FOSSA's knowledge base. These files do not need to be fingerprinted.</dd>
136+
137+
<dt>Unique Files new to our knowledge base</dt>
138+
<dd>The number of unique files that do not exist in FOSSA's knowledge base. These files needed to be fingerprinted in this Snippet Scan.</dd>
139+
</dl>
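
If you want to read these numbers in a script, a minimal parsing sketch follows. It assumes the `Label: value` layout shown in the sample above; the summary is meant for humans and its format may change, and `parse_summary` is a hypothetical helper rather than part of the CLI.

```python
SAMPLE = """\
============================================================
Snippet scan summary:
 Analysis ID: 110054
 Bucket ID: 110551
 Files skipped: 6
 Total Files processed: 18
 Unique Files processed: 13
 Unique Files with matches found: 4
 Unique Files with no matches found: 9
 Unique Files already in our knowledge base: 11
 Unique Files new to our knowledge base: 2
 Processing time: 0.087s
============================================================
"""


def parse_summary(text):
    """Turn 'Label: value' lines into a dict; lines without a value are skipped."""
    stats = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and value.strip():
            stats[label.strip()] = value.strip()
    return stats


stats = parse_summary(SAMPLE)
unique = int(stats["Unique Files processed"])
matched = int(stats["Unique Files with matches found"])
unmatched = int(stats["Unique Files with no matches found"])
known = int(stats["Unique Files already in our knowledge base"])
new = int(stats["Unique Files new to our knowledge base"])

print(matched + unmatched == unique)  # 4 + 9 == 13
print(known + new == unique)          # 11 + 2 == 13
```

In the sample output, the matched and unmatched counts add up to the unique file count, as do the known and new counts, which is a quick way to sanity-check how the lines relate to each other.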

docs/references/subcommands/analyze.md

Lines changed: 3 additions & 61 deletions
@@ -152,69 +152,11 @@ Snippet scanning identifies potential open source code snippets within your firs
152152
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
153153
| `--x-snippet-scan` | Enable snippet scanning during analysis. This experimental feature fingerprints your source files and checks them against FOSSA's snippet database. |
154154

155-
#### How Snippet Scanning Works
155+
Snippet Scanning must also be enabled for your organization, and is only available for enterprise customers. If you would like to enable it for your organization, please [contact us](https://support.fossa.com).
156156

157-
When `--x-snippet-scan` is enabled, the CLI:
157+
#### More detail
158158

159-
1. **Hashes Files First**: Creates CRC64 hashes of all source files to identify which files need fingerprinting
160-
2. **Checks Necessity of Fingerprinting**: Checks with FOSSA servers to determine which file hashes are already known
161-
3. **Fingerprints New or Changed Files**: Uses the Ficus fingerprinting engine (written in Rust) to create cryptographic fingerprints only for files not previously seen
162-
4. **Filters Content**: By default, skips directories like `.git/`, and hidden directories. This includes, from `.fossa.yml`, `vendoredDependencies.licenseScanPathFilters.exclude`, documented further below.
163-
5. **Uploads Fingerprints**: Sends only the fingerprints to FOSSA's servers
164-
6. **Receives Matches**: Gets back information about any matching open source components
165-
7. **Uploads Match Contents**: For files that have matches, uploads source code content temporarily to FOSSA servers.
166-
167-
#### Data Sent to FOSSA
168-
169-
**For Performance Optimization:**
170-
- CRC64 hashes of all files, to avoid re-fingerprinting unchanged files.
171-
172-
**For Fingerprinting:**
173-
- Fingerprints of source code to identify matches.
174-
175-
**For Matched Files Only:**
176-
- The actual source code content of files that contain snippet matches.
177-
178-
#### Data Retention
179-
180-
- **File Fingerprints**: Stored permanently for caching and performance optimization
181-
- **Source Code Content**: Stored temporarily for 30 days and then automatically deleted
182-
- **CRC64 Hashes**: The likelihood of a collision with CRC64 (2^64 possible values) is extremely low.
183-
184-
#### Directory Filtering
185-
186-
By default, snippet scanning excludes common non-production directories and follows `.gitignore` patterns:
187-
188-
- Hidden directories.
189-
- Globs as directed by `.gitignore` files.
190-
191-
#### Custom Exclude Filtering
192-
193-
You can customize which files and directories are excluded from snippet scanning by configuring exclude filters in your `.fossa.yml` file. Note that snippet scanning currently only supports exclude patterns, not `only` patterns.
194-
195-
For example:
196-
```yaml
197-
version: 3
198-
vendoredDependencies:
199-
licenseScanPathFilters:
200-
exclude:
201-
- "**/test/**"
202-
- "**/tests/**"
203-
- "**/spec/**"
204-
- "**/node_modules/**"
205-
- "**/dist/**"
206-
- "**/build/**"
207-
- "**/*.test.js"
208-
- "**/*.spec.ts"
209-
```
210-
211-
**Important Notes:**
212-
213-
- Snippet scanning only uses the `exclude` filters from `licenseScanPathFilters` - `only` filters are ignored for this use-case.
214-
- Path filters use standard glob patterns (e.g., `**/*` for recursive matching, `*` for single-directory matching).
215-
- The configuration goes in the `vendoredDependencies.licenseScanPathFilters.exclude` section.
216-
- These exclude patterns are passed directly to the Ficus fingerprinting engine as `--exclude` arguments.
217-
- Default exclusions (hidden files, `.gitignore` patterns) are applied in addition to custom excludes.
159+
For more detail about how Snippet Scanning works, how to filter files during Snippet Scanning, what information is sent to FOSSA's servers, and how to read the Snippet Scan Summary, see [the Snippet Scanning feature documentation](../../features/snippet-scanning.md).
218160

219161
### Experimental Options
220162

src/App/Fossa/Ficus/Analyze.hs

Lines changed: 1 addition & 0 deletions
@@ -158,6 +158,7 @@ formatFicusScanSummary results =
158158
, " Unique Files new to our knowledge base: " <> toText (show $ ficusStatsUniqueNewFiles stats)
159159
, " Processing time: " <> formatProcessingTime (ficusStatsProcessingTimeSeconds stats) <> "s"
160160
, "============================================================"
161+
, "See the docs for an explanation of this summary: https://github.com/fossas/fossa-cli/blob/master/docs/features/snippet-scanning.md#the-snippet-scan-summary"
161162
]
162163
where
163164
-- Format the processing time as a string with 3 decimal places
