Merged
Changes from 28 commits
30 commits
162aadc
parse ficus summary data
spatten Nov 7, 2025
675a7a1
move types into Type.hs
spatten Nov 7, 2025
ff8d6ed
clean up the output
spatten Nov 7, 2025
27a2b3c
fix a test
spatten Nov 7, 2025
4bf0efe
update stats struct
spatten Nov 8, 2025
3b02a72
add a newline
spatten Nov 8, 2025
76c9e1f
fix up summary output
spatten Nov 13, 2025
2e2638b
Merge branch 'master' into snippet-scan-summary
spatten Nov 13, 2025
7132551
use toText
spatten Nov 13, 2025
b391918
add a comment
spatten Nov 13, 2025
a97faa2
just use printf
spatten Nov 13, 2025
5260a5f
add a changelog
spatten Nov 15, 2025
719756a
Merge branch 'master' into snippet-scan-summary
spatten Nov 18, 2025
519b701
first add of docs
spatten Nov 15, 2025
65dacff
updates
spatten Nov 15, 2025
5f537d0
more fingerprint info
spatten Nov 15, 2025
b67fb4e
move most of the docs to features/snippet-scanning.md
spatten Nov 17, 2025
2ba5e4d
remove unused images
spatten Nov 17, 2025
e756e00
fix a link
spatten Nov 17, 2025
ed05b18
use HTML definition lists
spatten Nov 17, 2025
cb4c686
cleanup
spatten Nov 17, 2025
295ded1
link to the snippet scan summary explanation
spatten Nov 17, 2025
4a7aa2a
less verbose
spatten Nov 17, 2025
fd40380
it should be the master branch, not the main
spatten Nov 17, 2025
130e4c0
update changelog and put a warning about changing the heading title
spatten Nov 17, 2025
4cd4a6a
gitignore fossa.debug.zip
spatten Nov 20, 2025
4e12d6d
Merge remote-tracking branch 'origin/master' into snippet-scan-summary
spatten Nov 20, 2025
1629aff
Merge branch 'snippet-scan-summary' into ANE-2616-snippet-scan-docs
spatten Nov 20, 2025
98e2db1
PR comments
spatten Nov 21, 2025
51ee41e
Merge branch 'master' into ANE-2616-snippet-scan-docs
spatten Nov 21, 2025
4 changes: 4 additions & 0 deletions Changelog.md
@@ -1,5 +1,9 @@
# FOSSA CLI Changelog

## 3.13.1
- Add a summary of the snippet scan when the `--x-snippet-scan` flag is used ([#1613](https://github.com/fossas/fossa-cli/pull/1613))
- Update snippet scanning documentation ([#1615](https://github.com/fossas/fossa-cli/pull/1615))

## 3.13.0
- Change how debug logs are generated. They are now generated in a file called fossa.debug.zip, which can contain multiple files. For the common case of `fossa analyze --debug`, it will now contain the debug bundle (fossa.debug.json) and the telemetry json (fossa.telemetry.json). It will also contain Ficus logs if Ficus is run via --x-snippet-scan ([#1610](https://github.com/fossas/fossa-cli/pull/1610))

137 changes: 137 additions & 0 deletions docs/features/snippet-scanning.md
@@ -0,0 +1,137 @@

# Snippet Scanning

Snippet scanning identifies potential open source code snippets within your first-party source code by comparing file fingerprints against FOSSA's knowledge base. This feature helps detect code that may have been copied from open source projects.

Snippet Scanning runs as part of `fossa analyze`. To enable it, add the `--x-snippet-scan` flag when you run `fossa analyze`:

```
fossa analyze --x-snippet-scan
```

Snippet Scanning must also be enabled for your organization, and is only available for enterprise customers. If you would like to enable it for your organization, please [contact us](https://support.fossa.com).

Contributor

Is there a simple way to verify that it is enabled? It may be good to describe that if so.

Contributor Author

Not that I can think of, unfortunately. I'm going to just leave this and maybe we can fix it if it turns out to be a problem in the future


## How Snippet Scanning Works

When `--x-snippet-scan` is enabled, the CLI:

1. **Hashes Files First**: Creates CRC64 hashes of all source files to identify which files need fingerprinting
2. **Checks Necessity of Fingerprinting**: Checks with FOSSA servers to determine which file hashes are already known
3. **Fingerprints New or Changed Files**: Uses the Ficus fingerprinting engine (written in Rust) to create cryptographic fingerprints only for files not previously seen
4. **Filters Content**: By default, skips hidden directories such as `.git/`. Also skips any paths matched by `vendoredDependencies.licenseScanPathFilters.exclude` in `.fossa.yml`, documented further below.
5. **Uploads Fingerprints**: Sends only the fingerprints to FOSSA's servers
6. **Receives Matches**: Gets back information about any matching open source components
7. **Uploads Match Contents**: For files that have matches, uploads source code content temporarily to FOSSA servers (see Data Retention below for how long it is kept).

Contributor

Can you characterize temporarily a bit more? Is this optional? If I were a customer reading this I'd want more details since the idea that you're uploading source code could be a bit alarming.

Contributor

Ah, I see that you wrote about it more down below. Maybe reference that section here for people like me who freak out before reading the whole doc?

Contributor Author

Good point. Done!
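
Steps 1 to 3 in the list above amount to a cache lookup: only files whose hash the server does not already know go on to fingerprinting. The following is a minimal Haskell sketch of that partitioning step, not the CLI's actual implementation. The `FileHash` alias, the `needsFingerprint` function, and the string hashes are all hypothetical stand-ins; the real CLI uses CRC64 hashes and asks FOSSA's servers which ones are known.

```haskell
import Data.List (partition)

-- Hypothetical stand-in for a CRC64 hash.
type FileHash = String

-- Split files into (needs fingerprinting, already known), given the
-- hashes the server reports as known. The server round trip is
-- modelled here as a plain list.
needsFingerprint ::
  [FileHash] ->
  [(FilePath, FileHash)] ->
  ([(FilePath, FileHash)], [(FilePath, FileHash)])
needsFingerprint known = partition (\(_, h) -> h `notElem` known)

main :: IO ()
main = do
  let files = [("src/a.c", "h1"), ("src/b.c", "h2"), ("src/c.c", "h3")]
      (new, cached) = needsFingerprint ["h1", "h3"] files
  putStrLn $ "to fingerprint: " ++ show (map fst new)
  putStrLn $ "already known:  " ++ show (map fst cached)
```

Running `main` lists `src/b.c` as the only file to fingerprint, since its hash is absent from the known list.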


## Data Sent to FOSSA

**For Performance Optimization:**
- CRC64 hashes of all files, to avoid re-fingerprinting unchanged files.

**For Fingerprinting:**
- Fingerprints of source code to identify matches.

**For Matched Files Only:**
- The actual source code content of files that contain snippet matches.

## Data Retention

- **File Fingerprints**: Stored permanently for caching and performance optimization
- **Source Code Content**: Stored temporarily for 30 days and then automatically deleted
- **CRC64 Hashes**: The likelihood of a collision with CRC64 (2^64 possible values) is extremely low.

## Directory Filtering

By default, snippet scanning excludes common non-production directories and follows `.gitignore` patterns:

- Hidden directories.
- Globs as directed by `.gitignore` files.

### Custom Exclude Filtering

You can customize which files and directories are excluded from snippet scanning by configuring exclude filters in your `.fossa.yml` file. Note that snippet scanning currently only supports exclude patterns, not `only` patterns.

Contributor

Is this last sentence meant to contrast snippet filtering with our currently existing ones?


For example:
```yaml
version: 3
vendoredDependencies:
  licenseScanPathFilters:
    exclude:
      - "**/test/**"
      - "**/tests/**"
      - "**/spec/**"
      - "**/node_modules/**"
      - "**/dist/**"
      - "**/build/**"
      - "**/*.test.js"
      - "**/*.spec.ts"
```

**Important Notes:**

- Snippet scanning only uses the `exclude` filters from `licenseScanPathFilters`; `only` filters are ignored for this use case.
- Path filters use standard glob patterns (e.g., `**/*` for recursive matching, `*` for single-directory matching).
- The configuration goes in the `vendoredDependencies.licenseScanPathFilters.exclude` section.
- These exclude patterns are passed directly to the Ficus fingerprinting engine as `--exclude` arguments.
- Default exclusions (hidden files, `.gitignore` patterns) are applied in addition to custom excludes.
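
As a rough illustration of the note about `--exclude` arguments, the sketch below turns a list of exclude globs into a flat argument list. It assumes one `--exclude` flag followed by one glob per pattern; the exact argument layout that the CLI passes to Ficus is not shown here, and `excludeArgs` is a hypothetical helper name.

```haskell
-- Assumption: each glob becomes its own "--exclude <glob>" pair.
-- The real argument layout used by the CLI may differ.
excludeArgs :: [String] -> [String]
excludeArgs = concatMap (\glob -> ["--exclude", glob])

main :: IO ()
main = mapM_ putStrLn (excludeArgs ["**/test/**", "**/*.spec.ts"])
```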

## A note on scan times

The first time you run a snippet scan on a codebase, it may take a long time to scan. For example, scanning [Linux](https://github.com/torvalds/linux) for the first time takes around 60 minutes. This is because most of the files in your codebase will not exist in FOSSA's knowledge base, and we will need to fingerprint and compare all of them to our snippet scan corpus.

Contributor

Should we recommend running an initial manual scan to "prime" before turning this on in CI?

I could see someone naively just turning this on in CI and having a ton of jobs (due to multiple simultaneous pushes/revisions) all start doing the full-scan. I think that'd be bad for Sparkle and also a poor experience for the customer.

Contributor

Hmm, maybe a future version of this (if it's a problem at all) could use content to know if two scans are basically the same and then only let one of them proceed while the others just wait.

Contributor Author

I like the idea of recommending that they do an initial scan. I think this takes care of 99% of the problem, as those hypothetical parallel scans will then do almost no work

Contributor

It may be worth communicating this to support as well.

Contributor Author

Yes, that's a good point. I'll mention it in the support and snippet scanning channel


However, the next time you scan that codebase we will only need to re-fingerprint and compare files that have changed since the previous scan, and the scan will be much faster. For example, if you snippet scan that same revision of Linux a second time, the scan will complete in less than a minute.

The time it takes to scan newer versions of your codebase will depend on how many files in the new version have not been previously scanned. A file counts as previously scanned if a file with exactly the same contents has ever been snippet scanned. FOSSA recommends snippet scanning your codebase on a regular basis to keep scan times low.

## The Snippet Scan Summary
<!-- Note: this section is linked to from the snippet scan summary in src/App/Fossa/Ficus/Analyze.hs. So if you change this heading name or the path
to this file, you will need to update the link there as well -->

When a Snippet Scan completes, the CLI will output a summary of the scan. It will look like this:

```
============================================================
Snippet scan summary:
Analysis ID: 110054
Bucket ID: 110551
Files skipped: 6
Total Files processed: 18
Unique Files processed: 13
Unique Files with matches found: 4
Unique Files with no matches found: 9
Unique Files already in our knowledge base: 11
Unique Files new to our knowledge base: 2
Processing time: 0.087s
============================================================
```

Here is a description of what each line means:

<dl>
<dt>Analysis ID</dt>
<dd>The ID of the Snippet Scan analysis stored on FOSSA's servers. This is used by FOSSA's support team.</dd>

<dt>Bucket ID</dt>
<dd>The ID of the temporary storage bucket where we store Snippet Scan results before processing them. This is used by FOSSA's support team.</dd>

<dt>Files skipped</dt>
<dd>The number of files skipped during the Snippet Scan.</dd>

<dt>Total Files processed</dt>
<dd>The number of files processed during the Snippet Scan. This count includes every processed file, even if the same file contents are included multiple times.</dd>

<dt>Unique Files processed</dt>
<dd>The number of unique files processed during the Snippet Scan. If we scan multiple files with the same contents, they will only be counted once.</dd>

<dt>Unique Files with matches found</dt>
<dd>The number of unique files where we found a potential match to Open Source code.</dd>

<dt>Unique Files with no matches found</dt>
<dd>The number of unique files where no potential matches to Open Source code were found.</dd>

<dt>Unique Files already in our knowledge base</dt>
<dd>The number of files that already exist in FOSSA's knowledge base. These files do not need to be fingerprinted.</dd>

<dt>Unique Files new to our knowledge base</dt>
<dd>The number of files that do not exist in FOSSA's knowledge base. These files needed to be fingerprinted in this Snippet Scan.</dd>

<dt>Processing time</dt>
<dd>The processing time reported for the Snippet Scan, in seconds, shown to three decimal places.</dd>
</dl>
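
In the sample summary above, the unique-file counts add up in two ways: files with matches plus files without matches equals unique files processed (4 + 9 = 13), and files already known plus files new to the knowledge base also equals unique files processed (11 + 2 = 13). The source does not state that these identities always hold, so treat the Haskell check below as a sanity check for reading a summary, built on a hypothetical `Summary` record, rather than a documented guarantee.

```haskell
-- Hypothetical record holding the unique-file counts from a summary.
data Summary = Summary
  { uniqueProcessed :: Int
  , matched :: Int
  , unmatched :: Int
  , existing :: Int
  , newFiles :: Int
  }

-- True when both breakdowns sum to the unique processed count.
-- This is an observation about the sample output, not a stated rule.
consistent :: Summary -> Bool
consistent s =
  matched s + unmatched s == uniqueProcessed s
    && existing s + newFiles s == uniqueProcessed s

main :: IO ()
main = print (consistent (Summary 13 4 9 11 2))
```
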
64 changes: 3 additions & 61 deletions docs/references/subcommands/analyze.md
@@ -152,69 +152,11 @@ Snippet scanning identifies potential open source code snippets within your firs
|---------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `--x-snippet-scan` | Enable snippet scanning during analysis. This experimental feature fingerprints your source files and checks them against FOSSA's snippet database. |

-#### How Snippet Scanning Works
+Snippet Scanning must also be enabled for your organization, and is only available for enterprise customers. If you would like to enable it for your organization, please [contact us](https://support.fossa.com).

-When `--x-snippet-scan` is enabled, the CLI:
+#### More detail

-1. **Hashes Files First**: Creates CRC64 hashes of all source files to identify which files need fingerprinting
-2. **Checks Necessity of Fingerprinting**: Checks with FOSSA servers to determine which file hashes are already known
-3. **Fingerprints New or Changed Files**: Uses the Ficus fingerprinting engine (written in Rust) to create cryptographic fingerprints only for files not previously seen
-4. **Filters Content**: By default, skips directories like `.git/`, and hidden directories. This includes, from `.fossa.yml`, `vendoredDependencies.licenseScanPathFilters.exclude`, documented further below.
-5. **Uploads Fingerprints**: Sends only the fingerprints to FOSSA's servers
-6. **Receives Matches**: Gets back information about any matching open source components
-7. **Uploads Match Contents**: For files that have matches, uploads source code content temporarily to FOSSA servers.
-
-#### Data Sent to FOSSA
-
-**For Performance Optimization:**
-- CRC64 hashes of all files, to avoid re-fingerprinting unchanged files.
-
-**For Fingerprinting:**
-- Fingerprints of source code to identify matches.
-
-**For Matched Files Only:**
-- The actual source code content of files that contain snippet matches.
-
-#### Data Retention
-
-- **File Fingerprints**: Stored permanently for caching and performance optimization
-- **Source Code Content**: Stored temporarily for 30 days and then automatically deleted
-- **CRC64 Hashes**: The likelihood of a collision with CRC64 (2^64 possible values) is extremely low.
-
-#### Directory Filtering
-
-By default, snippet scanning excludes common non-production directories and follows `.gitignore` patterns:
-
-- Hidden directories.
-- Globs as directed by `.gitignore` files.
-
-#### Custom Exclude Filtering
-
-You can customize which files and directories are excluded from snippet scanning by configuring exclude filters in your `.fossa.yml` file. Note that snippet scanning currently only supports exclude patterns, not `only` patterns.
-
-For example:
-```yaml
-version: 3
-vendoredDependencies:
-  licenseScanPathFilters:
-    exclude:
-      - "**/test/**"
-      - "**/tests/**"
-      - "**/spec/**"
-      - "**/node_modules/**"
-      - "**/dist/**"
-      - "**/build/**"
-      - "**/*.test.js"
-      - "**/*.spec.ts"
-```
-
-**Important Notes:**
-
-- Snippet scanning only uses the `exclude` filters from `licenseScanPathFilters` - `only` filters are ignored for this use-case.
-- Path filters use standard glob patterns (e.g., `**/*` for recursive matching, `*` for single-directory matching).
-- The configuration goes in the `vendoredDependencies.licenseScanPathFilters.exclude` section.
-- These exclude patterns are passed directly to the Ficus fingerprinting engine as `--exclude` arguments.
-- Default exclusions (hidden files, `.gitignore` patterns) are applied in addition to custom excludes.
+For more detail about how Snippet Scanning works, how to use file filtering during Snippet Scanning, what information is sent to FOSSA's servers and a description of the Snippet Scan Summary, see [the Snippet Scanning feature documentation](../../features/snippet-scanning.md).

### Experimental Options

4 changes: 2 additions & 2 deletions integration-test/Analysis/FicusSpec.hs
@@ -56,8 +56,8 @@ spec = do
       case result of
         Success _warnings analysisResult -> do
           case analysisResult of
-            Just (FicusSnippetScanResults analysisId) -> do
-              analysisId `shouldSatisfy` (> 0)
+            Just results -> do
+              ficusSnippetScanResultsAnalysisId results `shouldSatisfy` (> 0)
             Nothing -> do
               -- No snippet scan results returned - this is acceptable for integration testing
               True `shouldBe` True
48 changes: 37 additions & 11 deletions src/App/Fossa/Ficus/Analyze.hs
@@ -21,6 +21,7 @@ import App.Fossa.Ficus.Types (
   FicusMessageData (..),
   FicusMessages (..),
   FicusPerStrategyFlag (..),
+  FicusScanStats (..),
   FicusSnippetScanResults (..),
 )
 import App.Types (ProjectRevision (..))
@@ -29,8 +30,7 @@ import Control.Carrier.Diagnostics (Diagnostics)
 import Control.Concurrent.Async (async, wait)
 import Control.Effect.Lift (Has, Lift, sendIO)
 import Control.Monad (when)
-import Data.Aeson (Object, decode, decodeStrictText, (.:))
-import Data.Aeson.Types (parseMaybe)
+import Data.Aeson (decode, decodeStrictText)
 import Data.ByteString.Lazy qualified as BL
 import Data.Conduit ((.|))
 import Data.Conduit qualified as Conduit
@@ -64,6 +64,7 @@ import System.Process.Typed (
   waitExitCode,
   withProcessWait,
 )
+import Text.Printf (printf)
 import Text.URI (render)
 import Text.URI.Builder (PathComponent (PathComponent), TrailingSlash (TrailingSlash), setPath)
 import Types (GlobFilter (..), LicenseScanPathFilters (..))
@@ -116,7 +117,7 @@ analyzeWithFicusMain rootDir apiOpts revision filters snippetScanRetentionDays m
   logDebugWithTime "runFicus completed, processing results..."
   case ficusResults of
     Just results ->
-      logInfo $ "Ficus analysis completed successfully with analysis ID: " <> pretty (ficusSnippetScanResultsAnalysisId results)
+      logInfo $ pretty (formatFicusScanSummary results)
     Nothing -> logInfo "Ficus analysis completed but no fingerprint findings were found"
   pure ficusResults
   where
@@ -131,13 +132,38 @@ analyzeWithFicusMain rootDir apiOpts revision filters snippetScanRetentionDays m
         , ficusConfigSnippetScanRetentionDays = snippetScanRetentionDays
         }

-findingToAnalysisId :: FicusFinding -> Maybe Int
-findingToAnalysisId (FicusFinding (FicusMessageData strategy payload))
+findingToSnippetScanResult :: FicusFinding -> Maybe FicusSnippetScanResults
+findingToSnippetScanResult (FicusFinding (FicusMessageData strategy payload))
   | Text.toLower strategy == "fingerprint" =
-      case decode (BL.fromStrict $ Text.Encoding.encodeUtf8 payload) :: Maybe Object of
-        Just obj -> parseMaybe (.: "analysis_id") obj
-        Nothing -> Nothing
-findingToAnalysisId _ = Nothing
+      decode (BL.fromStrict $ Text.Encoding.encodeUtf8 payload)
+findingToSnippetScanResult _ = Nothing
+
+formatFicusScanSummary :: FicusSnippetScanResults -> Text
+formatFicusScanSummary results =
+  let stats = ficusSnippetScanResultsStats results
+      aid = ficusSnippetScanResultsAnalysisId results
+   in Text.unlines
+        [ "Ficus snippet scan completed successfully!"
+        , ""
+        , "============================================================"
+        , "Snippet scan summary:"
+        , " Analysis ID: " <> toText (show aid)
+        , " Bucket ID: " <> toText (show $ ficusSnippetScanResultsBucketId results)
+        , " Files skipped: " <> toText (show $ ficusStatsSkippedFiles stats)
+        , " Total Files processed: " <> toText (show $ ficusStatsProcessedFiles stats)
+        , " Unique Files processed: " <> toText (show $ ficusStatsUniqueProcessedFiles stats)
+        , " Unique Files with matches found: " <> toText (show $ ficusStatsUniqueMatchedFiles stats)
+        , " Unique Files with no matches found: " <> toText (show $ ficusStatsUniqueUnmatchedFiles stats)
+        , " Unique Files already in our knowledge base: " <> toText (show $ ficusStatsUniqueExistingFiles stats)
+        , " Unique Files new to our knowledge base: " <> toText (show $ ficusStatsUniqueNewFiles stats)
+        , " Processing time: " <> formatProcessingTime (ficusStatsProcessingTimeSeconds stats) <> "s"
+        , "============================================================"
+        , "See the docs for an explanation of this summary: https://github.com/fossas/fossa-cli/blob/master/docs/features/snippet-scanning.md#the-snippet-scan-summary"
+        ]
+  where
+    -- Format the processing time as a string with 3 decimal places
+    formatProcessingTime :: Double -> Text
+    formatProcessingTime seconds = toText (printf "%.3f" seconds :: String)

 runFicus ::
   ( Has Diagnostics sig m
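
The `formatProcessingTime` helper in the hunk above relies on `printf` with the `%.3f` format. Below is a standalone sketch of the same formatting, using `String` instead of `Text` so that it needs only `base`. The `formatSeconds` name is made up for this example.

```haskell
import Text.Printf (printf)

-- Render a duration in seconds with exactly three decimal places.
formatSeconds :: Double -> String
formatSeconds seconds = printf "%.3f" seconds

main :: IO ()
main = putStrLn ("Processing time: " ++ formatSeconds 0.087 ++ "s")
```

For `0.087` this yields `0.087`, which matches the `Processing time: 0.087s` line in the sample summary.
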
@@ -236,10 +262,10 @@ runFicus maybeDebugDir ficusConfig = do
             pure acc
           FicusMessageFinding finding -> do
             hPutStrLn stderr $ "[" ++ timestamp ++ "] FINDING " <> toString (displayFicusFinding finding)
-            let analysisFinding = FicusSnippetScanResults <$> findingToAnalysisId finding
+            let analysisFinding = findingToSnippetScanResult finding
             when (isJust acc && isJust analysisFinding) $
               hPutStrLn stderr $
-                "[" ++ timestamp ++ "] ERROR " <> "Found multiple analysis ids."
+                "[" ++ timestamp ++ "] ERROR " <> "Found multiple ficus analysis responses."
             pure $ acc <|> analysisFinding
       )
       Nothing
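
The fold in `runFicus` starts from `Nothing` and combines each new result with `acc <|> analysisFinding`. For `Maybe`, `<|>` keeps the left value when it is a `Just`, so if Ficus ever emits more than one fingerprint response, the first one is kept and the later ones only trigger the error log line. A small sketch of that behaviour, with a hypothetical `keepFirst` helper over plain `Int`s:

```haskell
import Control.Applicative ((<|>))

-- Fold a stream of optional results, keeping the first Just seen.
-- This mirrors `acc <|> analysisFinding` with an initial Nothing.
keepFirst :: [Maybe Int] -> Maybe Int
keepFirst = foldl (<|>) Nothing

main :: IO ()
main = print (keepFirst [Nothing, Just 110054, Just 999])
```
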
42 changes: 39 additions & 3 deletions src/App/Fossa/Ficus/Types.hs
@@ -13,12 +13,12 @@ module App.Fossa.Ficus.Types (
   FicusHashFlag (..),
   FicusSnippetScanFlag,
   FicusSnippetScanResults (..),
+  FicusScanStats (..),
   FicusPerStrategyFlag (..),
 ) where

 import App.Types (ProjectRevision)
-import Data.Aeson (FromJSON, Value (Object), withText)
-import Data.Aeson qualified as A
+import Data.Aeson (FromJSON (parseJSON), Value (Object), withObject, withText)
 import Data.Aeson.Types (Parser, (.:))
 import Data.Text (Text)
 import Fossa.API.Types
@@ -27,7 +27,43 @@ import Path (Abs, Dir, Path)
 import Text.URI
 import Types (GlobFilter)

-newtype FicusSnippetScanResults = FicusSnippetScanResults {ficusSnippetScanResultsAnalysisId :: Int} deriving (Eq, Ord, Show, Generic)
+data FicusSnippetScanResults = FicusSnippetScanResults
+  { ficusSnippetScanResultsAnalysisId :: Int
+  , ficusSnippetScanResultsBucketId :: Int
+  , ficusSnippetScanResultsStats :: FicusScanStats
+  }
+  deriving (Eq, Ord, Show, Generic)
+
+instance FromJSON FicusSnippetScanResults where
+  parseJSON = withObject "FicusSnippetScanResults" $ \obj ->
+    FicusSnippetScanResults
+      <$> obj .: "analysis_id"
+      <*> obj .: "bucket_id"
+      <*> obj .: "stats"
+
+data FicusScanStats = FicusScanStats
+  { ficusStatsSkippedFiles :: Int
+  , ficusStatsProcessedFiles :: Int
+  , ficusStatsUniqueProcessedFiles :: Int
+  , ficusStatsUniqueNewFiles :: Int
+  , ficusStatsUniqueExistingFiles :: Int
+  , ficusStatsUniqueMatchedFiles :: Int
+  , ficusStatsUniqueUnmatchedFiles :: Int
+  , ficusStatsProcessingTimeSeconds :: Double
+  }
+  deriving (Eq, Ord, Show, Generic)
+
+instance FromJSON FicusScanStats where
+  parseJSON = withObject "FicusScanStats" $ \obj ->
+    FicusScanStats
+      <$> obj .: "skipped_files"
+      <*> obj .: "processed_files"
+      <*> obj .: "unique_processed_files"
+      <*> obj .: "unique_new_files"
+      <*> obj .: "unique_existing_files"
+      <*> obj .: "unique_matched_files"
+      <*> obj .: "unique_unmatched_files"
+      <*> obj .: "processing_time_seconds"

 data FicusMessages = FicusMessages
   { ficusMessageDebugs :: [FicusDebug]
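
The `FromJSON` instances in the hunk above use `(.:)`, which requires every key to be present, so a payload missing any one field fails to decode as a whole. The sketch below shows the same pattern on a cut-down, hypothetical `MiniStats` type with two of the eight stats fields, decoding a made-up payload. It assumes the `aeson` and `bytestring` packages are available; the shape of the real Ficus payload beyond the key names shown in the diff is not known here.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson (FromJSON (parseJSON), eitherDecode, withObject, (.:))
import qualified Data.ByteString.Lazy.Char8 as BL

-- Hypothetical, trimmed-down stand-in for FicusScanStats.
data MiniStats = MiniStats
  { skippedFiles :: Int
  , processingTimeSeconds :: Double
  }
  deriving (Eq, Show)

instance FromJSON MiniStats where
  parseJSON = withObject "MiniStats" $ \obj ->
    MiniStats
      <$> obj .: "skipped_files"
      <*> obj .: "processing_time_seconds"

-- Decode a JSON string into MiniStats, or return the parse error.
decodeStats :: String -> Either String MiniStats
decodeStats = eitherDecode . BL.pack

main :: IO ()
main = do
  -- Made-up payload containing both required keys: decodes to a Right.
  print (decodeStats "{\"skipped_files\": 6, \"processing_time_seconds\": 0.087}")
  -- Missing key: (.:) makes the parse fail, so this is a Left.
  print (decodeStats "{\"skipped_files\": 6}")
```

If a field should be optional, `aeson` also offers `(.:?)`, but the diff uses the strict `(.:)` throughout.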