Commit 8bc4f1e

feat(worker): Add ALWAYS_INDEX_FILE_PATTERNS env var to specify files that should always be indexed (#631)
1 parent c962fdd commit 8bc4f1e

5 files changed: +31 -1 lines changed

CHANGELOG.md

Lines changed: 3 additions & 0 deletions
@@ -7,6 +7,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 
 ## [Unreleased]
 
+### Added
+- Added `ALWAYS_INDEX_FILE_PATTERNS` environment variable to allow specifying a comma separated list of glob patterns matching file paths that should always be indexed, regardless of size or number of trigrams. [#631](https://github.com/sourcebot-dev/sourcebot/pull/631)
+
 ### Fixed
 - Fixed issue where single quotes could not be used in search queries. [#629](https://github.com/sourcebot-dev/sourcebot/pull/629)

docs/docs/configuration/environment-variables.mdx

Lines changed: 1 addition & 0 deletions
@@ -35,6 +35,7 @@ The following environment variables allow you to configure your Sourcebot deploy
 | `SOURCEBOT_STRUCTURED_LOGGING_FILE` | - | <p>Optional file to log to if structured logging is enabled</p> |
 | `SOURCEBOT_TELEMETRY_DISABLED` | `false` | <p>Enables/disables telemetry collection in Sourcebot. See [this doc](/docs/overview.mdx#telemetry) for more info.</p> |
 | `DEFAULT_MAX_MATCH_COUNT` | `10000` | <p>The default maximum number of search results to return when using search in the web app.</p> |
+| `ALWAYS_INDEX_FILE_PATTERNS` | - | <p>A comma separated list of glob patterns matching file paths that should always be indexed, regardless of size or number of trigrams.</p> |
 
 ### Enterprise Environment Variables
 | Variable | Default | Description |

docs/docs/connections/overview.mdx

Lines changed: 20 additions & 0 deletions
@@ -69,6 +69,26 @@ To learn more about how to create a connection for a specific code host, check o
 
 <Note>Missing your code host? [Submit a feature request on GitHub](https://github.com/sourcebot-dev/sourcebot/issues/new?template=feature_request.md).</Note>
 
+## Indexing Large Files
+
+By default, Sourcebot will skip indexing files that are larger than 2MB or have more than 20,000 trigrams. You can configure this by setting the `maxFileSize` and `maxTrigramCount` [settings](/docs/configuration/config-file#settings).
+
+These limits can be ignored for specific files by passing in a comma separated list of glob patterns matching file paths to the `ALWAYS_INDEX_FILE_PATTERNS` environment variable. For example:
+
+```bash
+# Always index all .sum and .lock files
+ALWAYS_INDEX_FILE_PATTERNS=**/*.sum,**/*.lock
+```
+
+Files that have been skipped are assigned the `skipped` language. You can view a list of all skipped files by using the following query:
+```
+lang:skipped
+```
+
+## Indexing Binary Files
+
+Binary files cannot be indexed by Sourcebot. See [#575](https://github.com/sourcebot-dev/sourcebot/issues/575) for more information.
+
 
 ## Schema reference
 ---
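
As a rough illustration of the rule documented in the hunk above, the following TypeScript sketch models the skip decision. The names `shouldSkipFile`, `globToRegExp`, and `DEFAULT_LIMITS` are hypothetical, the 2MB default is assumed to mean 2 * 1024 * 1024 bytes, and the small glob matcher only approximates whatever matching zoekt performs for its `-large_file` flag. It is not Sourcebot's or zoekt's actual implementation.

```typescript
// Hypothetical model of the documented limits; not Sourcebot's real code.
interface IndexLimits {
    maxFileSize: number;     // bytes
    maxTrigramCount: number; // distinct trigrams in the file
}

// Assumption: "2MB" is read as 2 * 1024 * 1024 bytes.
const DEFAULT_LIMITS: IndexLimits = {
    maxFileSize: 2 * 1024 * 1024,
    maxTrigramCount: 20_000,
};

// Minimal glob translation: "**/" spans zero or more directories,
// "*" and "?" stay within one path segment. Real glob libraries
// support more syntax (braces, character classes) than this sketch.
function globToRegExp(glob: string): RegExp {
    let source = "";
    let i = 0;
    while (i < glob.length) {
        if (glob.startsWith("**/", i)) {
            source += "(?:.*/)?";
            i += 3;
        } else if (glob.startsWith("**", i)) {
            source += ".*";
            i += 2;
        } else if (glob.charAt(i) === "*") {
            source += "[^/]*";
            i += 1;
        } else if (glob.charAt(i) === "?") {
            source += "[^/]";
            i += 1;
        } else {
            // Escape regex metacharacters in literal path characters.
            source += glob.charAt(i).replace(/[.+^${}()|[\]\\]/g, "\\$&");
            i += 1;
        }
    }
    return new RegExp(`^${source}$`);
}

// A file is skipped when it exceeds either limit, unless its path
// matches one of the always-index patterns.
function shouldSkipFile(
    path: string,
    sizeBytes: number,
    trigramCount: number,
    alwaysIndexPatterns: string[],
    limits: IndexLimits = DEFAULT_LIMITS,
): boolean {
    if (alwaysIndexPatterns.some((p) => globToRegExp(p).test(path))) {
        return false;
    }
    return sizeBytes > limits.maxFileSize || trigramCount > limits.maxTrigramCount;
}
```

Under these assumptions, `shouldSkipFile("yarn.lock", 5_000_000, 100, ["**/*.lock"])` returns `false`, while the same call with an empty pattern list returns `true`.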

packages/backend/src/zoekt.ts

Lines changed: 4 additions & 1 deletion
@@ -1,5 +1,5 @@
 import { Repo } from "@sourcebot/db";
-import { createLogger } from "@sourcebot/shared";
+import { createLogger, env } from "@sourcebot/shared";
 import { exec } from "child_process";
 import { INDEX_CACHE_DIR } from "./constants.js";
 import { Settings } from "./types.js";
@@ -11,6 +11,8 @@ export const indexGitRepository = async (repo: Repo, settings: Settings, revisio
     const { path: repoPath } = getRepoPath(repo);
     const shardPrefix = getShardPrefix(repo.orgId, repo.id);
 
+    const largeFileGlobPatterns = env.ALWAYS_INDEX_FILE_PATTERNS?.split(',').map(pattern => pattern.trim()) ?? [];
+
     const command = [
         'zoekt-git-index',
         '-allow_missing_branches',
@@ -21,6 +23,7 @@ export const indexGitRepository = async (repo: Repo, settings: Settings, revisio
         `-tenant_id ${repo.orgId}`,
         `-repo_id ${repo.id}`,
         `-shard_prefix ${shardPrefix}`,
+        ...largeFileGlobPatterns.map((pattern) => `-large_file ${pattern}`),
         repoPath
     ].join(' ');
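
The hunk above splits the variable on commas, trims each entry, and emits one `-large_file` flag per pattern. A slightly more defensive variant might look like the sketch below, which also drops empty entries (for example from a trailing comma) and single-quotes each pattern. Quoting could matter if the command string is run through a shell, as `exec` from `child_process` does, since an unquoted glob may be expanded by the shell when something in the working directory matches it. The helper names `parsePatterns`, `shellQuote`, and `largeFileFlags` are hypothetical and not part of the commit; only the `-large_file` flag and the environment variable name come from the diff.

```typescript
// Hypothetical helpers; only `-large_file` comes from the commit itself.

// Split a comma separated list, trim whitespace, and drop empty entries.
function parsePatterns(raw: string | undefined): string[] {
    return (raw ?? "")
        .split(",")
        .map((pattern) => pattern.trim())
        .filter((pattern) => pattern.length > 0);
}

// POSIX-style single quoting: wrap in single quotes and rewrite each
// embedded single quote as '\'' so the shell sees the value literally.
function shellQuote(value: string): string {
    return `'${value.replace(/'/g, "'\\''")}'`;
}

// Build one `-large_file <pattern>` argument per configured pattern.
function largeFileFlags(raw: string | undefined): string[] {
    return parsePatterns(raw).map((pattern) => `-large_file ${shellQuote(pattern)}`);
}
```

An unset variable yields an empty list, so no extra flags are added to the command in that case. Passing arguments as an array to `execFile` or `spawn` would avoid the shell entirely, at the cost of restructuring how the command is built.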

packages/shared/src/env.server.ts

Lines changed: 3 additions & 0 deletions
@@ -219,6 +219,9 @@ export const env = createEnv({
 
         // Configure the default maximum number of search results to return by default.
         DEFAULT_MAX_MATCH_COUNT: numberSchema.default(10_000),
+
+        // A comma separated list of glob patterns that should always be indexed regardless of their size.
+        ALWAYS_INDEX_FILE_PATTERNS: z.string().optional(),
     },
     runtimeEnv,
     emptyStringAsUndefined: true,
