Engineering Standards
Table of Contents
- Core Philosophy
- Code Quality & Linting
- Golang Best Practices
- Repository Structure
- Error Handling
- External Connections
- Kubernetes & GitOps
- Security & Compliance
- Documentation
- Investigation Methodology
Core Philosophy
Security, reliability, and compliance are non-negotiable. Every line of code is a potential attack vector or compliance violation. Move deliberately, not fast. Test all assumptions.
The Master’s Approach
Excellence is a muscle. The more you exercise it, the stronger it becomes. “Doing it right” gets faster with practice, and faster work becomes better with repetition. That’s mastery: producing master-level work with less effort than it takes an apprentice to pick up the tools.
Write every line of code as if you’ll publish it under your own name. Even if a repo will “never see the light of day,” we owe it to ourselves and our successors to maintain excellence. Git is forever. If your name is on it, make it a lesson in coding excellence.
Code Reusability
Design your code so it can be reused and refactored quickly. This isn’t about perfection—it’s about making it easy for your future self to swap out clunky implementations in 5-10 minutes of free time.
Technical debt rarely persists for lack of desire to pay it down; it persists for lack of time. Front-load your pain by writing tests and modular code. If refactoring is easy and provable through tests, it will get done. Make refactoring easy.
Code Quality & Linting
Golang Linting is Law
golangci-lint is the standard. All Go code must pass golangci-lint with the project’s standardized configuration.
- Zero tolerance for linting violations. If the linter complains, the code is wrong.
- When suggesting code changes, verify they comply with the project’s golangci-lint configuration.
- Reference configuration: .golangci.yml (included in this repository)
Named Returns Are Mandatory
The namedreturns linter (github.com/nikogura/namedreturns) enforces named return values. Named returns provide critical information to both engineers and compilers.
Why Named Returns Matter:
// WRONG - Unnamed returns lack context
func Foo() (string, error) {
    return "", nil
}
// RIGHT - Named returns document intent
func Foo() (output string, err error) {
    output = "bar"
    return output, err
}
Honor the Function Contract:
- Name your return values descriptively
- Return what you promise, not something “just as good”
- Every return statement must explicitly use the named return variables
Critical Linter Assumptions
- Never assume a linter has bugs without explicit evidence
- If lint fails, the problem is your code, not the linter
- Don’t waste time debugging linters—focus on making your code compliant
- If you encounter an error you don’t understand:
- Read the linter’s source code
- Look at test cases in the linter’s repository
- Assume it’s enforcing a discipline you’re not familiar with
- Adapt your code to meet the requirement
Common Lint Patterns
1. Cobra Globals/Init
//nolint:gochecknoglobals // Cobra boilerplate
var rootCmd = &cobra.Command{...}
//nolint:gochecknoinits // Cobra boilerplate
func init() {...}
2. Avoid Nested Closures with Returns
// ANTI-PATTERN - causes namedreturns issues
func Outer() (result string, err error) {
    token, err := jwt.Parse(str, func(t *jwt.Token) (interface{}, error) {
        if check {
            return nil, errors.New("bad")
        }
        return key, nil
    })
}
// CORRECT - extract to top-level function
func lookupKey(token *jwt.Token, config Config) (key interface{}, err error) {
    return key, err
}
func Outer() (result string, err error) {
    token, err := jwt.Parse(str, func(t *jwt.Token) (interface{}, error) {
        return lookupKey(t, config)
    })
}
3. Reduce Cognitive Complexity
Extract helper functions instead of deeply nested logic:
func validateAudience(claims jwt.MapClaims, expected string) bool {...}
func extractGroups(claims jwt.MapClaims) []string {...}
func validateGroupMembership(userGroups, allowedGroups []string) bool {...}
4. Avoid Inline Error Handling
// WRONG
if err := doSomething(); err != nil {...}
// RIGHT
err := doSomething()
if err != nil {...}
5. Proto Field Access
req.GetSourceBucket() // CORRECT
req.SourceBucket // WRONG
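Pattern 3 above lists helper signatures only. A minimal sketch of how such helpers keep each function flat might look like the following. MapClaims is a local stand-in for jwt.MapClaims so the block needs no third-party import, and the "groups" claim name and the authorize function are assumptions made for illustration.

```go
package main

import "fmt"

// MapClaims is a local stand-in for jwt.MapClaims (assumption for this sketch).
type MapClaims map[string]any

// validateAudience reports whether the "aud" claim is a string equal to expected.
func validateAudience(claims MapClaims, expected string) (valid bool) {
	aud, ok := claims["aud"].(string)
	valid = ok && aud == expected
	return valid
}

// extractGroups reads the hypothetical "groups" claim as a list of strings.
func extractGroups(claims MapClaims) (groups []string) {
	raw, ok := claims["groups"].([]any)
	if !ok {
		return groups
	}
	for _, item := range raw {
		name, isString := item.(string)
		if isString {
			groups = append(groups, name)
		}
	}
	return groups
}

// validateGroupMembership reports whether any user group is in the allowed set.
func validateGroupMembership(userGroups, allowedGroups []string) (member bool) {
	allowed := make(map[string]struct{}, len(allowedGroups))
	for _, group := range allowedGroups {
		allowed[group] = struct{}{}
	}
	for _, group := range userGroups {
		_, member = allowed[group]
		if member {
			return member
		}
	}
	return member
}

// authorize composes the helpers; every check is a flat early return.
func authorize(claims MapClaims, audience string, allowedGroups []string) (err error) {
	if !validateAudience(claims, audience) {
		err = fmt.Errorf("audience mismatch: expected %s", audience)
		return err
	}
	if !validateGroupMembership(extractGroups(claims), allowedGroups) {
		err = fmt.Errorf("no group in %v is allowed", allowedGroups)
		return err
	}
	return err
}

func main() {
	claims := MapClaims{"aud": "api", "groups": []any{"dev", "ops"}}
	fmt.Println(authorize(claims, "api", []string{"ops"}))
	fmt.Println(authorize(claims, "other", []string{"ops"}))
}
```

Each helper answers one question, so the caller reads as a short list of checks rather than a nest of conditionals.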
Golang Best Practices
Go Proverbs
One of the language’s authors, Rob Pike, said it best: https://go-proverbs.github.io/
Key proverbs we emphasize:
- interface{} says nothing - Use real interfaces with actual types
- Clear is better than clever - Your 3AM self will thank you
- Errors are values - Use them, don’t just check them
- Don’t panic - Handle errors gracefully
- A little copying is better than a little dependency
Write Golang as Golang
Golang is interface-oriented, not object-oriented. This is subtle but fundamental. Each language has syntax, but also idiom. Learn both, or you’re missing out on what makes the language worth using.
Keep packages flat. Labyrinthine package trees get complicated quickly. If you really need a different package, make it a totally different module with its own lifecycle in a different repo.
SOLID/MVC Principles in Golang
While SOLID was written for OOP (and Golang isn’t OO), the principles remain the same:
- Single Responsibility - Do one thing well (The Unix Way)
- Don’t half-ass two things. Whole-ass one thing. - Ron Swanson
Libraries (Models)
- Write general-purpose, exhaustively tested libraries
- Libraries should consume data and emit data
- No assumptions about how or where data will be used
- Libraries don’t change much unless business fundamentals change
- One library can serve CLI, web, API, and other programs
Views
- Views change frequently
- One Model to many Views relationship
- A CLI is just a view—one way to interact with the Model
- Web pages, APIs, CLIs are all just different views
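The Model/View split above can be sketched with one library function and two renderings of its output. WordCount, RenderText, and RenderJSON are hypothetical names chosen for illustration; in a real repo the model would live in /pkg and each view in its own program.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// WordCount is the Model: it consumes data and emits data, and knows nothing
// about how the result will be displayed.
func WordCount(text string) (counts map[string]int) {
	counts = make(map[string]int)
	for _, word := range strings.Fields(strings.ToLower(text)) {
		counts[word]++
	}
	return counts
}

// RenderText is one View: a single line suitable for a CLI.
func RenderText(counts map[string]int, word string) (line string) {
	line = fmt.Sprintf("%s: %d", word, counts[word])
	return line
}

// RenderJSON is another View over the same Model: a body suitable for an API.
func RenderJSON(counts map[string]int) (body string, err error) {
	raw, err := json.Marshal(counts)
	if err != nil {
		err = fmt.Errorf("failed to encode %d word counts: %w", len(counts), err)
		return body, err
	}
	body = string(raw)
	return body, err
}

func main() {
	counts := WordCount("Go go gopher")
	fmt.Println(RenderText(counts, "go"))
	body, err := RenderJSON(counts)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(body)
}
```

Adding a third view (a web page, say) would not touch WordCount or its tests.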
Tests
- No such thing as a “stupid test”
- Simple tests take 90 seconds to write and live forever
- Little tests add up—no raindrop feels responsible for the flood
- Teach the machine to verify your expectations
Code Layout
Follow https://github.com/golang-standards/project-layout
Key points:
- Put library code in /pkg (reusable by other programs)
- Avoid /internal unless you have an astoundingly good reason
- Put CLI/main code in /cmd
- Never mix the two
Generate standardized repos:
dbt boilerplate gen
Public vs Private
Default to public unless you can articulate why something needs to be private. Don’t impose restrictions on your future self without good reason.
In Golang:
- Public: starts with capital letter (accessible outside package)
- Private: starts with lowercase letter (internal only)
This is especially important for Golang beginners. Go’s own documentation and compiler messages call these exported and unexported, some error messages are esoteric, and familiar words are used in slightly different ways that will bite you.
Building Code
Use standard conventions:
go build # Produces binary named after directory
Don’t do this:
go build -o main . # Binary named "main" is meaningless
Seeing /app/main in a container tells you nothing. Seeing /app/curator-bybit provides information.
Generally, a single repository should build a single binary and be buildable with simply go build. If the repo is laid out differently or contains multiple binaries, perhaps you’re being overly clever. Clear is better than clever. Simple is easier to debug in a crisis.
Repository Structure
Standard Dockerfile
Use multi-stage builds with distroless images for minimal attack surface and image size:
FROM golang:1.23.4 AS builder
WORKDIR /app
# Setup Git for SSH access to private repos
RUN git config --global url."git@github.com:".insteadOf "https://github.com/"
RUN mkdir -p ~/.ssh && ssh-keyscan -H github.com >> ~/.ssh/known_hosts
RUN go env -w GOPRIVATE="github.com/<YOUR_ORG>/*"
# Copy dependency files first for layer caching
COPY go.mod go.sum ./
RUN --mount=type=ssh go mod download
# Copy source code
COPY . .
# Build static binary
RUN CGO_ENABLED=0 GOOS=linux go build .
# Use distroless for minimal runtime
FROM gcr.io/distroless/base-debian12:nonroot
# Copy binary with correct ownership
COPY --from=builder --chown=nonroot:nonroot /app/<binary-name> /app/
ENTRYPOINT ["/app/<binary-name>"]
Key points:
- Multi-stage builds keep images small (distroless base is ~20MB vs ~1GB for full golang image)
- CGO_ENABLED=0 produces static binaries that work in distroless
- Distroless images contain only your app and runtime dependencies (no shell, package managers)
- nonroot user runs as UID 65532 for security
- Layer caching: copy go.mod/go.sum before source code
- SSH mount (--mount=type=ssh) provides access to private repositories during build
- Distroless images handle signals properly - the binary runs as PID 1 and receives signals directly
Building with SSH:
docker build --ssh default .
Container Consistency
Use the same container for all stages:
- If you plan to run on Ubuntu, build with Ubuntu
- If you plan to run on Debian, build and test with Debian
- Never build or test with Alpine if running on Debian/Ubuntu
Alpine uses musl libc instead of glibc. Binaries dynamically linked against musl on Alpine won’t run on Debian. These errors cost many hours of debugging. Avoid them entirely.
Error Handling
Errors Are Values
Don’t just check errors to satisfy the compiler. Use the value. When constructing error handling routines, make them useful.
Always Do This
thing := "foo"
err := DoSomethingWith(thing)
if err != nil {
    err = errors.Wrapf(err, "failed to do something with %s", thing)
    return err
}
Add context to errors. Make it easy to grep the code for the error message.
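Wrapping adds context without discarding the original value, which is what lets callers use an error rather than merely check it. The sketch below shows this with the standard library only: fmt.Errorf with %w stands in for errors.Wrapf from github.com/pkg/errors so the block stays self-contained. ErrNotFound, lookup, describe, and isNotFound are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound is a hypothetical sentinel error used for illustration.
var ErrNotFound = errors.New("not found")

// lookup knows only one key, "foo"; anything else is not found.
func lookup(name string) (value string, err error) {
	if name != "foo" {
		err = ErrNotFound
		return value, err
	}
	value = "bar"
	return value, err
}

// describe adds greppable context while keeping the cause in the chain.
func describe(name string) (summary string, err error) {
	value, err := lookup(name)
	if err != nil {
		err = fmt.Errorf("failed to look up %s: %w", name, err)
		return summary, err
	}
	summary = name + "=" + value
	return summary, err
}

// isNotFound uses the error value: it inspects the wrapped chain for the cause.
func isNotFound(err error) (found bool) {
	found = errors.Is(err, ErrNotFound)
	return found
}

func main() {
	_, err := describe("baz")
	fmt.Println(err)             // failed to look up baz: not found
	fmt.Println(isNotFound(err)) // true
}
```

The message "failed to look up" appears exactly once in the code, so a grep on a log line leads straight to the failing call.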
Never Do This
Don’t ignore errors:
thing, _ := SomethingThatReturnsThingAndErr() // WRONG - errors are hidden
Don’t use boilerplate error messages:
const SOME_ERROR = "something went wrong"
return errors.New(SOME_ERROR) // WRONG - can't grep for it
Don’t compress error handling:
// WRONG - hard to read
if err := priceConverter.Start(ctx); err != nil {
    return fmt.Errorf("could not start price converter: %w", err)
}
// RIGHT - clear and readable
err := priceConverter.Start(ctx)
if err != nil {
    err = errors.Wrapf(err, "could not start price converter for %s", thing.Name())
    return err
}
Don’t Panic
panic() and recover() are built into Golang, but we should never use them. Simply put, there are better mechanisms to handle errors. Most of the time, panicking is a sign of laziness or lack of concern for maintainability.
Especially avoid deferred panic recovery:
// HORRIBLE - don't do this
defer func() {
    if err := recover(); err != nil {
        logrus.WithField("err", err).Error("panic-discovered")
        sentry.CaptureException(err.(error))
        panic(err) // Re-panicking is even worse
    }
}()
This gives you no information about where the error came from. In a 775-line function, good luck finding the error point quickly.
Exit with Logs and Return Codes
Services should not exit unless they cannot continue. When you must exit:
- Use exit(), not panic()
- Log useful information before exiting
- Return an error code (0 = success, non-zero = failure)
- Prefer exit(100) over exit(1) for deliberate exits
Standardized error codes are an eventual goal. In the short term, even exit(1) is preferable to panic or no exit at all.
Don’t leave us hanging. Getting paged for a CrashLoopBackOff is not how we should learn about problems.
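One way to follow these rules in Go is to keep all real work in a function that returns an error and let main be the only place that logs and exits. The sketch below assumes a hypothetical LISTEN_ADDR setting and a hypothetical run function; the exit code 100 is the deliberate-exit value suggested above.

```go
package main

import (
	"errors"
	"log"
	"os"
)

// exitCodeFailure is the deliberate-exit code suggested in this section.
const exitCodeFailure = 100

// run holds the real work and returns an error instead of exiting or panicking.
// The listen address check is a placeholder for real startup validation.
func run(listenAddr string) (err error) {
	if listenAddr == "" {
		err = errors.New("listen address is empty; cannot start")
		return err
	}
	return err
}

// exitCode maps the outcome of run to a process exit code.
func exitCode(err error) (code int) {
	if err != nil {
		code = exitCodeFailure
	}
	return code
}

func main() {
	// LISTEN_ADDR is a hypothetical setting; fall back to a default if unset.
	addr := os.Getenv("LISTEN_ADDR")
	if addr == "" {
		addr = ":8080"
	}
	err := run(addr)
	if err != nil {
		// Log something useful before exiting so nobody has to guess.
		log.Printf("exiting with code %d: %s", exitCodeFailure, err)
	}
	os.Exit(exitCode(err))
}
```

Because run never calls os.Exit itself, it stays testable and deferred cleanup inside it still executes.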
Return Early
Check errors and bail out of functions as soon as you find one you can’t handle. Golang generally avoids else and prefers early returns.
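A small illustration of the early-return shape, using a hypothetical parsePort function: every failure leaves the function immediately, so the happy path stays unindented and no else branch is needed.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// parsePort validates a port string, returning at the first problem found.
func parsePort(raw string) (port int, err error) {
	if raw == "" {
		err = errors.New("port is empty")
		return port, err
	}
	port, err = strconv.Atoi(raw)
	if err != nil {
		err = fmt.Errorf("port %q is not a number: %w", raw, err)
		return port, err
	}
	if port < 1 || port > 65535 {
		err = fmt.Errorf("port %d is outside 1-65535", port)
		return port, err
	}
	return port, err
}

func main() {
	for _, raw := range []string{"8080", "", "abc", "70000"} {
		port, err := parsePort(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", port)
	}
}
```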
External Connections
Every system making external connections must support:
- Rate limits that default to 0 (no limit)
- Retries with exponential backoff
- Clear logging - we should always know what’s happening
- Metrics on:
- Number of calls by method
- Number of errors
- Durations
- Backoff period
- All broken down by method
- Never exit or panic on failed connection - workload should continue to self-heal
Pods entering crashloop due to connection errors is an anti-pattern to avoid at all costs.
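As a rough sketch of the requirements above, a thin wrapper can apply a rate limit whose zero value means no limit and record calls, errors, and durations per method. Client, MinInterval, and the in-memory stats map are assumptions made for illustration; a real service would more likely export these through its metrics library, and the throttle here is only meant for a single sequential caller.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var errUpstreamDown = errors.New("upstream down")

// methodStats holds the per-method numbers listed above.
type methodStats struct {
	Calls    int
	Errors   int
	Duration time.Duration
}

// Client is a hypothetical wrapper around an external connection.
type Client struct {
	// MinInterval is the minimum gap between calls; zero means no limit.
	MinInterval time.Duration

	mu    sync.Mutex
	last  time.Time
	stats map[string]methodStats
}

// Do runs call under the rate limit and records the outcome by method name.
func (c *Client) Do(method string, call func() error) (err error) {
	c.throttle()
	start := time.Now()
	err = call()
	elapsed := time.Since(start)

	c.mu.Lock()
	defer c.mu.Unlock()
	if c.stats == nil {
		c.stats = make(map[string]methodStats)
	}
	s := c.stats[method]
	s.Calls++
	s.Duration += elapsed
	if err != nil {
		s.Errors++
	}
	c.stats[method] = s
	return err
}

// throttle sleeps until MinInterval has passed since the previous call.
func (c *Client) throttle() {
	if c.MinInterval <= 0 {
		return
	}
	c.mu.Lock()
	wait := c.MinInterval - time.Since(c.last)
	c.mu.Unlock()
	if wait > 0 {
		time.Sleep(wait)
	}
	c.mu.Lock()
	c.last = time.Now()
	c.mu.Unlock()
}

// Stats returns a copy of the numbers recorded for one method.
func (c *Client) Stats(method string) (stats methodStats) {
	c.mu.Lock()
	defer c.mu.Unlock()
	stats = c.stats[method]
	return stats
}

func succeed() (err error) { return err }

func fail() (err error) {
	err = errUpstreamDown
	return err
}

func main() {
	client := &Client{} // zero MinInterval: no rate limit
	for _, call := range []func() error{succeed, fail} {
		err := client.Do("GetPrice", call)
		if err != nil {
			// Log and carry on; a failed connection is not a reason to exit.
			fmt.Printf("GetPrice failed, will retry later: %s\n", err)
		}
	}
	stats := client.Stats("GetPrice")
	fmt.Printf("calls=%d errors=%d\n", stats.Calls, stats.Errors)
}
```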
Retries and Exponential Backoff
We live in an imperfect world. Code the assumption that things won’t always work into your services.
In Kubernetes, pods are mortal. This means the fundamental unit of work is expected to run, die, and be replaced at any time. A CrashLoopBackOff is not an error in itself. Something stuck in CrashLoopBackOff for too long is.
Assume whatever you’re connecting to may not be there or may not answer. Incorporate retry logic, but endless retries become the cause of further errors. Retries need exponential backoff.
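A minimal sketch of bounded retries with exponential backoff, using only the standard library. The names retry, backoffDelay, and flakyConn are hypothetical, the attempt count and delays are arbitrary, and jitter is left out for brevity even though a production version would probably want it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// backoffDelay doubles base once per prior attempt and caps the result at ceiling.
func backoffDelay(attempt int, base, ceiling time.Duration) (delay time.Duration) {
	delay = base
	for i := 0; i < attempt && delay < ceiling; i++ {
		delay *= 2
	}
	if delay > ceiling {
		delay = ceiling
	}
	return delay
}

// retry calls op until it succeeds, attempts run out, or ctx is cancelled.
func retry(ctx context.Context, attempts int, base, ceiling time.Duration, op func() error) (err error) {
	if attempts < 1 {
		attempts = 1
	}
	for attempt := 0; attempt < attempts; attempt++ {
		err = op()
		if err == nil {
			return err
		}
		if attempt == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			err = fmt.Errorf("retry cancelled after %d attempts: %w", attempt+1, ctx.Err())
			return err
		case <-time.After(backoffDelay(attempt, base, ceiling)):
		}
	}
	err = fmt.Errorf("gave up after %d attempts: %w", attempts, err)
	return err
}

// flakyConn simulates a dependency that refuses the first few connections.
type flakyConn struct {
	failures int
	calls    int
}

func (f *flakyConn) dial() (err error) {
	f.calls++
	if f.calls <= f.failures {
		err = errors.New("connection refused")
	}
	return err
}

// dialWithRetry allows four attempts, waiting 1ms, 2ms, then 4ms between them.
func dialWithRetry(conn *flakyConn) (err error) {
	err = retry(context.Background(), 4, time.Millisecond, 8*time.Millisecond, conn.dial)
	return err
}

func main() {
	conn := &flakyConn{failures: 2}
	err := dialWithRetry(conn)
	fmt.Println(conn.calls, err)
}
```

The caller gets an error back once the attempts are exhausted and decides what to do next; the sketch itself never exits or panics.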
Kubernetes & GitOps
GitOps Principles
GitOps, coined by Weaveworks, is the practice of controlling infrastructure declaratively through Git.
In short:
- Everything is applied via IaC through Git
- Nothing is applied outside of IaC through Git
- If something is added/removed in Git, it’s added/removed from the environment
- If something is changed outside of Git, automation changes it back
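The four rules above amount to a reconciliation loop: compare what Git declares with what the environment contains, then create, update, or delete until they match. The sketch below shows only the comparison step, with resources reduced to a name and an opaque spec string. Action and reconcile are hypothetical names, and this is not a description of how any particular GitOps tool is implemented.

```go
package main

import (
	"fmt"
	"sort"
)

// Action is one change needed to make the environment match Git.
type Action struct {
	Verb string
	Name string
}

// reconcile compares desired state (from Git) with actual state (from the
// environment) and returns the changes needed, ordered by resource name.
func reconcile(desired, actual map[string]string) (actions []Action) {
	names := make([]string, 0, len(desired)+len(actual))
	for name := range desired {
		names = append(names, name)
	}
	for name := range actual {
		_, inGit := desired[name]
		if !inGit {
			names = append(names, name)
		}
	}
	sort.Strings(names)

	for _, name := range names {
		want, inGit := desired[name]
		have, live := actual[name]
		switch {
		case !inGit:
			// Removed from Git, so removed from the environment.
			actions = append(actions, Action{Verb: "delete", Name: name})
		case !live:
			// Added in Git, so added to the environment.
			actions = append(actions, Action{Verb: "create", Name: name})
		case want != have:
			// Drift, including changes made outside Git, is changed back.
			actions = append(actions, Action{Verb: "update", Name: name})
		}
	}
	return actions
}

func main() {
	desired := map[string]string{"api": "v2", "worker": "v1"}
	actual := map[string]string{"worker": "hand-edited", "legacy": "v1"}
	for _, action := range reconcile(desired, actual) {
		fmt.Printf("%s %s\n", action.Verb, action.Name)
	}
}
```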
Never:
- Propose or attempt direct kubectl applies to managed systems
- Modify resources in managed systems without explicit instruction
- Commit or push to git without explicit instruction
No Blind Application of Resources
One of the biggest anti-patterns is blindly downloading and applying resources to clusters. Examples to avoid:
# DON'T do this
kubectl apply -f https://raw.githubusercontent.com/.../deploy.yaml
helm install my-release some-chart
kubectl apply -k <url>
Under no circumstances should anyone apply unevaluated resources to clusters. Read through every line and understand how they apply to your infrastructure and affect your security posture.
Control Repositories
Control Repositories contain Infrastructure as Code configuration. The main controls needed:
- What change was made?
- Who made the change?
- When did the change happen?
- How do we roll back?
All of the above is given for free with VCS, so code reviews and pull requests can cause unnecessary burden and slow down delivery velocity.
That’s not to say there shouldn’t be PRs and code reviews. We use the same tools (Git, etc) for control repositories and code repositories, but we don’t necessarily need to use the tools in the same way.
Much of the time, ceremonies on a control repository can be reduced or abbreviated. It’s not that we don’t want reviews or communication, but we also don’t want to needlessly distract or get into a pattern where people don’t read or understand the changes and just ‘rubber stamp’ approvals.
Kubernetes Resource Approaches
There are 4 main ways of managing resources in Kubernetes:
You’ll encounter all four eventually. You can’t really work in Kubernetes without understanding all four. Among them, Jsonnet is the author’s preferred solution due to its flexibility and resistance to blind application. Having everything explicitly laid out in code that is checked in also makes it really easy for an agent to read/understand/diagnose problems, and better yet, fix them.
Proto Files: Interface Definitions, Not Data Authorities
Proto files define interfaces between services, not authoritative data types. Their primary function is to define message structures for communication, ensuring consistency in API contracts.
Why This Matters
Your application logic should dictate data types—not proto files. Proto messages are transport representations, optimized for serialization. They’re not designed to be the authoritative source of core data structures.
Pitfalls of Using Proto Messages as Core Types
- Conflicting imports & type mismatches - Multiple services with different expectations
- Code coupling - Changes to proto files ripple across unrelated parts
- Unnecessary complexity - Proto dependencies in utility libraries
- Serialization vs. business logic - Blurred concerns lead to performance bottlenecks
A Better Approach
- Define your own structs - Domain-driven structs that fit your application
- Convert between internal structs & proto messages - Map when communicating
- Keep protos focused on API contracts - Define how services interact, nothing more
- Never reuse tags - Changing a field? Use a new tag number
Example:
// WRONG - Proto-centric
import "github.com/myorg/protos/common"
func GetBlockchainID() common.Blockchain {
    return common.Blockchain_ETH // Hard-coupled to proto
}
// RIGHT - Decoupled
type Blockchain int
const (
    Ethereum Blockchain = 1
    Bitcoin  Blockchain = 2
)
func GetBlockchainID() Blockchain {
    return Ethereum
}
// Convert only when needed
func ToProtoBlockchain(b Blockchain) common.Blockchain {
    return common.Blockchain(b)
}
Think of proto files like blueprints for service contracts—not laws governing your entire codebase.
Observability
Understand the Difference between Metrics and Logs
Logs are expensive. Absent complicated log parsing mechanisms, they’re a stream of noise nobody reads.
Metrics, being essentially numbers, are easy for machines to parse, store, and understand. We prefer to expose critical information via metrics since it’s easy to:
- Perform actions based on them
- Track changes over time
- Look back when incidents happen
Logging “all the things, all the time” is not sustainable. Metrics are much cheaper and easier to handle.
That’s not to say that logs are unimportant. The cost of metrics in systems like Prometheus scales with the cardinality of their labels, so cardinality is a huge concern with metrics systems.
Logs, on the other hand, allow near-infinite cardinality. Some data, such as individual requests, is better captured by logs.
Any real system will incorporate both.
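The cardinality concern is simple arithmetic: in the worst case a metric produces one time series per combination of label values, so the series count is the product of the distinct values of each label. The label names and counts in this sketch are made up for illustration.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(labelValues map[string]int) (series int) {
	series = 1
	for _, distinct := range labelValues {
		series *= distinct
	}
	return series
}

func main() {
	// Two low-cardinality labels: cheap to store and query.
	byType := seriesCount(map[string]int{"type": 10, "method": 5})
	// Swap one label for client IP and the same metric explodes.
	byClient := seriesCount(map[string]int{"type": 10, "client_ip": 1000000})
	fmt.Println(byType)   // 50
	fmt.Println(byClient) // 10000000
}
```

Anything that behaves like the second case (IP addresses, URIs, request IDs) belongs in logs.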
Real-World Example: WAF Security Logs
Web Application Firewall (WAF) logs demonstrate the complementary nature of metrics and logs.
The Problem: WAF systems like ModSecurity generate structured JSON logs for every blocked request. Each log contains:
- Client IP address (potentially millions of unique values)
- Full request URI (infinite possible values)
- Request headers and cookies (high cardinality)
- Matched rule details and attack patterns
- Timestamp and geographic data
Storing this in Prometheus would be impossible - the cardinality would explode the metrics database.
The Solution: Hybrid Approach
Metrics for Aggregates:
# Track overall block rate
rate(modsec_requests_blocked_total[5m])
# Alert on anomalies (more than 100 blocks per minute)
rate(modsec_requests_blocked_total[1m]) * 60 > 100
# Track by attack type (low cardinality)
modsec_blocks_by_type{type="SQLi"}
modsec_blocks_by_type{type="XSS"}
Logs for Details:
- Ship structured ModSecurity logs to Elasticsearch via Filebeat/Fluent Bit
- Use ingest pipelines to enrich with GeoIP data
- Extract attack types from rule IDs (930=LFI, 941=XSS, 942=SQLi)
- Create searchable indices for investigation
Example Elasticsearch Ingest Pipeline:
{
  "processors": [
    {
      "geoip": {
        "field": "client",
        "target_field": "geoip"
      }
    },
    {
      "script": {
        "source": "
          if (ctx.transaction?.messages != null) {
            def ruleId = ctx.transaction.messages[0]?.details?.ruleId;
            if (ruleId?.startsWith('930')) ctx.attack_type = 'Path Traversal/LFI';
            else if (ruleId?.startsWith('941')) ctx.attack_type = 'XSS';
            else if (ruleId?.startsWith('942')) ctx.attack_type = 'SQLi';
          }
        "
      }
    }
  ]
}
Query Pattern:
# Metrics: "Are we under attack right now?"
curl -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(modsec_requests_blocked_total[1m])'
# Logs: "Who attacked us and what did they try?"
curl -H 'Content-Type: application/json' 'http://elasticsearch:9200/modsec-*/_search' -d '{
  "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
  "aggs": {
    "by_country": {"terms": {"field": "geoip.country_name"}},
    "by_attack": {"terms": {"field": "attack_type"}}
  }
}'
Investigation Automation:
The diagnostic-slackbot project demonstrates automated WAF analysis:
- Slack users trigger investigations via slash commands
- Bot queries Loki for recent ModSecurity blocks
- Claude AI analyzes logs and categorizes threats
- Automated reports identify false positives vs. real attacks
- Generates whitelisting recommendations for legitimate traffic
Key Insight: Use metrics for real-time alerting and dashboards. Use logs for forensic analysis and investigation. Neither is sufficient alone.
Prefer Continuous Systems to Jobs
When possible, prefer services that run continuously over Jobs and CronJobs.
If you need a periodic task, code the period into the service itself with internal timers. This provides easier monitoring via Prometheus rather than using a PushGateway.
Prometheus supports the PushGateway, but the support is limited. Prometheus was designed to periodically scrape a long-running service. Pushing metrics runs counter to the basic Prometheus design philosophy, and therefore it’s not always as reliable as scraping. The last thing we want is to lose information.
Services with internal cron-like tasks must:
- Run once on server start (so “kick off now” means “delete the pod”)
- Expose metrics:
- Counter for number of runs (with timestamp label)
- Counter for failed runs
- Counter for successful runs
- Gauge showing run duration
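A service with an internal cron-like task can be sketched as a loop that runs once immediately and then on a ticker. The runMetrics struct below stands in for the counters and gauge listed above, and the names runOnce, runPeriodically, and job are hypothetical. The periods are shortened to milliseconds so the example finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// runMetrics stands in for real metrics; it is only touched from one goroutine.
type runMetrics struct {
	Runs         int
	Failed       int
	Succeeded    int
	LastDuration time.Duration
}

// runOnce executes the task once and records the outcome.
func runOnce(task func() error, m *runMetrics) {
	start := time.Now()
	err := task()
	m.LastDuration = time.Since(start)
	m.Runs++
	if err != nil {
		m.Failed++
		log.Printf("periodic task failed: %s", err)
		return
	}
	m.Succeeded++
}

// runPeriodically runs the task immediately on start, then once per period
// until ctx is cancelled, so restarting the process triggers a run right away.
func runPeriodically(ctx context.Context, period time.Duration, task func() error, m *runMetrics) {
	runOnce(task, m)
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			runOnce(task, m)
		}
	}
}

// job is a placeholder task that fails on every second attempt.
type job struct {
	attempts int
}

func (j *job) refresh() (err error) {
	j.attempts++
	if j.attempts%2 == 0 {
		err = errors.New("upstream unavailable")
	}
	return err
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 35*time.Millisecond)
	defer cancel()

	metrics := &runMetrics{}
	work := &job{}
	runPeriodically(ctx, 10*time.Millisecond, work.refresh, metrics)
	fmt.Printf("runs=%d succeeded=%d failed=%d\n", metrics.Runs, metrics.Succeeded, metrics.Failed)
}
```

A failed run is counted and logged, and the loop keeps going; only cancelling the context stops it.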
List All Metrics in the README
The README alone should be sufficient to set up dashboards and alerts. List all metrics and labels.
Security & Compliance
Non-Negotiables
- Security, compliance, and reliability are non-negotiable
- Never suggest sharing data with third-party vendors without explicit approval
- Access controls and audit trails are critical
- Assume every query is logged and auditable
- Never suggest turning off security controls globally
- All modifications to security controls must be minimal and closely targeted
- Read-only operations only unless explicitly authorized
Investigation Methodology
Critical: Don’t Assume, Investigate
- Read the actual code—don’t pattern-match to common scenarios
- If told to examine specific code, READ IT FIRST
- Organizations often have custom implementations that don’t follow typical patterns
- Stop and examine actual implementation before assuming standard patterns
Systematic Debugging Process:
- Start with the obvious: recent deployments, known incidents
- Check metrics for anomalies
- Correlate with logs
- Cross-reference with system events
- Be systematic—don’t jump to conclusions
- Document your investigation path
For security tool blocks (WAF, etc.):
- Query security logs to understand what triggered
- Identify the rule ID
- Correlate with the upstream request in application logs
- Understand the legitimate use case before suggesting rule changes
Documentation
Documentation Must
- Reside in the repo for the code it describes
- Be written in Markdown
- Be accessible via link from Notion
- Be stored in top-level /docs directory
- Be renderable in any format from Markdown source
- Be rendered in PDF format
- Be named clearly in CamelCase (e.g., ServiceArchitecture.pdf)
Documentation Must Not
- Be located in multiple places
- Be stored in dead-end formats where they can’t be easily extracted or re-rendered
Rendering
Use pandoc to render Markdown to nearly any format:
pandoc -f markdown -t pdf -o NameOfDoc.pdf <source file>
Warning: pandoc’s default PDF output depends on a LaTeX installation, and LaTeX is huge. Be prepared for a large download.
Documentation examples should follow this pattern, with PDFs generated from Markdown source.
Communication Style
- Be direct and technical
- Don’t sugar-coat problems
- Assume the user understands the technology
- Skip pleasantries, focus on substance
- If you don’t know something, say so clearly
What NOT to Do
- Don’t propose “quick fixes” that bypass established processes
- Don’t suggest disabling linters or tests to make code pass
- Don’t make assumptions about what’s acceptable in production
- Don’t auto-apply changes without review
- Don’t treat compliance requirements as negotiable
- Don’t propose changes requiring manual intervention in production
Remember
Every decision has security, compliance, and reliability implications. When uncertain about whether something meets standards, err on the side of caution and ask.
Excellence is not optional. It’s a practice, a discipline, and ultimately, a habit. Make it yours.
“I will find a way, or I will make one.” - Hannibal Barca