How Git Really Works
Foundational Architecture: Content-Addressable Object Storage
Git’s architecture fundamentally differs from traditional version control systems through its implementation as a content-addressable filesystem with version control abstractions layered on top. Understanding this core design principle illuminates why Git commands behave the way they do and enables more sophisticated repository management.
The Object Database: Git’s Storage Foundation
At its lowest level, Git operates as a key-value data store where:
- Keys are SHA-1 hashes (40-character hexadecimal strings)
- Values are compressed object contents stored in .git/objects/
This architecture provides several critical guarantees:
Content Integrity: Any modification to an object’s contents produces a different SHA-1 hash, so corruption or tampering is detectable. Git recomputes hashes and compares them against object IDs whenever it validates objects, for example during git fsck and when unpacking objects received from a remote.
Deduplication: Identical content across different files, commits, or branches is stored only once. The hash serves as a natural deduplication mechanism—if two files contain identical bytes, they produce identical hashes and reference the same storage object.
Immutability: Once an object is created with a given SHA-1, it never changes. “Modifying” a commit actually creates new objects with different hashes. This immutability underpins Git’s reliable history tracking.
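The three guarantees above can be illustrated with a minimal sketch of a content-addressable store. The ObjectStore class is a hypothetical in-memory stand-in for .git/objects/, not part of Git; the only Git-specific detail assumed is that an object ID is the SHA-1 of a "<type> <size>" header, a NUL byte, and the content.

```python
import hashlib


def object_id(obj_type: str, data: bytes) -> str:
    """Return the SHA-1 of '<type> <size>' + NUL + data, as Git computes object IDs."""
    header = f"{obj_type} {len(data)}\0".encode()
    return hashlib.sha1(header + data).hexdigest()


class ObjectStore:
    """Toy in-memory stand-in for .git/objects/ (hypothetical, not Git's API)."""

    def __init__(self):
        self._objects = {}

    def put(self, obj_type: str, data: bytes) -> str:
        oid = object_id(obj_type, data)
        # Identical content maps to the same key, so it is stored only once.
        # An existing entry is never overwritten: objects are immutable.
        self._objects.setdefault(oid, (obj_type, data))
        return oid

    def get(self, oid: str) -> bytes:
        return self._objects[oid][1]

    def __len__(self) -> int:
        return len(self._objects)


store = ObjectStore()
first = store.put("blob", b"test\n")
second = store.put("blob", b"test\n")    # same bytes, same key
changed = store.put("blob", b"test!\n")  # different bytes, new key
print(first, len(store))
# prints: 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 2
```

Changing one byte produced a new key rather than altering the existing entry, which is the sense in which "modifying" an object always means creating another one.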
The Three Fundamental Object Types
Git’s version control model is built mainly on three object types, each serving a distinct role in the repository structure. A fourth type, the tag object, backs annotated tags and is described under Tag References.
1. Blob Objects: Pure Content Storage
Definition: A blob (Binary Large Object) stores file contents—nothing more.
Key Characteristics:
- Contains only raw file data (bytes)
- No filename information
- No directory structure
- No metadata (permissions, timestamps)
Storage Example:
# Compute the blob ID for the content "Hello, Git!" (hash-object only hashes unless -w is given):
echo "Hello, Git!" | git hash-object --stdin
# Output: 8d0e41234f24b6da002d962a26c2495ea16a425f
# Once the object is written (for example by git add), Git stores it in:
# .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f
Deduplication in Action: If three different files across multiple commits contain identical contents, Git creates only one blob object. All references point to this single storage location.
Technical Implication: Renaming a file costs nothing in terms of storage—the blob remains unchanged, only the tree object (see below) updates to reflect the new filename.
2. Tree Objects: Directory Structure Representation
Definition: A tree object represents a directory, mapping filenames to blob objects (for files) or other tree objects (for subdirectories).
Structure:
100644 blob 8d0e41234f24b6da002d962a26c2495ea16a425f README.md
100644 blob 5f4f6a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f main.py
040000 tree 9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b src/
Each Entry Contains:
- Mode: Entry type and permission bits (100644 = regular file, 100755 = executable file, 040000 = directory)
- Type: Object type (blob or tree)
- SHA-1: Hash of the referenced object
- Name: Filename or directory name
Hierarchical Organization: Trees can reference other trees, creating the familiar directory structure. The root tree represents the repository’s top-level directory.
Example Repository Structure:
Root Tree (abc123...)
├── README.md (blob: 8d0e41...)
├── main.py (blob: 5f4f6a...)
└── src/ (tree: 9a8b7c...)
    ├── __init__.py (blob: 1a2b3c...)
    └── utils.py (blob: 4d5e6f...)
Technical Insight: Moving a file between directories doesn’t duplicate storage: both the old and new tree objects reference the same blob hash. Only the tree objects change.
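To make the tree structure concrete, here is a rough sketch of how tree entries are serialized and hashed. It assumes the on-disk layout of a tree object: each entry is "<mode> <name>", a NUL byte, then the 20 raw bytes of the referenced hash, with directory modes written as 40000. The file names and contents are made up for illustration.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Serialize (mode, name, hex_id) entries the way a tree object stores them."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git orders directories as if their name ended with "/".
        return (name + "/" if mode == "40000" else name).encode()

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        # "<mode> <name>" + NUL + 20 raw hash bytes
        body += f"{mode} {name}".encode() + b"\0" + bytes.fromhex(hex_id)
    return body


readme = object_id("blob", b"# Demo\n")
util = object_id("blob", b"def helper():\n    return 1\n")

# Renaming the file changes the tree, while the blob ID stays the same.
src_before = object_id("tree", tree_body([("100644", "utils.py", util)]))
src_after = object_id("tree", tree_body([("100644", "helpers.py", util)]))

root = object_id("tree", tree_body([
    ("100644", "README.md", readme),
    ("40000", "src", src_before),
]))
empty_tree = object_id("tree", tree_body([]))
print(empty_tree)
# prints: 4b825dc642cb6eb9a060e54bf8d69288fbee4904
```

Because the file name lives only in the tree entry, the rename produced a different tree ID without touching the blob.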
3. Commit Objects: Snapshots with History
Definition: A commit object represents a complete snapshot of the repository at a specific point in time, linking to parent commits to form the project history.
Commit Object Anatomy:
tree 9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b
parent 4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5
author Jane Developer <jane@example.com> 1704067200 +0000
committer Jane Developer <jane@example.com> 1704067200 +0000

Implement user authentication system

- Add JWT token generation
- Implement password hashing with bcrypt
- Create user session management

Components Explained:
Tree Reference: Points to the root tree object representing the complete repository state at this commit.
Parent References:
- Most commits have one parent (linear history)
- Merge commits have multiple parents (2 or more)
- Initial commit has no parent
Author vs. Committer:
- Author: Who wrote the changes
- Committer: Who committed them to the repository
- These differ when applying patches or rebasing commits created by others
Metadata: Includes timestamp, timezone, and full commit message
Technical Characteristic: Commits reference complete snapshots, not diffs. Git computes differences dynamically by comparing tree objects when needed.
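As a minimal sketch of the anatomy described above, the following builds commit objects by hand and hashes them. The identity string and messages are placeholders, and the tree ID used is Git's well-known empty-tree hash; the layout (header lines, a blank line, then the message) is assumed to match what git cat-file -p prints.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "Jane Developer <jane@example.com> 1704067200 +0000"  # placeholder identity


def commit_id(tree, parents, author, committer, message):
    """Serialize a commit object and return its SHA-1."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # zero, one, or several parents
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


root_commit = commit_id(EMPTY_TREE, [], WHO, WHO, "Initial commit\n")
child = commit_id(EMPTY_TREE, [root_commit], WHO, WHO, "Add feature\n")
reworded = commit_id(EMPTY_TREE, [root_commit], WHO, WHO, "Add login feature\n")

print(child != reworded)  # same tree and parent, different message
# prints: True
```

Every field, including the parent IDs, feeds into the hash, so a commit's ID depends on its entire ancestry.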
Reference Management: Pointers to Commits
Git’s references provide human-readable names for commit hashes, abstracting away the need to remember 40-character SHA-1 strings.
Branch References: Mutable Pointers
Technical Definition: A branch is a lightweight reference file containing a single commit SHA-1.
Storage Location: .git/refs/heads/<branch-name> (or an entry in .git/packed-refs once references have been packed)
Content Example:
# Contents of .git/refs/heads/main
4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5
Mutability: When you create a new commit on a branch, Git simply updates the file to point to the new commit’s SHA-1. The old commit remains in the object database, now referenced as the parent.
Branch Creation Cost: Creating a branch requires only writing 41 bytes to a file (40-character hash + newline). This explains Git’s philosophy of cheap, frequent branching.
Example:
# Creating a branch is roughly equivalent to writing the current commit's hash into a ref file:
git rev-parse HEAD > .git/refs/heads/feature-branch
# The Git command abstracts this:
git branch feature-branch
HEAD: The Current Position Indicator
Definition: HEAD is a special reference indicating the current checkout position in the repository.
Two Operational Modes:
1. Symbolic Reference (Normal Mode):
# Contents of .git/HEAD
ref: refs/heads/main
HEAD points to a branch reference, which points to a commit. Creating new commits moves the branch reference forward, and HEAD follows automatically.
2. Detached HEAD State:
# Contents of .git/HEAD when detached
4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5
HEAD points directly to a commit SHA-1, not a branch. New commits don’t move any branch reference, so they become “orphaned” (unreachable from any branch) as soon as you switch away.
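The two HEAD modes can be sketched with plain files in a temporary directory. This is an illustration only: resolve_head is a hypothetical helper, the commit IDs are placeholders, and real Git also consults .git/packed-refs, which this sketch ignores.

```python
import tempfile
from pathlib import Path


def resolve_head(git_dir: Path) -> str:
    """Follow HEAD to a commit ID, handling symbolic and detached forms."""
    content = (git_dir / "HEAD").read_text().strip()
    if content.startswith("ref: "):
        # Symbolic ref: HEAD names a branch file that holds the commit ID.
        return (git_dir / content[len("ref: "):]).read_text().strip()
    return content  # Detached: HEAD holds the commit ID itself.


COMMIT_A = "a" * 40  # placeholder IDs, not real commits
COMMIT_B = "b" * 40

git_dir = Path(tempfile.mkdtemp()) / ".git"
(git_dir / "refs" / "heads").mkdir(parents=True)

(git_dir / "refs" / "heads" / "main").write_text(COMMIT_A + "\n")
(git_dir / "HEAD").write_text("ref: refs/heads/main\n")
attached = resolve_head(git_dir)

# "Committing" on main only rewrites the branch file; HEAD itself is untouched.
(git_dir / "refs" / "heads" / "main").write_text(COMMIT_B + "\n")
after_commit = resolve_head(git_dir)

# Detached HEAD: the file holds a commit ID directly.
(git_dir / "HEAD").write_text(COMMIT_A + "\n")
detached = resolve_head(git_dir)
```

In the attached case HEAD "follows" the branch only because it is resolved through the branch file each time it is read.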
Practical Implications:
Normal Operation:
# On branch main
git commit -m "Add feature"
# Result: main branch moves forward, HEAD follows
Detached HEAD:
git checkout 4d3e2f1a0b
# HEAD now points directly to this commit
git commit -m "Experimental change"
# Result: New commit created but no branch references it
# To preserve it, create a branch at the new commit: git branch experiment-branch
Tag References: Immutable Markers
Lightweight Tags: Simple references to commits, functionally identical to branches but conventionally immutable.
# Storage: .git/refs/tags/<tag-name>
# Contents: Single commit SHA-1
Annotated Tags: Full objects with metadata (tagger, date, message). They can optionally be cryptographically signed (git tag -s) for release verification.
git tag -a v1.0.0 -m "Release version 1.0.0"
# Creates tag object with metadata, pointing to commit
Use Case Distinction:
- Lightweight tags: Bookmarks for internal reference
- Annotated tags: Official releases requiring verification and documentation
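The difference between the two tag kinds can be sketched as follows. The tagger identity and commit ID are placeholders, and the layout of the tag object (object, type, tag, tagger, blank line, message) is assumed to match what git cat-file -p shows for an annotated tag.

```python
import hashlib


def tag_object(target_commit, name, tagger, message):
    """Build an annotated tag object and return (tag object ID, body bytes)."""
    body = (
        f"object {target_commit}\n"
        "type commit\n"
        f"tag {name}\n"
        f"tagger {tagger}\n"
        "\n"
        f"{message}\n"
    ).encode()
    header = f"tag {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest(), body


commit = "c" * 40  # placeholder commit ID

# Lightweight tag: the ref file holds the commit ID itself.
lightweight_ref = commit + "\n"

# Annotated tag: the ref file holds the ID of a separate tag object.
tag_id, tag_body = tag_object(
    commit,
    "v1.0.0",
    "Jane Developer <jane@example.com> 1704067200 +0000",  # placeholder tagger
    "Release version 1.0.0",
)
annotated_ref = tag_id + "\n"
```

The extra level of indirection is what gives an annotated tag room for its own author, date, and message.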
Why This Architecture Matters: Practical Implications
Understanding Git’s object model and reference system explains otherwise mysterious behaviors.
Implication 1: Identical Files Cost No Extra Storage
Scenario: You have config.json identical across 50 branches.
Traditional VCS: potentially 50 copies stored
Git: One blob object, referenced by the tree objects of all 50 branches
Verification:
# Create two files with identical content
echo "test" > file1.txt
echo "test" > file2.txt
git add file1.txt file2.txt
git ls-files -s
# Output shows same blob hash for both files:
# 100644 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 0 file1.txt
# 100644 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 0 file2.txt
Implication 2: Branch Operations Are Instantaneous
Switching branches requires:
- Update .git/HEAD to point to the new branch (update one file)
- Update the working tree to match the new branch’s tree (filesystem operations)
No data copying occurs—all objects already exist in .git/objects/.
Performance Characteristic: Branch switching time depends mainly on the working directory and on how many files differ between the two branches, not on the size of the repository history.
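A rough sketch of why a switch only touches what differs: if each branch's snapshot is flattened into a mapping from path to blob ID, comparing the two mappings yields the files to rewrite. The function name, the paths, and the shortened IDs below are illustrative assumptions; Git's real implementation walks tree objects and the index rather than flat dictionaries.

```python
def paths_to_update(current, target):
    """Compare two flattened snapshots (path -> blob ID) and list required actions."""
    actions = {}
    for path in sorted(current.keys() | target.keys()):
        old, new = current.get(path), target.get(path)
        if old == new:
            continue  # identical blob ID: the file on disk is already correct
        if new is None:
            actions[path] = "delete"
        elif old is None:
            actions[path] = "create"
        else:
            actions[path] = "rewrite"
    return actions


main_snapshot = {"README.md": "8d0e41", "main.py": "5f4f6a", "src/utils.py": "4d5e6f"}
feature_snapshot = {"README.md": "8d0e41", "main.py": "aaaaaa", "src/auth.py": "bbbbbb"}

print(paths_to_update(main_snapshot, feature_snapshot))
# prints: {'main.py': 'rewrite', 'src/auth.py': 'create', 'src/utils.py': 'delete'}
```

README.md is skipped entirely because both snapshots reference the same blob.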
Implication 3: History Is Immutable (By Design)
Cannot Modify a Commit: Changing any aspect of a commit (message, content, metadata) changes its SHA-1 hash, creating a new commit.
“Amending” commits actually creates new commits:
git commit -m "Initial message"
# Creates commit abc123...
git commit --amend -m "Updated message"
# Creates new commit def456... with different hash
# Original commit abc123... still exists (until it is eventually garbage collected)
Rebase operations create entirely new commit chains:
# Before rebase
A - B - C (feature-branch)
# After rebase onto main
A' - B' - C' (feature-branch)
# A', B', C' are new commits with different SHAs
# Original A, B, C still exist until garbage collected
Security Implication: This immutability makes Git a strong audit trail: tampering with history changes the hash of the altered commit and of every commit after it, so anyone holding the original hashes can detect the rewrite.
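The ripple effect can be sketched by hashing a small chain of commits twice, once with the first message altered. The identity, tree ID (Git's empty-tree hash), and messages are placeholders chosen for illustration.

```python
import hashlib

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
WHO = "Jane Developer <jane@example.com> 1704067200 +0000"  # placeholder identity


def commit_id(parent, message):
    """Hash a commit whose only varying inputs are its parent and message."""
    lines = [f"tree {EMPTY_TREE}"]
    if parent is not None:
        lines.append(f"parent {parent}")
    lines += [f"author {WHO}", f"committer {WHO}", "", message + "\n"]
    body = "\n".join(lines).encode()
    return hashlib.sha1(f"commit {len(body)}\0".encode() + body).hexdigest()


def build_chain(messages):
    """Return commit IDs for a linear history, oldest first."""
    ids, parent = [], None
    for message in messages:
        parent = commit_id(parent, message)
        ids.append(parent)
    return ids


original = build_chain(["A", "B", "C"])
rewritten = build_chain(["A (edited)", "B", "C"])

# Every commit from the edited one onward gets a new ID.
print([old != new for old, new in zip(original, rewritten)])
# prints: [True, True, True]
```

Editing only the last commit would leave the earlier IDs intact, which is why the cost of rewriting history grows with how far back the change is made.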
Implication 4: Corruption Detection Is Automatic
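Git re-hashes object contents when it validates them, so a damaged object is reported rather than silently used:
# If .git/objects/ab/cdef123... is corrupted:
git cat-file -p abcdef123
# Git exits with an error reporting a corrupt object or hash mismatch (exact wording varies by version)
Corruption Sources: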
- Disk errors
- Filesystem bugs
- Improper manual modification of .git/
Recovery Strategy:
- Identify corrupted object via hash mismatch
- Attempt recovery from remote repository
- Use git fsck to identify all affected references
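The hash-mismatch check at the heart of this strategy can be sketched in a few lines. It assumes a loose object is the zlib-compressed form of "<type> <size>", a NUL byte, and the content; encode_loose and verify_loose are hypothetical helper names, not Git commands.

```python
import hashlib
import zlib


def encode_loose(obj_type, data):
    """Return (object ID, zlib-compressed bytes) as stored for a loose object."""
    raw = f"{obj_type} {len(data)}\0".encode() + data
    return hashlib.sha1(raw).hexdigest(), zlib.compress(raw)


def verify_loose(expected_id, compressed):
    """Re-hash the decompressed bytes and compare with the ID taken from the path."""
    try:
        raw = zlib.decompress(compressed)
    except zlib.error:
        return False  # damaged or truncated compression stream
    return hashlib.sha1(raw).hexdigest() == expected_id


oid, stored = encode_loose("blob", b"test\n")
loose_path = f".git/objects/{oid[:2]}/{oid[2:]}"

# Simulate corruption: same path, but one content byte differs.
damaged = zlib.compress(b"blob 5\0tesT\n")

print(verify_loose(oid, stored), verify_loose(oid, damaged))
# prints: True False
```

Because the expected ID is encoded in the file's location, no separate checksum file is needed to detect the damage.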
Advanced Architectural Details
Pack Files: Storage Optimization
Problem: Storing every version of every file as complete objects is inefficient for large repositories.
Solution: Git periodically compresses objects into pack files using delta compression.
Pack File Structure:
- Base object stored in full
- Similar objects (typically other versions of the same file) stored as deltas (differences from a base)
- Dramatically reduces storage for files with minor changes
When Packing Occurs:
- git gc (garbage collection)
- git push (packs objects for network transfer)
- Automatic background operations
Technical Note: Pack files are implementation details—the logical object model remains unchanged. Git transparently unpacks objects when accessed.
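The idea behind delta storage can be sketched with copy and insert instructions. This is a simplified illustration that uses Python's difflib to find matching regions; it is not Git's actual binary delta format or its matching algorithm, and the sample file contents are made up.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes):
    """Describe target as 'copy from base' and 'insert literal' instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not present in base
        # "delete" needs no instruction: those base bytes are simply not copied.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base object and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


v1 = b"host = localhost\nport = 8080\ndebug = false\ntimeout = 30\n"
v2 = v1.replace(b"debug = false", b"debug = true")

delta = make_delta(v1, v2)
literal_bytes = sum(len(op[1]) for op in delta if op[0] == "insert")
rebuilt = apply_delta(v1, delta)
```

Only the literal bytes need to be stored alongside the copy instructions, which is where the space saving for small edits comes from.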
Loose Objects vs. Packed Objects
Loose Objects: Individual files in .git/objects/, organized by the first two characters of the SHA-1:
.git/objects/
├── 4d/
│   └── 3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5
└── 8d/
    └── 0e41234f24b6da002d962a26c2495ea16a425f
Packed Objects: Multiple objects compressed into .git/objects/pack/:
.git/objects/pack/
├── pack-abc123def456.idx (index for quick lookup)
└── pack-abc123def456.pack (compressed objects)
Performance Trade-off:
- Loose: Fast write, slower for large numbers of objects
- Packed: Slower write (compression overhead), much faster read at scale
The Reflog: Safety Net for “Deleted” Commits
Problem: Commits not referenced by any branch appear lost.
Solution: Git maintains a reference log (reflog) tracking every position HEAD has pointed to.
Storage: .git/logs/
Example:
git reflog
# Output:
# abc123 HEAD@{0}: commit: Add feature
# def456 HEAD@{1}: checkout: moving from main to feature
# ghi789 HEAD@{2}: commit: Fix bug
Recovery Use Case:
# Accidentally reset too far
git reset --hard HEAD~5
# Recover using reflog
git reflog
# Identify lost commit SHA
git reset --hard abc123
Retention: Reflog entries expire after 90 days by default (30 days for entries whose commits are no longer reachable from the current branch tip), after which truly unreferenced commits can be garbage collected.
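The recovery flow above can be modeled with a toy reference that records every move. The Ref class is a hypothetical illustration, and the short IDs reuse the placeholder values from the example output; real reflog entries also store the old value, identity, and timestamp on disk.

```python
class Ref:
    """Toy reference that logs every change of position, like HEAD's reflog."""

    def __init__(self, commit):
        self.commit = commit
        self.log = []  # oldest entry first: (old, new, message)

    def move(self, new_commit, message):
        self.log.append((self.commit, new_commit, message))
        self.commit = new_commit

    def reflog(self):
        """Return (selector, commit, message) tuples, newest first."""
        return [
            (f"HEAD@{{{n}}}", new, message)
            for n, (_, new, message) in enumerate(reversed(self.log))
        ]


head = Ref("ghi789")
head.move("abc123", "commit: Add feature")
head.move("ghi789", "reset: moving to HEAD~1")  # abc123 is now unreferenced

entries = head.reflog()
lost = next(commit for _, commit, msg in entries if msg.startswith("commit:"))
head.move(lost, f"reset: moving to {lost}")     # recovery

print(head.commit)
# prints: abc123
```

The reset did not delete anything; it only moved the pointer, and the log still remembers where the pointer had been.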
Mental Model: Git as a Directed Acyclic Graph (DAG)
Git’s commit history forms a DAG where:
- Nodes are commits (each with unique SHA-1)
- Edges are parent relationships (pointing backward in time)
- No cycles can exist (a commit can’t be its own ancestor)
Visualization:
      E---F (feature-branch)
     /
A---B---C---D (main)
         \
          G---H (hotfix-branch)
Graph Properties:
- Merge commits: Multiple parents (e.g., merging feature into main creates commit with 2 parents)
- Branch tips: Nodes with names (references) pointing to them
- Unreachable commits: Nodes with no path from any reference (eventually garbage collected)
Traversal Operations:
- git log: Walk backward from HEAD following parent links
- git rebase: Replay commits on a different base, creating a new chain of nodes
- git merge: Create a new node with multiple parents
This graph structure explains:
- Why finding common ancestors is fast (graph traversal algorithm)
- How Git determines what to push/pull (compare graph positions)
- Why merge conflicts occur (divergent paths from common ancestor)
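Finding a common ancestor can be sketched as a plain graph search over parent links. The graph below assumes the example history with feature-branch forking at B and hotfix-branch forking at C; that shape is an assumption made for illustration, and the breadth-first search is a simplification of what git merge-base does.

```python
from collections import deque

# child -> list of parents (edges point backward in time)
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"],  # main
    "E": ["B"], "F": ["E"],                       # feature-branch (assumed fork at B)
    "G": ["C"], "H": ["G"],                       # hotfix-branch (assumed fork at C)
}


def ancestors(commit):
    """Return the commit plus everything reachable through parent links."""
    seen, queue = set(), deque([commit])
    while queue:
        current = queue.popleft()
        if current not in seen:
            seen.add(current)
            queue.extend(PARENTS[current])
    return seen


def merge_base(a, b):
    """Return the first commit on b's side that is also reachable from a."""
    reachable_from_a = ancestors(a)
    seen, queue = set(), deque([b])
    while queue:
        current = queue.popleft()
        if current in reachable_from_a:
            return current
        if current not in seen:
            seen.add(current)
            queue.extend(PARENTS[current])
    return None


print(merge_base("F", "D"), merge_base("H", "D"), merge_base("F", "H"))
# prints: B C B
```

The commits that exist on one side but not in the merge base's ancestry are exactly the ones a push or pull has to transfer.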
Practical Application: Understanding Command Behavior
Why git checkout Was Confusing
Problem: git checkout performed two conceptually different operations:
- Switch branches (move HEAD)
git checkout main # Move HEAD to point to main branch
- Restore files (update working tree)
git checkout -- file.txt # Restore file from the index (staged version)
Solution: Git 2.23+ split these:
git switch main # Branch switching only
git restore file.txt # File restoration only
Why this split works: It aligns commands with the underlying operations: switch modifies the HEAD reference, while restore updates the working tree from the index or object database.
Why Branching Is “Free”
Creating 1000 branches:
for i in {1..1000}; do
git branch branch-$i
done
Storage cost: ~41KB of reference data (41 bytes × 1000 branches)
Time complexity: O(1) per branch (write single reference file)
No objects are copied—branches are merely pointers into existing history.
Why Commits Are Immutable
Attempting to “modify” a commit:
# Original commit
git commit -m "Add feature" # SHA: abc123
# "Modify" commit message
git commit --amend -m "Add user authentication feature" # SHA: def456
What actually happened:
- Created new commit object with:
  - Same tree (same files)
  - Same parent
  - Different message → Different SHA
- Updated branch reference to point to new commit
- Original commit abc123 now unreferenced (but still exists until GC)
This explains:
- Why force-push is required after amending published commits
- Why rewriting history changes all subsequent commit SHAs
- Why recovering “lost” commits via reflog works
Debugging with Object Model Knowledge
Investigating Repository State
# View current HEAD commit
git cat-file -p HEAD
# Output: commit object contents
# View tree structure
git ls-tree HEAD
# Output: tree entries with blob/tree SHAs
# View specific blob content
git cat-file -p <blob-sha>
# Output: file contents
# Verify object integrity
git fsck --full
# Checks all objects for corruption and connectivity
Understanding Merge Conflicts
Merge conflict source: Git attempts three-way merge:
- Common ancestor commit (merge base)
- Your changes (current branch tip)
- Their changes (merging branch tip)
Conflict occurs when: The same lines were modified differently on both sides (items 2 and 3) relative to the merge base.
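The three-way decision rule can be sketched as follows. For brevity this version compares whole files (path to blob ID) instead of individual lines, so it is a coarser illustration of the same logic rather than Git's merge algorithm; the snapshots and IDs are made up.

```python
def merge_entry(base, ours, theirs):
    """Apply the three-way rule to one value; return (result, is_conflict)."""
    if ours == theirs:
        return ours, False    # both sides agree (or neither changed)
    if ours == base:
        return theirs, False  # only their side changed
    if theirs == base:
        return ours, False    # only our side changed
    return None, True         # both changed, differently


def three_way_merge(base, ours, theirs):
    """Merge three snapshots (path -> blob ID); None means the path is absent."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        value, conflict = merge_entry(base.get(path), ours.get(path), theirs.get(path))
        if conflict:
            conflicts.append(path)
        elif value is not None:
            merged[path] = value
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
ours = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
theirs = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}

print(three_way_merge(base, ours, theirs))
# prints: ({'a.txt': '2', 'b.txt': '3'}, ['c.txt'])
```

Without the merge base there would be no way to tell which side changed a.txt or b.txt, and every difference would look like a conflict.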
Resolution strategy: Git writes both competing versions into the file between conflict markers (setting merge.conflictStyle to diff3 also includes the merge-base version):
<<<<<<< HEAD (your changes)
implementation A
=======
implementation B
>>>>>>> feature-branch (their changes)
Manual resolution: Choose implementation, remove conflict markers, stage file.
Summary: Architectural Principles
Git’s design centers on several key principles:
- Content-addressable storage: Every piece of data identified by its content hash
- Snapshot-based versioning: Commits represent complete repository states, not diffs
- Immutable history: Objects never change once created; operations create new objects
- Lightweight references: Branches and tags are pointers, not data containers
- Distributed by design: Every full clone contains the complete object database (shallow and partial clones are the exceptions)
Practical Outcome: Understanding these principles transforms Git from “magic commands” to predictable, logical operations on a well-designed data structure.
Further Exploration: With this foundation, advanced operations like interactive rebase, cherry-picking, and bisect become applications of graph traversal and object manipulation rather than mysterious incantations.