How Git Really Works

Foundational Architecture: Content-Addressable Object Storage

Git’s architecture fundamentally differs from traditional version control systems through its implementation as a content-addressable filesystem with version control abstractions layered on top. Understanding this core design principle illuminates why Git commands behave the way they do and enables more sophisticated repository management.

The Object Database: Git’s Storage Foundation

At its lowest level, Git operates as a key-value data store where:

  • Keys are SHA-1 hashes (40-character hexadecimal strings)
  • Values are compressed object contents stored in .git/objects/

This architecture provides several critical guarantees:

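Content Integrity: Any modification to an object’s contents produces a different SHA-1 hash, so corruption or tampering shows up as a mismatch between an object’s name and its contents. Git checks this where it matters most: git fsck recomputes the hash of every object, and objects received over the network are hashed on arrival rather than trusted by name. Routine reads do not necessarily re-verify every object.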

Deduplication: Identical content across different files, commits, or branches is stored only once. The hash serves as a natural deduplication mechanism—if two files contain identical bytes, they produce identical hashes and reference the same storage object.

Immutability: Once an object is created with a given SHA-1, it never changes. “Modifying” a commit actually creates new objects with different hashes. This immutability underpins Git’s reliable history tracking.
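The deduplication and immutability guarantees can be seen in a few lines of Python. This is a minimal in-memory sketch, not Git's implementation: the ObjectStore class and its method names are hypothetical, and the only detail borrowed from Git is that the key is the SHA-1 of a "<type> <size>\0" header followed by the content.

```python
import hashlib


class ObjectStore:
    """Toy content-addressable store: key = SHA-1 of header + content."""

    def __init__(self):
        self._objects = {}  # hex digest -> (type, content)

    def put(self, content: bytes, obj_type: str = "blob") -> str:
        # Git hashes "<type> <size>\0" followed by the raw content.
        header = f"{obj_type} {len(content)}\0".encode()
        key = hashlib.sha1(header + content).hexdigest()
        # Identical content yields an identical key, so it is stored only once.
        # An existing entry is never overwritten: objects are immutable.
        self._objects.setdefault(key, (obj_type, content))
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key][1]

    def __len__(self):
        return len(self._objects)


store = ObjectStore()
first = store.put(b"same bytes\n")
second = store.put(b"same bytes\n")    # deduplicated: no new entry
changed = store.put(b"other bytes\n")  # different content, different key
```

Here first and second are the same key and the store holds two objects, not three; "editing" content can only ever add a new object under a new key.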


The Three Fundamental Object Types

Git’s history model is built on three core object types, each serving a distinct role in the repository structure. A fourth type, the annotated tag object, is covered under Tag References below.

1. Blob Objects: Pure Content Storage

Definition: A blob (Binary Large Object) stores file contents—nothing more.

Key Characteristics:

  • Contains only raw file data (bytes)
  • No filename information
  • No directory structure
  • No metadata (permissions, timestamps)

Storage Example:

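# Compute the blob ID for "Hello, Git!" (echo appends a newline,
# and that newline is part of the hashed content):
echo "Hello, Git!" | git hash-object --stdin
# Prints a 40-character SHA-1; the value used here is illustrative:
# 8d0e41234f24b6da002d962a26c2495ea16a425f

# With -w (or when the file is staged via git add), Git also writes the object to:
# .git/objects/8d/0e41234f24b6da002d962a26c2495ea16a425f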
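The hashing step can be reproduced without Git. The sketch below assumes the standard loose-object layout (a "blob <size>\0" header, SHA-1 over header plus content, zlib compression); the function names blob_id and loose_object are hypothetical. It uses the content "test" plus a newline, whose blob ID also appears in the git ls-files output later in this article.

```python
import hashlib
import zlib


def blob_id(data: bytes) -> str:
    """SHA-1 that Git assigns to a blob with these contents."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def loose_object(data: bytes):
    """Return (path relative to .git, compressed bytes) for a loose blob."""
    store = b"blob %d\0" % len(data) + data
    oid = hashlib.sha1(store).hexdigest()
    # First two hex characters name the directory, the remaining 38 the file.
    path = f"objects/{oid[:2]}/{oid[2:]}"
    return path, zlib.compress(store)


path, compressed = loose_object(b"test\n")
print(blob_id(b"test\n"))  # 9daeafb9864cf43055ae93beb0afd6c7d144bfa4
print(path)                # objects/9d/aeafb9864cf43055ae93beb0afd6c7d144bfa4
```

Note that the filename never enters the hash, which is why a blob carries no name of its own.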

Deduplication in Action: If three different files across multiple commits contain identical contents, Git creates only one blob object. All references point to this single storage location.

Technical Implication: Renaming a file costs almost nothing in terms of storage. The blob is reused unchanged; Git only writes a new tree object (see below) that lists the new filename.


2. Tree Objects: Directory Structure Representation

Definition: A tree object represents a directory, mapping filenames to blob objects (for files) or other tree objects (for subdirectories).

Structure:

100644 blob 8d0e41234f24b6da002d962a26c2495ea16a425f    README.md
100644 blob 5f4f6a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f    main.py
040000 tree 9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b    src/

Each Entry Contains:

  • Mode: File type and permission bits (100644 = regular file, 100755 = executable, 120000 = symbolic link, 040000 = directory)
  • Type: Object type (blob or tree)
  • SHA-1: Hash of the referenced object
  • Name: Filename or directory name

Hierarchical Organization: Trees can reference other trees, creating the familiar directory structure. The root tree represents the repository’s top-level directory.

Example Repository Structure:

Root Tree (abc123...)
├── README.md (blob: 8d0e41...)
├── main.py (blob: 5f4f6a...)
└── src/ (tree: 9a8b7c...)
    ├── __init__.py (blob: 1a2b3c...)
    └── utils.py (blob: 4d5e6f...)

Technical Insight: Moving a file between directories doesn’t duplicate storage—both the old and new tree objects reference the same blob hash. Only the tree metadata changes.
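A tree's hash can be computed the same way as a blob's. The sketch below assumes Git's binary tree encoding: entries sorted by name, each written as "<mode> <name>\0" followed by the 20-byte binary object ID, with directories using mode 40000 (the leading zero in 040000 is only added for display). The helper names are hypothetical.

```python
import hashlib


def blob_id(data: bytes) -> str:
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def tree_id(entries) -> str:
    """entries: iterable of (mode, name, hex_object_id) tuples."""
    def sort_key(entry):
        mode, name, _ = entry
        raw = name.encode()
        # Git orders entries by name, comparing directories as if "name/".
        return raw + b"/" if mode == "40000" else raw

    body = b""
    for mode, name, oid in sorted(entries, key=sort_key):
        # Each entry is "<mode> <name>\0" followed by the 20-byte binary id.
        body += f"{mode} {name}\0".encode() + bytes.fromhex(oid)
    return hashlib.sha1(b"tree %d\0" % len(body) + body).hexdigest()


readme = blob_id(b"Hello\n")
before = tree_id([("100644", "README.md", readme)])
after = tree_id([("100644", "README.txt", readme)])  # same blob, new name

# Moving the file into a subdirectory also reuses the same blob id.
docs = tree_id([("100644", "README.md", readme)])
moved = tree_id([("40000", "docs", docs)])
```

The rename produces a different tree ID while the blob ID stays the same, so only tree data is rewritten.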


3. Commit Objects: Snapshots with History

Definition: A commit object represents a complete snapshot of the repository at a specific point in time, linking to parent commits to form the project history.

Commit Object Anatomy:

tree 9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b
parent 4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e
author Jane Developer <jane@example.com> 1704067200 +0000
committer Jane Developer <jane@example.com> 1704067200 +0000

Implement user authentication system

- Add JWT token generation
- Implement password hashing with bcrypt
- Create user session management

Components Explained:

Tree Reference: Points to the root tree object representing the complete repository state at this commit.

Parent References:

  • Most commits have one parent (linear history)
  • Merge commits have multiple parents (2 or more)
  • Initial commit has no parent

Author vs. Committer:

  • Author: Who wrote the changes
  • Committer: Who committed them to the repository
  • These differ when applying patches or rebasing commits created by others

Metadata: Includes timestamp, timezone, and full commit message

Technical Characteristic: Commits reference complete snapshots, not diffs. Git computes differences dynamically by comparing tree objects when needed.
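Since a commit is just text with a header, its ID can be sketched the same way as a blob's. The example below is an illustration under stated assumptions: commit_id is a hypothetical helper, the identity string is a placeholder, and the tree used is Git's well-known empty tree so the example needs no other objects.

```python
import hashlib


def commit_id(tree, parents, author, committer, message):
    """Serialize a commit the way `git cat-file -p` prints it, then hash it."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    return hashlib.sha1(b"commit %d\0" % len(body) + body).hexdigest()


EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree id
who = "Jane Developer <jane@example.com> 1704067200 +0000"  # placeholder identity

root = commit_id(EMPTY_TREE, [], who, who, "Initial commit\n")
child = commit_id(EMPTY_TREE, [root], who, who, "Second commit\n")
reworded = commit_id(EMPTY_TREE, [root], who, who, "Second commit, reworded\n")
```

child and reworded share a tree and a parent yet get different IDs, because the message is part of the hashed bytes. The commit stores a tree reference, never a diff.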


Reference Management: Pointers to Commits

Git’s references provide human-readable names for commit hashes, abstracting away the need to remember 40-character SHA-1 strings.

Branch References: Mutable Pointers

Technical Definition: A branch is a lightweight reference file containing a single commit SHA-1.

Storage Location: .git/refs/heads/<branch-name> (Git may also consolidate references into the single file .git/packed-refs)

Content Example:

# Contents of .git/refs/heads/main
4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e

Mutability: When you create a new commit on a branch, Git simply updates the file to point to the new commit’s SHA-1. The old commit remains in the object database, now referenced as the parent.

Branch Creation Cost: Creating a branch requires only writing 41 bytes to a file (40-character hash + newline). This explains Git’s philosophy of cheap, frequent branching.

Example:

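# At the file level, creating a branch amounts to:
echo "4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e" > .git/refs/heads/feature-branch

# The supported ways to do the same thing:
git update-ref refs/heads/feature-branch 4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e
git branch feature-branch 4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e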
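The 41-byte claim is easy to check. The sketch below only imitates the refs/heads layout inside a temporary directory and never touches a real repository; the commit IDs are stand-ins derived from arbitrary strings.

```python
import hashlib
import tempfile
from pathlib import Path

# Stand-in ids for illustration; a real ref holds an existing commit's id.
commit = hashlib.sha1(b"stand-in commit A").hexdigest()
new_tip = hashlib.sha1(b"stand-in commit B").hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    heads = Path(tmp) / "refs" / "heads"
    heads.mkdir(parents=True)

    # "Creating a branch": write the commit id plus a newline.
    ref = heads / "feature-branch"
    ref.write_bytes((commit + "\n").encode())
    size = ref.stat().st_size  # 40 hex characters + newline

    # "Committing on the branch": overwrite the file with the new tip's id.
    ref.write_bytes((new_tip + "\n").encode())
    current = ref.read_text().strip()
```

After the second write the file names the new tip, and nothing about the old commit had to be copied or changed.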

HEAD: The Current Position Indicator

Definition: HEAD is a special reference indicating the current checkout position in the repository.

Two Operational Modes:

1. Symbolic Reference (Normal Mode):

# Contents of .git/HEAD
ref: refs/heads/main

HEAD points to a branch reference, which points to a commit. Creating new commits moves the branch reference forward, and HEAD follows automatically.

2. Detached HEAD State:

# Contents of .git/HEAD when detached
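4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e

HEAD points directly to a commit SHA-1, not a branch. New commits advance only HEAD, so no branch reference moves. Once you check out something else, those commits are reachable only through the reflog, which is why they are often called “orphaned” or dangling commits.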

Practical Implications:

Normal Operation:

# On branch main
git commit -m "Add feature"
# Result: main branch moves forward, HEAD follows

Detached HEAD:

git checkout 4d3e2f1a0b
# HEAD now points directly to this commit
git commit -m "Experimental change"
# Result: New commit created but no branch references it
# Must create branch to preserve: git branch experiment-branch
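The two modes differ only in what the HEAD file contains. The following sketch models the files under .git/ as a plain dictionary; resolve_head and commit are hypothetical helpers, and the short IDs are placeholders.

```python
def resolve_head(files):
    """files maps paths under .git/ to their text contents.

    Returns (commit_id, ref_name); ref_name is None when HEAD is detached.
    """
    head = files["HEAD"].strip()
    if head.startswith("ref: "):
        ref = head[len("ref: "):]          # e.g. refs/heads/main
        return files[ref].strip(), ref
    return head, None                      # raw commit id: detached HEAD


def commit(files, new_id):
    """Advance whatever HEAD designates to new_id."""
    head = files["HEAD"].strip()
    if head.startswith("ref: "):
        files[head[len("ref: "):]] = new_id + "\n"  # the branch moves
    else:
        files["HEAD"] = new_id + "\n"               # only HEAD moves


attached = {"HEAD": "ref: refs/heads/main\n", "refs/heads/main": "aaa111\n"}
detached = {"HEAD": "aaa111\n", "refs/heads/main": "aaa111\n"}
commit(attached, "bbb222")
commit(detached, "bbb222")
```

In the attached case main now names bbb222; in the detached case main still names aaa111 and only HEAD knows about the new commit.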

Tag References: Immutable Markers

Lightweight Tags: Simple references to commits, functionally identical to branches but conventionally immutable.

# Storage: .git/refs/tags/<tag-name>
# Contents: Single commit SHA-1

Annotated Tags: Full tag objects with their own metadata (tagger, date, message). They can optionally be cryptographically signed (git tag -s) for release verification.

git tag -a v1.0.0 -m "Release version 1.0.0"
# Creates tag object with metadata, pointing to commit
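An annotated tag is stored as an object of its own, and the file under refs/tags/ then holds the tag object's ID rather than the commit's. The sketch below assumes the usual tag payload layout (object, type, tag, tagger, blank line, message); tag_object is a hypothetical helper and both the target ID and the tagger identity are placeholders.

```python
import hashlib


def tag_object(target, name, tagger, message):
    """Build an annotated tag's payload and return (tag_id, payload)."""
    body = (
        f"object {target}\n"
        f"type commit\n"
        f"tag {name}\n"
        f"tagger {tagger}\n"
        f"\n"
        f"{message}\n"
    ).encode()
    tag_id = hashlib.sha1(b"tag %d\0" % len(body) + body).hexdigest()
    return tag_id, body


target = hashlib.sha1(b"stand-in commit").hexdigest()  # placeholder commit id
tagger = "Jane Developer <jane@example.com> 1704067200 +0000"  # placeholder
tag_id, payload = tag_object(target, "v1.0.0", tagger, "Release version 1.0.0")
```

A lightweight tag skips all of this and stores the commit ID directly in the reference file.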

Use Case Distinction:

  • Lightweight tags: Bookmarks for internal reference
  • Annotated tags: Official releases requiring verification and documentation

Why This Architecture Matters: Practical Implications

Understanding Git’s object model and reference system explains otherwise mysterious behaviors.

Implication 1: Identical Files Cost No Extra Storage

Scenario: You have config.json identical across 50 branches.

Naive copy-per-branch storage: 50 copies stored
Git: One blob object, referenced by the tree of every branch (where the surrounding directory is identical too, the tree object itself is shared)

Verification:

# Create two files with identical content
echo "test" > file1.txt
echo "test" > file2.txt

git add file1.txt file2.txt
git ls-files -s
# Output shows same blob hash for both files:
# 100644 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 0	file1.txt
# 100644 9daeafb9864cf43055ae93beb0afd6c7d144bfa4 0	file2.txt

Implication 2: Branch Operations Are Instantaneous

Switching branches requires:

  1. Update .git/HEAD to point to new branch (update one file)
  2. Update the index and working tree to match the new branch’s tree (filesystem operations)

No object data is copied; all objects already exist in .git/objects/.

Performance Characteristic: Branch switching time depends on how many files differ between the two branch tips, not on repository history size.
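The working-tree update in step 2 can be thought of as a comparison between two path-to-blob maps. This is a simplified sketch: checkout_plan is a hypothetical helper, trees are flattened into dictionaries, and the short IDs are placeholders.

```python
def checkout_plan(current, target):
    """Each argument maps path -> blob id for one commit's flattened tree."""
    # Files whose blob id differs (or that are new) must be written.
    write = sorted(p for p, oid in target.items() if current.get(p) != oid)
    # Files absent from the target tree must be removed.
    delete = sorted(p for p in current if p not in target)
    return write, delete


main = {"README.md": "8d0e41", "main.py": "5f4f6a", "src/utils.py": "4d5e6f"}
feature = {"README.md": "8d0e41", "main.py": "77aa10", "src/auth.py": "c0ffee"}

write, delete = checkout_plan(main, feature)
print(write)   # ['main.py', 'src/auth.py']
print(delete)  # ['src/utils.py']
```

README.md has the same blob ID on both sides, so it is never touched. The work is proportional to what differs, not to how long the history is.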


Implication 3: History Is Immutable (By Design)

Cannot Modify a Commit: Changing any aspect of a commit (message, content, metadata) changes its SHA-1 hash, creating a new commit.

“Amending” commits actually creates new commits:

git commit -m "Initial message"
# Creates commit abc123...

git commit --amend -m "Updated message"
# Creates new commit def456... with different hash
# Original commit abc123... still exists (unreferenced, so it may eventually be garbage collected)

Rebase operations create entirely new commit chains:

# Before rebase
A - B - C (feature-branch)

# After rebase onto main
A' - B' - C' (feature-branch)
# A', B', C' are new commits with different SHAs
# Original A, B, C still exist until garbage collected

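Security Implication: Because each commit’s hash covers its tree, its message, and its parents’ hashes, altering any past commit changes the hash of that commit and of every descendant. History can still be rewritten, but not silently: anyone holding the original hashes (another clone, a signed tag, a CI log) will see the mismatch.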
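The knock-on effect of changing a parent can be shown with a toy hash chain. This is a deliberately reduced model, not Git's commit format: commit_id here hashes only a parent ID and a label standing in for the commit's change.

```python
import hashlib


def commit_id(parent, change):
    """Toy commit id: a hash over the parent's id and this commit's change."""
    return hashlib.sha1(f"{parent}\n{change}".encode()).hexdigest()


def build_chain(base, changes):
    """Apply changes one after another on top of base; return the new ids."""
    ids, parent = [], base
    for change in changes:
        parent = commit_id(parent, change)
        ids.append(parent)
    return ids


changes = ["A", "B", "C"]
old_base = commit_id("", "fork point")
new_base = commit_id(old_base, "new work on main")

original = build_chain(old_base, changes)  # A, B, C
rebased = build_chain(new_base, changes)   # A', B', C'
```

The same three changes produce three entirely different IDs once the base differs, which is all a rebase does from the object model's point of view.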


Implication 4: Corruption Detection Is Automatic

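Because an object’s name is the hash of its contents, Git can detect damage by rehashing. git fsck does this for every object, and commands that read a damaged object typically fail rather than return bad data:

# If .git/objects/ab/cdef123... is corrupted:
git cat-file -p abcdef123
# Git reports an error such as a hash mismatch or a corrupt loose object
# (the exact wording varies between Git versions)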

Corruption Sources:

  • Disk errors
  • Filesystem bugs
  • Improper manual modification of .git/

Recovery Strategy:

  1. Identify corrupted object via hash mismatch
  2. Attempt recovery from remote repository
  3. Use git fsck to identify all affected references
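The check behind step 1 is small enough to sketch. The code below assumes the loose-object format (zlib-compressed header plus content) and uses hypothetical helper names; it is a model of the idea, not of git fsck itself.

```python
import hashlib
import zlib


def make_loose(data: bytes):
    """Return (object id, compressed bytes) for a blob."""
    store = b"blob %d\0" % len(data) + data
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def verify(expected_id: str, compressed: bytes) -> bool:
    """True when the stored bytes still hash to the id they are filed under."""
    try:
        store = zlib.decompress(compressed)
    except zlib.error:
        return False  # damaged compression stream
    return hashlib.sha1(store).hexdigest() == expected_id


oid, good = make_loose(b"important data\n")
# Simulate tampering: a valid zlib stream holding different contents.
tampered = zlib.compress(b"blob 15\0important dat4\n")
# Simulate a disk error: a truncated file.
truncated = good[:5]
```

verify accepts the intact object and rejects both damaged variants, one by hash mismatch and one because decompression fails.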

Advanced Architectural Details

Pack Files: Storage Optimization

Problem: Storing every version of every file as complete objects is inefficient for large repositories.

Solution: Git periodically compresses objects into pack files using delta compression.

Pack File Structure:

  • One version of an object is stored in full (zlib-compressed)
  • Similar objects are stored as deltas against it; Git picks delta bases by similarity heuristics, not strictly by age
  • Dramatically reduces storage for files with minor changes

When Packing Occurs:

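  • git gc (garbage collection), which repacks loose objects
  • git push and git fetch (objects travel over the network as a pack)
  • git gc --auto, which some commands run automatically once enough loose objects accumulate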

Technical Note: Pack files are implementation details—the logical object model remains unchanged. Git transparently unpacks objects when accessed.
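The idea of storing a delta instead of a full copy can be illustrated with Python's difflib. This is not Git's delta format or its base-selection logic; it is a line-based sketch with hypothetical helper names that encodes a target as "copy these lines from the base" and "insert these new lines".

```python
import difflib


def make_delta(base_lines, target_lines):
    """Encode target as copy/insert instructions against base."""
    matcher = difflib.SequenceMatcher(a=base_lines, b=target_lines, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))                 # reuse base lines
        elif j2 > j1:
            delta.append(("insert", target_lines[j1:j2]))  # new material only
    return delta


def apply_delta(base_lines, delta):
    """Rebuild the target from the base and the delta instructions."""
    out = []
    for op in delta:
        if op[0] == "copy":
            out.extend(base_lines[op[1]:op[2]])
        else:
            out.extend(op[1])
    return out


base = [f"line {n}\n" for n in range(100)]
target = base[:50] + ["an inserted line\n"] + base[50:]
delta = make_delta(base, target)
inserted = sum(len(op[1]) for op in delta if op[0] == "insert")
```

For a 101-line target, the delta carries a single new line plus two copy instructions, and apply_delta rebuilds the target exactly.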


Loose Objects vs. Packed Objects

Loose Objects: Individual files in .git/objects/, organized by first two characters of SHA-1:

.git/objects/
├── 4d/
│   └── 3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a8b7c6d5e
└── 8d/
    └── 0e41234f24b6da002d962a26c2495ea16a425f

Packed Objects: Multiple objects compressed into .git/objects/pack/:

.git/objects/pack/
├── pack-abc123def456.idx (index for quick lookup)
└── pack-abc123def456.pack (compressed objects)

Performance Trade-off:

  • Loose: Fast write, slower for large numbers of objects
  • Packed: Slower write (compression overhead), much faster read at scale

The Reflog: Safety Net for “Deleted” Commits

Problem: Commits not referenced by any branch appear lost.

Solution: Git maintains a reference log (reflog) recording every position that HEAD, and each branch, has pointed to.

Storage: .git/logs/

Example:

git reflog
# Output:
# abc123 HEAD@{0}: commit: Add feature
# def456 HEAD@{1}: checkout: moving from main to feature
# 9a8b7c HEAD@{2}: commit: Fix bug

Recovery Use Case:

# Accidentally reset too far
git reset --hard HEAD~5

# Recover using reflog
git reflog
# Identify lost commit SHA
git reset --hard abc123

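Retention: By default, reflog entries expire after 90 days, or after 30 days when they point to commits no longer reachable from the current branch tip. Once no reference or reflog entry points to a commit, garbage collection can remove it.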
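The recovery above works because the reflog is an append-only journal of HEAD movements. The following toy model shows the mechanism; the Repo class, its methods, and the short IDs are all hypothetical, and expiry is left out.

```python
class Repo:
    """Toy model: HEAD plus a log of every value HEAD has held."""

    def __init__(self, start):
        self.head = start
        self.reflog = [(start, "initial")]  # newest entry is last

    def move_head(self, new_id, reason):
        self.head = new_id
        self.reflog.append((new_id, reason))

    def at(self, n):
        """In the spirit of HEAD@{n}: where HEAD was n moves ago."""
        return self.reflog[-1 - n][0]


repo = Repo("a1")
repo.move_head("b2", "commit: Fix bug")
repo.move_head("c3", "commit: Add feature")
repo.move_head("a1", "reset: moving to HEAD~2")  # the "accident"

lost = repo.at(1)                                 # c3 is still recorded
repo.move_head(lost, f"reset: moving to {lost}")  # recovery
```

The reset moved HEAD but could not erase the journal entry for c3, so the commit stays findable until that entry expires.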


Mental Model: Git as a Directed Acyclic Graph (DAG)

Git’s commit history forms a DAG where:

  • Nodes are commits (each with unique SHA-1)
  • Edges are parent relationships (pointing backward in time)
  • No cycles can exist (you can’t be your own ancestor)

Visualization:

        E---F (feature-branch)
       /
  A---B---C---D (main)
       \
        G---H (hotfix-branch)

Graph Properties:

  • Merge commits: Multiple parents (e.g., merging feature into main creates commit with 2 parents)
  • Branch tips: Nodes with names (references) pointing to them
  • Unreachable commits: Nodes with no path from any reference (eventually garbage collected)

Traversal Operations:

  • git log: Walk backward from HEAD following parent links
  • git rebase: Replay commits on a different base, creating new node chain
  • git merge: Create new node with multiple parents

This graph structure explains:

  • Why finding common ancestors is fast (graph traversal algorithm)
  • How Git determines what to push/pull (compare graph positions)
  • Why merge conflicts occur (divergent paths from common ancestor)
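The diagram above can be encoded as a parent map, and finding a merge base then becomes plain graph traversal. This is a straightforward sketch of the idea with hypothetical helper names, not Git's optimized algorithm.

```python
# child -> list of parents, matching the diagram above
PARENTS = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"],
    "E": ["B"], "F": ["E"],
    "G": ["B"], "H": ["G"],
}


def ancestors(commit):
    """All commits reachable from `commit` via parent links, itself included."""
    seen, stack = set(), [commit]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS[node])
    return seen


def merge_bases(a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(a) & ancestors(b)
    return {c for c in common
            if not any(c != other and c in ancestors(other) for other in common)}


def log(tip):
    """First-parent walk backward from tip, like a linear `git log`."""
    order = []
    while tip is not None:
        order.append(tip)
        tip = PARENTS[tip][0] if PARENTS[tip] else None
    return order
```

For the tips F (feature-branch) and D (main), merge_bases returns {"B"}: the commit where the two lines of work diverged, and the ancestor a three-way merge would start from.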

Practical Application: Understanding Command Behavior

Why git checkout Was Confusing

Problem: git checkout performed two conceptually different operations:

  1. Switch branches (move HEAD)
     git checkout main  # Move HEAD to point to main branch
  2. Restore files (update working tree)
     git checkout -- file.txt  # Restore file from the index (staged version)

Solution: Git 2.23+ split these:

git switch main        # Branch switching only
git restore file.txt   # File restoration only

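Why this split works: Each command now has one job. switch changes which branch HEAD refers to (and updates the index and working tree to match), while restore rewrites files in the working tree or index without moving HEAD. git checkout itself remains available.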


Why Branching Is “Free”

Creating 1000 branches:

for i in {1..1000}; do
  git branch branch-$i
done

Storage cost: ~41 KB of reference data (41 bytes × 1000 branches), although each small file typically occupies a full filesystem block on disk
Time complexity: O(1) per branch (write single reference file)

No objects are copied—branches are merely pointers into existing history.


Why Commits Are Immutable

Attempting to “modify” a commit:

# Original commit
git commit -m "Add feature"  # SHA: abc123

# "Modify" commit message
git commit --amend -m "Add user authentication feature"  # SHA: def456

What actually happened:

  1. Created new commit object with:
    • Same tree (same files)
    • Same parent
    • Different message → Different SHA
  2. Updated branch reference to point to new commit
  3. Original commit abc123 now unreferenced (but still exists until GC)

This explains:

  • Why force-push is required after amending published commits
  • Why rewriting history changes all subsequent commit SHAs
  • Why recovering “lost” commits via reflog works

Debugging with Object Model Knowledge

Investigating Repository State

# View current HEAD commit
git cat-file -p HEAD
# Output: commit object contents

# View tree structure
git ls-tree HEAD
# Output: tree entries with blob/tree SHAs

# View specific blob content
git cat-file -p <blob-sha>
# Output: file contents

# Verify object integrity
git fsck --full
# Checks all objects for corruption and connectivity

Understanding Merge Conflicts

Merge conflict source: Git attempts three-way merge:

  1. Common ancestor commit (merge base)
  2. Your changes (current branch tip)
  3. Their changes (merging branch tip)

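Conflict occurs when: The same region of a file was modified differently in versions 2 and 3.

Resolution strategy: Git writes both sides into the file, separated by conflict markers. The section after <<<<<<< HEAD is your version; the section before >>>>>>> feature-branch is theirs. Setting merge.conflictStyle to diff3 adds the common ancestor’s version as a third section.

<<<<<<< HEAD
implementation A
=======
implementation B
>>>>>>> feature-branch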

Manual resolution: Choose implementation, remove conflict markers, stage file.
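The decision rule behind a three-way merge fits in a few lines. The sketch below is heavily simplified: merge3 is a hypothetical helper that assumes all three versions have the same number of lines and compares them position by position, whereas Git first aligns the versions with a diff.

```python
def merge3(base, ours, theirs):
    """Line-by-line three-way merge for versions with equal line counts."""
    merged, conflicts = [], 0
    for b, o, t in zip(base, ours, theirs):
        if o == t:      # both sides agree (or neither changed)
            merged.append(o)
        elif o == b:    # only theirs changed this line
            merged.append(t)
        elif t == b:    # only ours changed this line
            merged.append(o)
        else:           # both changed it differently: conflict
            conflicts += 1
            merged += ["<<<<<<< ours", o, "=======", t, ">>>>>>> theirs"]
    return merged, conflicts


base = ["title", "body", "footer"]
ours = ["Title", "body", "footer"]
theirs = ["title", "body", "Footer"]
clean, n_clean = merge3(base, ours, theirs)

theirs_clash = ["TITLE", "body", "footer"]
clashed, n_clash = merge3(base, ours, theirs_clash)
```

The first merge combines both edits without conflict because they touch different lines. The second conflicts on the first line, since both sides changed it and the base cannot say which change should win.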


Summary: Architectural Principles

Git’s design centers on several key principles:

  1. Content-addressable storage: Every piece of data identified by its content hash
  2. Snapshot-based versioning: Commits represent complete repository states, not diffs
  3. Immutable history: Objects never change once created; operations create new objects
  4. Lightweight references: Branches and tags are pointers, not data containers
  5. Distributed by design: Every full clone contains the complete object database (shallow and partial clones are deliberate exceptions)

Practical Outcome: Understanding these principles transforms Git from “magic commands” to predictable, logical operations on a well-designed data structure.

Further Exploration: With this foundation, advanced operations like interactive rebase, cherry-picking, and bisect become applications of graph traversal and object manipulation rather than mysterious incantations.

