Filter Branch Operations: Comprehensive History Rewriting

Filter operations represent Git’s most powerful and potentially destructive capability—rewriting entire repository history across all branches, tags, and commits. Unlike interactive rebase or commit amendments that modify specific commit sequences, filter operations enable systematic transformation of every commit in a repository: removing files, extracting subdirectories, modifying author information, or applying arbitrary transformations to repository structure.

These operations prove essential when repositories contain sensitive data committed by mistake, when extracting components into separate repositories, or when normalizing metadata across history. However, filter operations require careful planning and team coordination—they fundamentally alter commit identities, forcing all team members to migrate to the rewritten history.

Critical Safety Warning

Filter operations rewrite history comprehensively. Every commit SHA will change. All team members must abandon their local clones and re-clone the repository. Use filter operations only when absolutely necessary and with full team awareness.

Architectural Foundation: Two-Generation Tools

Deprecated: git filter-branch

Git’s original filter-branch command, while powerful, suffers from performance limitations and complexity:

# Old approach (DEPRECATED - DO NOT USE)
git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch path/to/file' \
  --prune-empty --tag-name-filter cat -- --all

Problems with filter-branch:

  • Extremely slow: Shells out for each commit operation
  • Memory intensive: Can exhaust memory on large repositories
  • Complex syntax: Error-prone command construction
  • Limited safety checks: Easy to accidentally destroy repository
  • Officially deprecated: Git project recommends against its use

Performance Characteristic: O(n × m) where n = commits and m = operations per commit. For 10,000 commits, can take hours.

Modern Tool: git-filter-repo

Git-filter-repo is a third-party tool that replaced filter-branch as the recommended approach:

Installation:

# macOS
brew install git-filter-repo

# Linux (Debian/Ubuntu)
sudo apt-get install git-filter-repo

# Python pip (cross-platform)
pip install git-filter-repo

# Manual installation
curl -o git-filter-repo https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo
chmod +x git-filter-repo
sudo mv git-filter-repo /usr/local/bin/
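Whichever installation route is used, it can help to confirm that the tools are actually on PATH before starting a rewrite. The following is a minimal sketch; the function name require_tools is hypothetical and not part of Git or filter-repo.

```shell
# Return 0 only if every named command is available on PATH.
# Missing commands are reported on stderr.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: require_tools git git-filter-repo || exit 1
```

Checking for the git-filter-repo executable by name assumes it was installed as a standalone script on PATH, which is what the methods above do.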

Advantages of filter-repo:

  • Fast: 10-100x faster than filter-branch
  • Safe: Built-in checks prevent common mistakes
  • Simple: Intuitive command structure
  • Feature-rich: Extensive transformation capabilities
  • Well-documented: Comprehensive manual and examples
  • Officially recommended: Git project endorses it

Performance Characteristic: O(n) where n = commits. For 10,000 commits, typically completes in seconds to minutes.

Conceptual Model: Repository Transformation Pipeline

Filter operations follow a consistent pattern:

Original Repo → Clone/Backup → Filter Operation → Verification → Force Push → Team Migration

Critical Steps:

  1. Clone: Create fresh clone for filtering (never filter production clone)
  2. Filter: Apply transformations
  3. Verify: Confirm results match intentions
  4. Force Push: Replace remote history
  5. Migrate: All team members must re-clone

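Why Fresh Clone?: Filter-repo refuses to run (unless given --force) on a repository that does not look like a fresh clone, so an irreversible rewrite cannot destroy unpushed work or the only copy of the history. After filtering it also removes the origin remote, which prevents an accidental push of filtered history before verification.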

Common Use Case 1: Removing Sensitive Data

The most frequent filter operation: removing accidentally committed secrets, passwords, or large files.

Scenario: Secrets Committed to Repository

# Accidentally committed production credentials
git log --all --full-history -- config/secrets.yml
# Shows: commit abc1234 "Add production config"
# ERROR: Contains production database password!

Solution with filter-repo:

# Step 1: Fresh clone
cd /tmp
git clone https://github.com/user/repo.git repo-cleanup
cd repo-cleanup

# Step 2: Keep the origin remote for now
# (filter-repo expects a fresh clone with its single "origin" remote,
#  and removes that remote itself once filtering succeeds)

# Step 3: Remove file from all history
git filter-repo --path config/secrets.yml --invert-paths

# Step 4: Verify file is gone
git log --all --full-history -- config/secrets.yml
# Output: (empty - file never existed)

# Step 5: Force push to rewrite remote
git remote add origin https://github.com/user/repo.git
git push origin --force --all
git push origin --force --tags

# Step 6: Team notification
# CRITICAL: All team members must:
# rm -rf local-repo
# git clone https://github.com/user/repo.git
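The verification in Step 4 can be turned into a number that is easy to check in a script. This is a sketch only; the function name commits_touching_path is an assumption, and it wraps the same git log invocation used above.

```shell
# Print how many commits reachable from any ref touch the given path.
# 0 means the path no longer appears anywhere in the history.
commits_touching_path() {
  git log --all --full-history --format=%H -- "$1" | wc -l | tr -d ' '
}

# Example: [ "$(commits_touching_path config/secrets.yml)" = "0" ] || echo "still present"
```

A count of 0 only shows that the path is gone. The same secret may still exist under a different file name, so a content search is a useful second check.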

What filter-repo does:

  • Walks through every commit in repository
  • Removes specified file from each commit’s tree
  • Removes now-empty commits (default behavior)
  • Updates all references (branches, tags)
  • Rewrites commit SHAs throughout history

Multiple File Removal

# Remove multiple specific files
# (--invert-paths is a single switch that applies to every listed path)
git filter-repo --invert-paths \
  --path secrets.yml \
  --path .env \
  --path config/prod.key

# Remove all files matching pattern
git filter-repo --path-glob '*.key' --invert-paths

# Remove entire directory
git filter-repo --path sensitive-data/ --invert-paths

# Remove files based on regex
git filter-repo --path-regex '^.*\.pem$' --invert-paths

Removing Large Binary Files

# Find large files in history
git rev-list --objects --all | \
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
  sed -n 's/^blob //p' | \
  sort --numeric-sort --key=2 | \
  tail -20

# Output shows:
# abc123... 157286400 videos/demo.mp4
# def456... 89432100 datasets/training.zip

# Remove large files
git filter-repo --invert-paths \
  --path videos/demo.mp4 \
  --path datasets/training.zip
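The byte counts in the listing above are hard to scan. A small helper can convert them to MiB; this is a sketch, and the function name sizes_in_mib is hypothetical. It expects the "sha bytes path" lines produced by the pipeline above.

```shell
# Read "<sha> <bytes> <path>" lines on stdin; print "<MiB> MiB  <path>".
sizes_in_mib() {
  awk '{
    size = $2
    # Rebuild the path from field 3 onward so paths with single spaces survive.
    path = $3
    for (i = 4; i <= NF; i++) path = path " " $i
    printf "%.1f MiB  %s\n", size / 1048576, path
  }'
}
```

Appending `| sizes_in_mib` to the end of the listing pipeline would print the same 20 largest blobs with readable sizes.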

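Post-Removal: filter-repo expires reflogs and repacks when it finishes. If you used another tool, or the repository is still large, run garbage collection manually to reclaim space: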

git reflog expire --expire=now --all
git gc --prune=now --aggressive

Common Use Case 2: Extracting Subdirectories

Creating new repository from subdirectory while preserving history.

Scenario: Monorepo Decomposition

# Original structure:
repo/
├── backend/     # Want to extract this
├── frontend/
└── shared/

# Goal: Create new repo with only backend/ as root

Solution:

# Step 1: Clone for extraction
git clone https://github.com/user/monorepo.git backend-repo
cd backend-repo

# Step 2: Leave the origin remote in place
# (filter-repo expects a fresh clone and removes origin itself after filtering)

# Step 3: Extract subdirectory, making it root
git filter-repo --path backend/ --path-rename backend/:

# Result:
# backend/ → (root)
# All other directories removed
# All commits touching backend/ preserved with history

# Step 4: Verify
git log --oneline  # Shows only commits affecting backend/
ls -la  # Shows backend/ contents at root level

# Step 5: Push to new repository
git remote add origin https://github.com/user/backend-repo.git
git push origin --all
git push origin --tags

What Happened:

  1. --path backend/ - Keep only commits touching backend/
  2. --path-rename backend/: - Move backend/ contents to root
  3. All commits that only modified frontend/ or shared/ - removed
  4. Commits that modified backend/ AND other areas - kept, but non-backend changes removed

Multiple Directory Extraction

# Extract backend and shared libraries only
git filter-repo \
  --path backend/ \
  --path lib/shared/ \
  --path-rename backend/:app/ \
  --path-rename lib/shared/:shared/

# Result:
# app/         (was backend/)
# shared/      (was lib/shared/)

Preserving Specific Files Alongside Directory

# Extract backend/ and keep root-level config
git filter-repo \
  --path backend/ \
  --path README.md \
  --path LICENSE \
  --path-rename backend/:

Common Use Case 3: Modifying Author Information

Correcting author names and emails throughout history.

Scenario: Wrong Email in Early Commits

# Early commits used personal email, need to change to company email
git log --pretty=format:"%an <%ae>" | sort -u
# Shows:
# John Doe <john@personalemail.com>  # Should be john.doe@company.com
# Jane Smith <jane.smith@company.com>

Solution using a mailmap file:

# Step 1: Create mailmap file
cat > mailmap.txt << 'EOF'
John Doe <john.doe@company.com> <john@personalemail.com>
EOF

# Step 2: Apply with filter-repo
git filter-repo --mailmap mailmap.txt

# Step 3: Verify
git log --pretty=format:"%an <%ae>" | sort -u
# Shows:
# John Doe <john.doe@company.com>  # Fixed!
# Jane Smith <jane.smith@company.com>

Mailmap Format:

Correct Name <correct@example.com> <wrong@example.com>
Correct Name <correct@example.com> Wrong Name <wrong@example.com>
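When many identities need fixing, the mailmap file can be generated from a simple list instead of being typed by hand. This is a sketch under stated assumptions: the function name build_mailmap, the three-column input layout, and the example addresses are all hypothetical.

```shell
# Read "old_email new_email Full Name" lines on stdin and print mailmap
# entries in the "Correct Name <correct> <old>" form.
build_mailmap() {
  while read -r old_email new_email name; do
    # Skip blank lines and comment lines.
    case $old_email in ''|'#'*) continue ;; esac
    printf '%s <%s> <%s>\n' "$name" "$new_email" "$old_email"
  done
}

# Example: build_mailmap < identities.txt > mailmap.txt
```

Everything after the second column is treated as the name, so names containing spaces work without quoting.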

Using Callback for Complex Transformations

For transformations beyond simple mailmap:

# Callback bodies are Python snippets passed on the command line.
# Values arrive as bytes and must be returned as bytes.
git filter-repo \
  --email-callback '
    # Move personal and legacy addresses to the current domains
    email = email.replace(b"personalemail.com", b"company.com")
    return email.replace(b"@old-domain.com", b"@new-domain.com")
  ' \
  --name-callback '
    # Standardize capitalization (note: title() also alters names like "McDonald")
    return name.decode("utf-8").title().encode("utf-8")
  '

Common Use Case 4: Repository Splitting

Splitting repository into multiple independent repositories.

Scenario: Extract Multiple Components

# Original monorepo:
monorepo/
├── api/
├── web-client/
└── mobile-app/

# Goal: Three separate repositories

Strategy: Create three filtered clones:

# Component 1: API
git clone monorepo.git api-repo
cd api-repo
# (origin stays in place; filter-repo removes it after filtering)
git filter-repo --path api/ --path-rename api/:
git remote add origin https://github.com/org/api-repo.git
git push origin --force --all

# Component 2: Web Client
cd ..
git clone monorepo.git web-client-repo
cd web-client-repo
git filter-repo --path web-client/ --path-rename web-client/:
git remote add origin https://github.com/org/web-client-repo.git
git push origin --force --all

# Component 3: Mobile App
cd ..
git clone monorepo.git mobile-app-repo
cd mobile-app-repo
git filter-repo --path mobile-app/ --path-rename mobile-app/:
git remote add origin https://github.com/org/mobile-app-repo.git
git push origin --force --all

Advanced Techniques: Custom Transformations

Path Renaming and Reorganization

# Flatten nested structure
git filter-repo \
  --path-rename src/main/java/:src/ \
  --path-rename src/test/java/:tests/

# Result:
# src/main/java/com/example/App.java → src/com/example/App.java
# src/test/java/com/example/AppTest.java → tests/com/example/AppTest.java

Selective History Preservation

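# Keep only commits after specific date
# (author_date is bytes of the form b"<unix-timestamp> <timezone>",
#  so compare the timestamp as an integer; inspect the result before pushing)
git filter-repo --refs HEAD --commit-callback '
  if int(commit.author_date.split()[0]) < 1609459200:  # Jan 1, 2021 UTC
    commit.skip()
'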

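# filter-repo has no built-in flags for selecting commits by author;
# a commit callback can test commit.author_email (bytes) instead.

# Keep only commits by specific authors
git filter-repo --commit-callback '
  if not commit.author_email.endswith(b"@company.com"):
    commit.skip()
'

# Remove commits by specific author
git filter-repo --commit-callback '
  if commit.author_email.startswith(b"bot@"):
    commit.skip()
'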

Complex Content Transformation

# Replace text in all files
git filter-repo --replace-text replacements.txt

# replacements.txt format:
# OLD_TEXT==>NEW_TEXT
# regex:old_pattern==>replacement

# Example replacements.txt:
api.old-domain.com==>api.new-domain.com
regex:SECRET_KEY.*==>SECRET_KEY=<redacted>
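A replacements file can also be generated from a list of known secret values. The sketch below writes each value as a literal rule so that no character in it is read as a pattern. The function name build_replacements and the `<redacted>` replacement text are assumptions made for illustration.

```shell
# Read one literal secret per line on stdin; print one --replace-text rule
# per secret in the form "literal:<secret>==><redacted>".
build_replacements() {
  while IFS= read -r secret; do
    # Ignore empty lines so they do not become empty rules.
    [ -n "$secret" ] || continue
    printf 'literal:%s==>%s\n' "$secret" '<redacted>'
  done
}

# Example: build_replacements < known-secrets.txt > replacements.txt
```

Values that themselves contain the `==>` separator would need separate handling. This sketch does not cover that case.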

Analyzing Before Filtering

# Generate repository analysis
git filter-repo --analyze

# Creates .git/filter-repo/analysis/ with:
# - blob-shas-and-paths.txt (all files ever in repo)
# - path-all-sizes.txt (file sizes)
# - path-deleted-sizes.txt (deleted files with sizes)
# - renames.txt (file renames)
# - directories-all-sizes.txt (directory sizes)
# - extensions-all-sizes.txt (file types by size)

# Review to identify:
# - Largest files for potential removal
# - Sensitive file paths
# - Unnecessary directories

Analysis-Driven Cleanup:

# Step 1: Analyze
git filter-repo --analyze

# Step 2: Review largest files
# (the report is already sorted largest-first, below two header lines)
head -n 22 .git/filter-repo/analysis/path-all-sizes.txt

# Step 3: Create removal list
# (files over 10MB or sensitive directories; entries are literal paths
#  unless prefixed with glob: or regex:)
cat > remove-paths.txt << 'EOF'
large-dataset.csv
videos/
glob:*.iso
old-builds/
EOF

# Step 4: Filter based on analysis
git filter-repo --paths-from-file remove-paths.txt --invert-paths
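The removal list in Step 3 can be drafted from the analysis report rather than picked by eye. The sketch below assumes the report layout of two header lines followed by rows of unpacked size, packed size, deletion date, and path. Check the header of your own report first, since that layout is an assumption here. The function name paths_over_size is hypothetical.

```shell
# Print paths whose packed size is at least MIN_BYTES, reading rows of
# "<unpacked> <packed> <date deleted> <path>" after two header lines.
paths_over_size() {
  min_bytes=$1
  awk -v min="$min_bytes" 'NR > 2 && $2 + 0 >= min + 0 {
    # Rebuild the path from field 4 onward so single spaces survive.
    path = $4
    for (i = 5; i <= NF; i++) path = path " " $i
    print path
  }'
}

# Example (10 MiB threshold):
# paths_over_size 10485760 < .git/filter-repo/analysis/path-all-sizes.txt
```

Review the output by hand before passing it to --paths-from-file, because a large file is not necessarily an unwanted one.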

Alternative Tools: BFG Repo-Cleaner

BFG is specialized for removing large files and sensitive data—simpler than filter-repo for basic tasks but less flexible.

Installation:

# Download JAR from https://rtyley.github.io/bfg-repo-cleaner/
# Requires Java

# Or via package manager
brew install bfg  # macOS

Basic Usage:

# Clone with full history
git clone --mirror https://github.com/user/repo.git repo.git

# Remove files larger than 100MB
java -jar bfg.jar --strip-blobs-bigger-than 100M repo.git

# Remove specific file
java -jar bfg.jar --delete-files secrets.yml repo.git

# Remove files matching pattern
java -jar bfg.jar --delete-files "*.key" repo.git

# Replace text (sensitive data)
echo "PASSWORD" > passwords.txt
java -jar bfg.jar --replace-text passwords.txt repo.git

# Clean up and push
cd repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push

BFG vs filter-repo:

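| Feature              | BFG       | filter-repo |
|----------------------|-----------|-------------|
| Speed                | Very fast | Fast        |
| Ease of use          | Simple    | Moderate    |
| Flexibility          | Limited   | Extensive   |
| Directory extraction | No        | Yes         |
| Custom callbacks     | No        | Yes         |
| Path renaming        | No        | Yes         |
| Maintained           | Active    | Active      |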

When to use BFG: Simple file removal, especially large binaries or basic secret deletion.

When to use filter-repo: Complex transformations, directory extraction, author changes, path reorganization.

Team Migration Protocols

Filter operations require coordinated team migration. Without proper protocol, team members will experience conflicts and confusion.

Pre-Filter Communication

Team Notification Template:

# CRITICAL: Repository History Rewrite - Action Required

**When**: [Date/Time] **Impact**: All local clones must be deleted and re-cloned
**Why**: Removing sensitive data / Extracting component / [reason]

## What Happens

All commit SHAs will change. Your local repository will be incompatible with the
rewritten history.

## What You Must Do

### Before the rewrite:

1. Commit and push any in-progress work to a branch
2. Note branch names you're working on
3. Back up any local-only branches

### After the rewrite:

1. Delete your local clone: `rm -rf project-directory`
2. Fresh clone: `git clone https://github.com/org/repo.git`
3. Recreate your working branches from the new history
4. Cherry-pick local-only commits if needed

## Timeline

- [Time]: History rewrite begins
- [Time]: Force push completes
- [Time]: All team members should have migrated

## Questions

Contact [admin] if you have concerns or local work that needs preservation.

Post-Filter Verification

# Admin verification checklist
# 1. Verify sensitive data removed
git log --all --full-history -- path/to/secrets.yml
# (Should return nothing)

# 2. Verify repository size reduced (if removing large files)
du -sh .git
# Compare to original size

# 3. Verify branch structure intact
git branch -a
# All expected branches present

# 4. Verify tags intact
git tag
# All expected tags present

# 5. Test clone and basic operations
cd /tmp
git clone https://github.com/org/repo.git test-clone
cd test-clone
# Build, test, verify functionality

Team Member Migration Steps

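# Step 1: Save local work outside the repository
# (a stash lives inside the clone and would be deleted with it in Step 2)
cd existing-repo
git diff HEAD > ~/uncommitted-changes.patch  # Tracked, uncommitted changes only
git branch -a > ~/my-branches.txt  # Document branches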

# Step 2: Delete local repository
cd ..
rm -rf existing-repo

# Step 3: Fresh clone
git clone https://github.com/org/repo.git existing-repo
cd existing-repo

# Step 4: Recreate working state
# If you had feature branches:
git checkout -b my-feature origin/main
# Cherry-pick local commits if needed (find SHAs from backup)

# Step 5: Verify
git log --oneline  # New SHAs
git status  # Clean working tree

Handling Local-Only Commits

For commits that exist locally but weren’t pushed before the filter:

# Before migration, in old repo:
# 1. Create patch of local commits
git format-patch origin/main..HEAD -o ~/patches

# After migration, in new repo:
# 2. Apply patches
git am ~/patches/*.patch

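# Or cherry-pick from a backup of the old clone (the path below is a placeholder).
# Note: fetching the old clone copies its unfiltered objects into this repository.
git remote add old-clone /path/to/old-repo-backup
git fetch old-clone
git cherry-pick <commit-sha-from-old-repo>
git remote remove old-clone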

Security Considerations

Sensitive Data Never Fully Deleted

Critical Understanding: Filtering removes data from Git history, but:

  1. Forks: Anyone who forked before filtering still has old history
  2. Clones: All existing clones contain old history until deleted
  3. Pull Requests: PR discussions may quote sensitive data
  4. Archives: GitHub/GitLab may have archived snapshots
  5. Backups: System backups contain old repository state

After Filtering Sensitive Data:

# Essential steps:
# 1. Rotate compromised credentials immediately
# 2. Treat filtered data as if it was publicly exposed
# 3. Audit access logs for unauthorized use
# 4. Contact GitHub/GitLab support to purge caches

Audit Trail Preservation

Before filtering, preserve audit record:

# Create pre-filter audit log (tformat: ends every record with a newline,
# so the wc -l counts below are exact)
git log --all --pretty=tformat:"%H|%an|%ae|%ad|%s" > pre-filter-audit.log

# After filtering, compare
git log --all --pretty=tformat:"%H|%an|%ae|%ad|%s" > post-filter-audit.log

# Document changes
echo "Filter operation: $(date)" >> filter-log.txt
echo "Commits before: $(wc -l < pre-filter-audit.log)" >> filter-log.txt
echo "Commits after: $(wc -l < post-filter-audit.log)" >> filter-log.txt
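The before and after counts can be wrapped in a helper that also reports the difference. This is a sketch; the function name audit_summary is hypothetical. It counts records with grep -c '' so that a final record without a trailing newline is still counted.

```shell
# Print commit counts for two audit logs (one record per line) and the
# number of commits that disappeared between them.
audit_summary() {
  pre_count=$(grep -c '' "$1" || true)
  post_count=$(grep -c '' "$2" || true)
  echo "commits before: $pre_count"
  echo "commits after: $post_count"
  echo "commits removed: $((pre_count - post_count))"
}

# Example: audit_summary pre-filter-audit.log post-filter-audit.log >> filter-log.txt
```

A non-zero "commits removed" figure is expected whenever filtering leaves some commits empty, since those are pruned.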

Performance Optimization

Large Repository Strategies

For repositories with 100,000+ commits:

# Strategy 1: Partial filtering
# Filter specific branches instead of --all
git filter-repo --refs refs/heads/main refs/heads/develop

# Strategy 2: Staged filtering
# Filter recent history first, then archive old history
git filter-repo --refs HEAD~1000..HEAD  # Last 1000 commits

# Strategy 3: Resource provisioning
# filter-repo streams git fast-export output through a Python filter into
# git fast-import; these stages run as separate processes
# Ensure adequate RAM (8GB+ for large repos)

Disk Space Management

# During filtering, disk usage temporarily increases
# Original: .git/ = 1GB
# During filter: .git/ + .git/filter-repo/ = 2GB+
# After filter + GC: .git/ = reduced size

# Ensure adequate space
df -h .  # Check available space before filtering

# Clean up after filtering
git reflog expire --expire=now --all
git gc --prune=now --aggressive

Troubleshooting Common Issues

Issue: filter-repo Refuses to Run

Error: Refusing to destructively overwrite repo history

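Cause: The repository does not look like a fresh clone. Typical triggers are a reflog with more than one entry, unpacked objects, uncommitted changes, or a remote setup other than a single origin (including a clone whose origin was already removed).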

Solution:

# Fresh clone required
cd /tmp
git clone https://github.com/user/repo.git repo-filter
cd repo-filter

# Leave the origin remote in place: the fresh-clone check expects it
# (use --force only if you are certain the clone is disposable)

# Now filter-repo will run
git filter-repo --path sensitive/ --invert-paths

Issue: Some Commits Still Contain Sensitive Data

Symptom: After filtering, data still appears in some commits.

Diagnosis:

# Search all branches and tags
git log --all --full-history -p -S "sensitive-string"

# Check if path specification was too narrow
git filter-repo --analyze
# Review .git/filter-repo/analysis/blob-shas-and-paths.txt

Solution: Re-filter with broader path specification:

# Original filter (too narrow)
git filter-repo --path config/prod.yml --invert-paths

# Data also in:
# - config/staging.yml
# - backup/config/
# - old-configs/

# Broader filter needed
git filter-repo --invert-paths \
  --path-glob "*/prod.yml" \
  --path-glob "*/staging.yml" \
  --path backup/ \
  --path old-configs/

Issue: Repository Size Not Reduced

Symptom: After filtering and GC, .git directory still large.

Diagnosis:

# Verify objects were removed
git count-objects -vH

# Check if pack files still large
du -sh .git/objects/pack/

Solution: Aggressive garbage collection:

# Force aggressive GC
git reflog expire --expire=now --all
git gc --prune=now --aggressive

# Verify reflogs cleared
git reflog  # Should be minimal

# If still large, check for remaining large objects
git verify-pack -v .git/objects/pack/*.idx | \
  sort -k 3 -n | \
  tail -20
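To compare repository size before and after cleanup, the pack size reported by git count-objects -v can be extracted as a single number. This is a sketch; the function name pack_size_mib is hypothetical. It relies on the size-pack field, which Git reports in KiB.

```shell
# Read `git count-objects -v` output on stdin; print the pack size in MiB.
pack_size_mib() {
  awk '$1 == "size-pack:" { printf "%.1f\n", $2 / 1024 }'
}

# Example: git count-objects -v | pack_size_mib
```

Recording this value before filtering and again after garbage collection gives a direct measure of how much space the rewrite reclaimed.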

Issue: Lost Important Commits

Symptom: Realized after filtering that some commits should have been kept.

Recovery:

# IF you haven't pushed yet and didn't delete old clone:
# In old repo
git log --oneline  # Find commits to preserve
git format-patch <commit-range>

# In filtered repo
git am *.patch  # Apply patches

# IF you pushed and deleted old clone:
# Contact team members who might have old clone
# Or restore from backup if available

# Prevention: ALWAYS test filter on clone before pushing

Best Practices Checklist

Pre-Filter

  • Create full backup of repository
  • Test filter operation on separate clone
  • Analyze repository to identify all instances of data to remove
  • Coordinate with team about downtime
  • Document filter operation for audit trail
  • Ensure adequate disk space for operation

During Filter

  • Work on fresh clone, not production clone
  • Use --analyze first to understand what will change
  • Apply filters incrementally, verifying each step
  • Keep terminal logs of filter operation
  • Don’t interrupt filter process mid-operation

Post-Filter

  • Verify sensitive data completely removed
  • Test repository builds and functionality
  • Run garbage collection to reclaim space
  • Update documentation referencing old commit SHAs
  • Force push with --force, not --force-with-lease (intentionally rewriting)
  • Notify team immediately after push
  • Monitor team migration progress
  • Rotate compromised credentials if filtering sensitive data
  • Archive or delete backup after successful migration

Summary: When to Use Filter Operations

Use filter operations for:

  • ✅ Removing accidentally committed secrets or sensitive data
  • ✅ Extracting subdirectories into separate repositories
  • ✅ Removing large files from entire history
  • ✅ Correcting author information throughout history
  • ✅ Repository reorganization (path restructuring)
  • ✅ Cleaning up before open-sourcing private repository

Do NOT use filter operations for:

  • ❌ Removing single recent commit (use git revert or git reset)
  • ❌ Changing last commit message (use git commit --amend)
  • ❌ Reordering recent commits (use interactive rebase)
  • ❌ Cleaning up local feature branch (use rebase)
  • ❌ Removing branches (use git branch -d)

Filter operations represent Git’s most comprehensive history rewriting capability. By understanding the architectural differences between deprecated filter-branch and modern filter-repo, mastering common transformation patterns, and following rigorous safety protocols, you can confidently perform repository-wide cleanup operations while minimizing risk to your team’s workflow and ensuring sensitive data removal when necessary.