Filter Branch Operations: Comprehensive History Rewriting
Filter operations represent Git’s most powerful and potentially destructive capability—rewriting entire repository history across all branches, tags, and commits. Unlike interactive rebase or commit amendments that modify specific commit sequences, filter operations enable systematic transformation of every commit in a repository: removing files, extracting subdirectories, modifying author information, or applying arbitrary transformations to repository structure.
These operations prove essential when repositories contain sensitive data committed by mistake, when extracting components into separate repositories, or when normalizing metadata across history. However, filter operations require careful planning and team coordination—they fundamentally alter commit identities, forcing all team members to migrate to the rewritten history.
Critical Safety Warning
Filter operations rewrite history comprehensively. Every commit SHA will change. All team members must abandon their local clones and re-clone the repository. Use filter operations only when absolutely necessary and with full team awareness.
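Before any rewrite, capture a full mirror of the repository so the pre-filter state can be restored or audited. A minimal sketch (the helper name and backup naming scheme are my own, not a standard tool):

```shell
# backup_repo <source-url-or-path> <backup-dir>: create a bare mirror
# clone (every branch, tag, and note, not just HEAD) and record the
# pre-rewrite tip of every ref for the audit trail.
backup_repo() {
    git clone --mirror "$1" "$2" &&
    git -C "$2" for-each-ref \
        --format='%(refname) %(objectname)' > "$2.refs-before.txt"
}

# Example: backup_repo https://github.com/user/repo.git repo-backup.git
```

Keep the backup until every team member has successfully migrated to the rewritten history.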
Architectural Foundation: Two-Generation Tools
Deprecated: git filter-branch
Git’s original filter-branch command, while powerful, suffers from performance limitations and complexity:
# Old approach (DEPRECATED - DO NOT USE)
git filter-branch --force --index-filter \
'git rm --cached --ignore-unmatch path/to/file' \
--prune-empty --tag-name-filter cat -- --all
Problems with filter-branch:
- Extremely slow: Shells out for each commit operation
- Memory intensive: Can exhaust memory on large repositories
- Complex syntax: Error-prone command construction
- Limited safety checks: Easy to accidentally destroy repository
- Officially deprecated: Git project recommends against its use
Performance Characteristic: O(n × m) where n = commits and m = operations per commit. For 10,000 commits, can take hours.
Modern Tool: git-filter-repo
Git-filter-repo is a third-party tool that replaced filter-branch as the recommended approach:
Installation:
# macOS
brew install git-filter-repo
# Linux (Debian/Ubuntu)
sudo apt-get install git-filter-repo
# Python pip (cross-platform)
pip install git-filter-repo
# Manual installation
curl -o git-filter-repo https://raw.githubusercontent.com/newren/git-filter-repo/main/git-filter-repo
chmod +x git-filter-repo
sudo mv git-filter-repo /usr/local/bin/
Advantages of filter-repo:
- Fast: 10-100x faster than filter-branch
- Safe: Built-in checks prevent common mistakes
- Simple: Intuitive command structure
- Feature-rich: Extensive transformation capabilities
- Well-documented: Comprehensive manual and examples
- Officially recommended: Git project endorses it
Performance Characteristic: O(n) where n = commits. For 10,000 commits, typically completes in seconds to minutes.
Conceptual Model: Repository Transformation Pipeline
Filter operations follow a consistent pattern:
Original Repo → Clone/Backup → Filter Operation → Verification → Force Push → Team Migration
Critical Steps:
- Clone: Create fresh clone for filtering (never filter production clone)
- Filter: Apply transformations
- Verify: Confirm results match intentions
- Force Push: Replace remote history
- Migrate: All team members must re-clone
Why Fresh Clone?: filter-repo refuses to run unless the repository looks like a fresh clone (for example, it has no configured remote and no previous filter run), preventing accidental push of filtered history before verification.
Common Use Case 1: Removing Sensitive Data
The most frequent filter operation: removing accidentally committed secrets, passwords, or large files.
Scenario: Secrets Committed to Repository
# Accidentally committed production credentials
git log --all --full-history -- config/secrets.yml
# Shows: commit abc1234 "Add production config"
# ERROR: Contains production database password!
Solution with filter-repo:
# Step 1: Fresh clone
cd /tmp
git clone https://github.com/user/repo.git repo-cleanup
cd repo-cleanup
# Step 2: Remove remotes (filter-repo requirement)
git remote remove origin
# Step 3: Remove file from all history
git filter-repo --path config/secrets.yml --invert-paths
# Step 4: Verify file is gone
git log --all --full-history -- config/secrets.yml
# Output: (empty - file never existed)
# Step 5: Force push to rewrite remote
git remote add origin https://github.com/user/repo.git
git push origin --force --all
git push origin --force --tags
# Step 6: Team notification
# CRITICAL: All team members must:
# rm -rf local-repo
# git clone https://github.com/user/repo.git
What filter-repo does:
- Walks through every commit in repository
- Removes specified file from each commit’s tree
- Removes now-empty commits (default behavior)
- Updates all references (branches, tags)
- Rewrites commit SHAs throughout history
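Path-based verification (step 4 above) only proves the known path is gone; if the secret was ever copied into another file, a content search across every commit is the stronger check. A small helper sketch (the function name is my own):

```shell
# scan_history <string>: list every commit:path across all refs whose
# tree still contains <string>; empty output means the content is gone.
# Passes every rev as an argument, so it suits small/medium histories.
scan_history() {
    git grep -l --fixed-strings "$1" $(git rev-list --all) 2>/dev/null || true
}
```

For example, after filtering out a leaked AWS key, `scan_history 'AKIA'` should print nothing.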
Multiple File Removal
# Remove multiple specific files
git filter-repo \
--path secrets.yml --invert-paths \
--path .env --invert-paths \
--path config/prod.key --invert-paths
# Remove all files matching pattern
git filter-repo --path-glob '*.key' --invert-paths
# Remove entire directory
git filter-repo --path sensitive-data/ --invert-paths
# Remove files based on regex
git filter-repo --path-regex '^.*\.pem$' --invert-paths
Removing Large Binary Files
# Find large files in history
git rev-list --objects --all | \
git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' | \
sed -n 's/^blob //p' | \
sort --numeric-sort --key=2 | \
tail -20
# Output shows:
# abc123... 157286400 videos/demo.mp4
# def456... 89432100 datasets/training.zip
# Remove large files
git filter-repo \
--path videos/demo.mp4 --invert-paths \
--path datasets/training.zip --invert-paths
Post-Removal: Run garbage collection to reclaim space:
git reflog expire --expire=now --all
git gc --prune=now --aggressive
Common Use Case 2: Extracting Subdirectories
Creating new repository from subdirectory while preserving history.
Scenario: Monorepo Decomposition
# Original structure:
repo/
├── backend/ # Want to extract this
├── frontend/
└── shared/
# Goal: Create new repo with only backend/ as root
Solution:
# Step 1: Clone for extraction
git clone https://github.com/user/monorepo.git backend-repo
cd backend-repo
# Step 2: Remove remote
git remote remove origin
# Step 3: Extract subdirectory, making it root
git filter-repo --path backend/ --path-rename backend/:
# Result:
# backend/ → (root)
# All other directories removed
# All commits touching backend/ preserved with history
# Step 4: Verify
git log --oneline # Shows only commits affecting backend/
ls -la # Shows backend/ contents at root level
# Step 5: Push to new repository
git remote add origin https://github.com/user/backend-repo.git
git push origin --all
git push origin --tags
What Happened:
- --path backend/ - keep only commits touching backend/
- --path-rename backend/: - move backend/ contents to root
- Commits that only modified frontend/ or shared/ - removed
- Commits that modified backend/ AND other areas - kept, but non-backend changes removed
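The per-path effect of combining the two flags can be modeled in a few lines of shell (a sketch with a hypothetical helper name, not filter-repo internals):

```shell
# extract_rename <path>: mimic `--path backend/ --path-rename backend/:`
# for a single path. Prints the rewritten path, or nothing if the path
# is dropped by the filter.
extract_rename() {
    case "$1" in
        backend/*) printf '%s\n' "${1#backend/}" ;;  # keep, strip prefix
        *)         ;;                                # outside backend/: dropped
    esac
}
```

`extract_rename backend/src/app.py` prints `src/app.py`; paths outside `backend/` print nothing, mirroring how commits left with no surviving changes are pruned.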
Multiple Directory Extraction
# Extract backend and shared libraries only
git filter-repo \
--path backend/ \
--path lib/shared/ \
--path-rename backend/:app/ \
--path-rename lib/shared/:shared/
# Result:
# app/ (was backend/)
# shared/ (was lib/shared/)
Preserving Specific Files Alongside Directory
# Extract backend/ and keep root-level config
git filter-repo \
--path backend/ \
--path README.md \
--path LICENSE \
--path-rename backend/:
Common Use Case 3: Modifying Author Information
Correcting author names and emails throughout history.
Scenario: Wrong Email in Early Commits
# Early commits used personal email, need to change to company email
git log --pretty=format:"%an <%ae>" | sort -u
# Shows:
# John Doe <john@personalemail.com>   # Should be john@company.com
# Jane Smith <jane@company.com>
Solution using a mailmap:
# Step 1: Create mailmap file
cat > mailmap.txt << 'EOF'
John Doe <john@company.com> <john@personalemail.com>
EOF
# Step 2: Apply with filter-repo
git filter-repo --mailmap mailmap.txt
# Step 3: Verify
git log --pretty=format:"%an <%ae>" | sort -u
# Shows:
# John Doe <john@company.com>   # Fixed!
# Jane Smith <jane@company.com>
Mailmap Format:
Correct Name <correct@example.com> <old@example.com>
Correct Name <correct@example.com> Old Name <old@example.com>
Using Callback for Complex Transformations
For transformations beyond simple mailmap:
# filter-repo callbacks are Python snippets that receive a bytes
# value and must return bytes; pass the body inline (multi-line
# strings are fine). There is no separate script-file flag.
git filter-repo --email-callback '
    # Fix common typos and migrate domains
    email = email.replace(b"personalemail.com", b"company.com")
    email = email.replace(b"@old-domain.com", b"@new-domain.com")
    return email
' --name-callback '
    # Standardize name capitalization
    return name.title()
'
Common Use Case 4: Repository Splitting
Splitting repository into multiple independent repositories.
Scenario: Extract Multiple Components
# Original monorepo:
monorepo/
├── api/
├── web-client/
└── mobile-app/
# Goal: Three separate repositories
Strategy: Create three filtered clones:
# Component 1: API
git clone monorepo.git api-repo
cd api-repo
git remote remove origin
git filter-repo --path api/ --path-rename api/:
git remote add origin https://github.com/org/api-repo.git
git push origin --force --all
# Component 2: Web Client
cd ..
git clone monorepo.git web-client-repo
cd web-client-repo
git remote remove origin
git filter-repo --path web-client/ --path-rename web-client/:
git remote add origin https://github.com/org/web-client-repo.git
git push origin --force --all
# Component 3: Mobile App
cd ..
git clone monorepo.git mobile-app-repo
cd mobile-app-repo
git remote remove origin
git filter-repo --path mobile-app/ --path-rename mobile-app/:
git remote add origin https://github.com/org/mobile-app-repo.git
git push origin --force --all
Advanced Techniques: Custom Transformations
Path Renaming and Reorganization
# Flatten nested structure
git filter-repo \
--path-rename src/main/java/:src/ \
--path-rename src/test/java/:tests/
# Result:
# src/main/java/com/example/App.java → src/com/example/App.java
# src/test/java/com/example/AppTest.java → tests/com/example/AppTest.java
Selective History Preservation
# Keep only commits after a specific date; author_date is bytes of
# the form b"<unix-timestamp> <tz>", so parse the timestamp first
git filter-repo --commit-callback '
    if int(commit.author_date.split()[0]) < 1609459200:  # Jan 1, 2021
        commit.skip()
'
# filter-repo has no author include/exclude flags; author-based
# filtering also goes through a commit callback, e.g. dropping the
# changes introduced by bot accounts (emptied commits are pruned):
git filter-repo --commit-callback '
    if commit.author_email.startswith(b"bot@"):
        commit.file_changes = []
'
Complex Content Transformation
# Replace text in all files
git filter-repo --replace-text replacements.txt
# replacements.txt format:
# OLD_TEXT==>NEW_TEXT
# regex:old_pattern==>replacement
# Example replacements.txt:
api.old-domain.com==>api.new-domain.com
regex:SECRET_KEY.*==>SECRET_KEY=<redacted>
Analyzing Before Filtering
# Generate repository analysis
git filter-repo --analyze
# Creates .git/filter-repo/analysis/ with:
# - blob-shas-and-paths.txt (all files ever in repo)
# - path-all-sizes.txt (file sizes)
# - path-deleted-sizes.txt (deleted files with sizes)
# - renames.txt (file renames)
# - directories-all-sizes.txt (directory sizes)
# - extensions-all-sizes.txt (file types by size)
# Review to identify:
# - Largest files for potential removal
# - Sensitive file paths
# - Unnecessary directories
Analysis-Driven Cleanup:
# Step 1: Analyze
git filter-repo --analyze
# Step 2: Review largest files
sort -k3 -n .git/filter-repo/analysis/path-all-sizes.txt | tail -20
# Step 3: Create removal list
# (files over 10MB or sensitive directories)
cat > remove-paths.txt << EOF
large-dataset.csv
videos/
glob:*.iso
old-builds/
EOF
# Step 4: Filter based on analysis
git filter-repo --paths-from-file remove-paths.txt --invert-paths
Alternative Tools: BFG Repo-Cleaner
BFG is specialized for removing large files and sensitive data—simpler than filter-repo for basic tasks but less flexible.
Installation:
# Download JAR from https://rtyley.github.io/bfg-repo-cleaner/
# Requires Java
# Or via package manager
brew install bfg # macOS
Basic Usage:
# Clone with full history
git clone --mirror https://github.com/user/repo.git repo.git
# Remove files larger than 100MB
java -jar bfg.jar --strip-blobs-bigger-than 100M repo.git
# Remove specific file
java -jar bfg.jar --delete-files secrets.yml repo.git
# Remove files matching pattern
java -jar bfg.jar --delete-files "*.key" repo.git
# Replace text (sensitive data)
echo "PASSWORD" > passwords.txt
java -jar bfg.jar --replace-text passwords.txt repo.git
# Clean up and push
cd repo.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push
BFG vs filter-repo:
| Feature | BFG | filter-repo |
|---|---|---|
| Speed | Very fast | Fast |
| Ease of use | Simple | Moderate |
| Flexibility | Limited | Extensive |
| Directory extraction | No | Yes |
| Custom callbacks | No | Yes |
| Path renaming | No | Yes |
| Maintained | Active | Active |
When to use BFG: Simple file removal, especially large binaries or basic secret deletion.
When to use filter-repo: Complex transformations, directory extraction, author changes, path reorganization.
Team Migration Protocols
Filter operations require coordinated team migration. Without proper protocol, team members will experience conflicts and confusion.
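A team member can tell that their clone predates the rewrite because the force-pushed upstream shares no common ancestor with their local branch. A hedged sketch (the function name is my own; the upstream ref is a parameter):

```shell
# detect_rewrite <upstream-ref>: after fetching, if the current branch
# has no merge base with its upstream (e.g. origin/main), history was
# rewritten upstream and this clone must be deleted and re-cloned.
detect_rewrite() {
    git fetch -q origin
    if ! git merge-base HEAD "$1" >/dev/null 2>&1; then
        echo "upstream history rewritten: delete this clone and re-clone"
    fi
}
```

Run it inside the clone; any output means migration is required.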
Pre-Filter Communication
Team Notification Template:
# CRITICAL: Repository History Rewrite - Action Required
**When**: [Date/Time]
**Impact**: All local clones must be deleted and re-cloned
**Why**: Removing sensitive data / Extracting component / [reason]
## What Happens
All commit SHAs will change. Your local repository will be incompatible with the
rewritten history.
## What You Must Do
### Before the rewrite:
1. Push any uncommitted work to a branch
2. Note branch names you're working on
3. Back up any local-only branches
### After the rewrite:
1. Delete your local clone: `rm -rf project-directory`
2. Fresh clone: `git clone https://github.com/org/repo.git`
3. Recreate your working branches from the new history
4. Cherry-pick local-only commits if needed
## Timeline
- [Time]: History rewrite begins
- [Time]: Force push completes
- [Time]: All team members should have migrated
## Questions
Contact [admin] if you have concerns or local work that needs preservation.
Post-Filter Verification
# Admin verification checklist
# 1. Verify sensitive data removed
git log --all --full-history -- path/to/secrets.yml
# (Should return nothing)
# 2. Verify repository size reduced (if removing large files)
du -sh .git
# Compare to original size
# 3. Verify branch structure intact
git branch -a
# All expected branches present
# 4. Verify tags intact
git tag
# All expected tags present
# 5. Test clone and basic operations
cd /tmp
git clone https://github.com/org/repo.git test-clone
cd test-clone
# Build, test, verify functionality
Team Member Migration Steps
# Step 1: Save local work
cd existing-repo
git stash # Save uncommitted changes
git branch -a > my-branches.txt # Document branches
# Step 2: Delete local repository
cd ..
rm -rf existing-repo
# Step 3: Fresh clone
git clone https://github.com/org/repo.git existing-repo
cd existing-repo
# Step 4: Recreate working state
# If you had feature branches:
git checkout -b my-feature origin/main
# Cherry-pick local commits if needed (find SHAs from backup)
# Step 5: Verify
git log --oneline # New SHAs
git status # Clean working tree
Handling Local-Only Commits
For commits that exist locally but weren’t pushed before the filter:
# Before migration, in old repo:
# 1. Create patch of local commits
git format-patch origin/main..HEAD -o ~/patches
# After migration, in new repo:
# 2. Apply patches
git am ~/patches/*.patch
# Or cherry-pick: old SHAs don't exist in the new repo, so first
# fetch the old clone as a temporary remote
git remote add old-history /path/to/old-clone
git fetch old-history
git cherry-pick <commit-sha-from-old-repo>
git remote remove old-history   # avoid keeping pre-filter objects around
Security Considerations
Sensitive Data Never Fully Deleted
Critical Understanding: Filtering removes data from Git history, but:
- Forks: Anyone who forked before filtering still has old history
- Clones: All existing clones contain old history until deleted
- Pull Requests: PR discussions may quote sensitive data
- Archives: GitHub/GitLab may have archived snapshots
- Backups: System backups contain old repository state
After Filtering Sensitive Data:
# Essential steps:
# 1. Rotate compromised credentials immediately
# 2. Treat filtered data as if it was publicly exposed
# 3. Audit access logs for unauthorized use
# 4. Contact GitHub/GitLab support to purge caches
Audit Trail Preservation
Before filtering, preserve audit record:
# Create pre-filter audit log
git log --all --pretty=format:"%H|%an|%ae|%ad|%s" > pre-filter-audit.log
# After filtering, compare
git log --all --pretty=format:"%H|%an|%ae|%ad|%s" > post-filter-audit.log
# Document changes
echo "Filter operation: $(date)" >> filter-log.txt
echo "Commits before: $(wc -l < pre-filter-audit.log)" >> filter-log.txt
echo "Commits after: $(wc -l < post-filter-audit.log)" >> filter-log.txt
Performance Optimization
Large Repository Strategies
For repositories with 100,000+ commits:
# Strategy 1: Partial filtering
# Filter specific branches instead of --all
git filter-repo --refs refs/heads/main refs/heads/develop
# Strategy 2: Staged filtering
# Filter recent history first, then archive old history
git filter-repo --refs HEAD~1000..HEAD # Last 1000 commits
# Strategy 3: Parallel processing
# Filter-repo uses multiple cores automatically
# Ensure adequate RAM (8GB+ for large repos)
Disk Space Management
# During filtering, disk usage temporarily increases
# Original: .git/ = 1GB
# During filter: .git/ + .git/filter-repo/ = 2GB+
# After filter + GC: .git/ = reduced size
# Ensure adequate space
df -h . # Check available space before filtering
# Clean up after filtering
git reflog expire --expire=now --all
git gc --prune=now --aggressive
Troubleshooting Common Issues
Issue: filter-repo Refuses to Run
Error: Refusing to destructively overwrite repo history
Cause: Remote configured or previous filter run exists.
Solution:
# Fresh clone required
cd /tmp
git clone https://github.com/user/repo.git repo-filter
cd repo-filter
# Remove remote
git remote remove origin
# Now filter-repo will run
git filter-repo --path sensitive/ --invert-paths
Issue: Some Commits Still Contain Sensitive Data
Symptom: After filtering, data still appears in some commits.
Diagnosis:
# Search all branches and tags
git log --all --full-history -p -S "sensitive-string"
# Check if path specification was too narrow
git filter-repo --analyze
# Review .git/filter-repo/analysis/blob-shas-and-paths.txt
Solution: Re-filter with broader path specification:
# Original filter (too narrow)
git filter-repo --path config/prod.yml --invert-paths
# Data also in:
# - config/staging.yml
# - backup/config/
# - old-configs/
# Broader filter needed
git filter-repo \
--path-glob "*/prod.yml" --invert-paths \
--path-glob "*/staging.yml" --invert-paths \
--path backup/ --invert-paths \
--path old-configs/ --invert-paths
Issue: Repository Size Not Reduced
Symptom: After filtering and GC, .git directory still large.
Diagnosis:
# Verify objects were removed
git count-objects -vH
# Check if pack files still large
du -sh .git/objects/pack/
Solution: Aggressive garbage collection:
# Force aggressive GC
git reflog expire --expire=now --all
git gc --prune=now --aggressive
# Verify reflogs cleared
git reflog # Should be minimal
# If still large, check for remaining large objects
git verify-pack -v .git/objects/pack/*.idx | \
sort -k 3 -n | \
tail -20
Issue: Lost Important Commits
Symptom: Realized after filtering that some commits should have been kept.
Recovery:
# IF you haven't pushed yet and didn't delete old clone:
# In old repo
git log --oneline # Find commits to preserve
git format-patch <commit-range>
# In filtered repo
git am *.patch # Apply patches
# IF you pushed and deleted old clone:
# Contact team members who might have old clone
# Or restore from backup if available
# Prevention: ALWAYS test filter on clone before pushing
Best Practices Checklist
Pre-Filter
- Create full backup of repository
- Test filter operation on separate clone
- Analyze repository to identify all instances of data to remove
- Coordinate with team about downtime
- Document filter operation for audit trail
- Ensure adequate disk space for operation
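The disk-space item can be scripted: filtering temporarily duplicates object storage, so a common rule of thumb is to require at least twice the current size of .git free. A sketch (the function name and the 2x threshold are my own assumptions):

```shell
# preflight <repo-dir>: succeed only if the filesystem holding the
# repository has at least twice the current .git size available
# (filtering temporarily duplicates object storage).
preflight() {
    need_kb=$(( 2 * $(du -sk "$1/.git" | cut -f1) ))
    free_kb=$(df -kP "$1" | awk 'NR==2 {print $4}')
    [ "$free_kb" -ge "$need_kb" ]
}

# Example: preflight /tmp/repo-cleanup && echo "enough space"
```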
During Filter
- Work on fresh clone, not production clone
- Use --analyze first to understand what will change
- Apply filters incrementally, verifying each step
- Keep terminal logs of filter operation
- Don’t interrupt filter process mid-operation
Post-Filter
- Verify sensitive data completely removed
- Test repository builds and functionality
- Run garbage collection to reclaim space
- Update documentation referencing old commit SHAs
- Force push with --force, not --force-with-lease (intentionally rewriting)
- Notify team immediately after push
- Monitor team migration progress
- Rotate compromised credentials if filtering sensitive data
- Archive or delete backup after successful migration
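The first verification item can be wrapped as a reusable check: `git log -S` finds any commit whose diff ever added or removed the string, complementing a tree-based `git grep`. A sketch (the function name is my own):

```shell
# assert_purged <string>: fail if any commit on any ref ever added or
# removed <string>; succeeds silently when the history is clean.
assert_purged() {
    if git log --all -S "$1" --oneline | grep -q .; then
        echo "'$1' still present in history" >&2
        return 1
    fi
}
```

Run it once per secret that was filtered; a non-zero exit means the filter's path specification was too narrow.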
Summary: When to Use Filter Operations
Use filter operations for:
- ✅ Removing accidentally committed secrets or sensitive data
- ✅ Extracting subdirectories into separate repositories
- ✅ Removing large files from entire history
- ✅ Correcting author information throughout history
- ✅ Repository reorganization (path restructuring)
- ✅ Cleaning up before open-sourcing private repository
Do NOT use filter operations for:
- ❌ Removing single recent commit (use git revert or git reset)
- ❌ Changing last commit message (use git commit --amend)
- ❌ Reordering recent commits (use interactive rebase)
- ❌ Cleaning up local feature branch (use rebase)
- ❌ Removing branches (use git branch -d)
Filter operations represent Git’s most comprehensive history rewriting capability. By understanding the architectural differences between deprecated filter-branch and modern filter-repo, mastering common transformation patterns, and following rigorous safety protocols, you can confidently perform repository-wide cleanup operations while minimizing risk to your team’s workflow and ensuring sensitive data removal when necessary.