Unique / Distinct
Removing duplicate rows ensures data quality and accuracy in your analysis. Visitran's No-Code UI provides easy tools to identify and eliminate duplicates.
Overview
The DISTINCT operation removes duplicate rows from your dataset, keeping only unique combinations of values. This is essential for data quality, accurate counting, and proper analysis.
When to Use DISTINCT
Data Quality
Remove unintentional duplicates caused by:
- Data loading errors
- Multiple source systems
- Incorrect joins
- System glitches
Accurate Counting
Ensure correct counts when:
- Counting unique customers
- Analyzing unique products sold
- Tracking distinct events
- Reporting unique values
List Generation
Create clean lists of:
- Unique categories
- Distinct product codes
- Available options
- Valid values
How DISTINCT Works
DISTINCT compares all visible columns in each row. Two rows are considered duplicates if all visible columns have identical values.
Example:
Before DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100 | 2024-01-15 | 250.00 |
| 100 | 2024-01-15 | 250.00 | ← Duplicate
| 101 | 2024-01-16 | 180.00 |
After DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100 | 2024-01-15 | 250.00 |
| 101 | 2024-01-16 | 180.00 |
How to Apply DISTINCT in Visitran
Step-by-Step Instructions
- Select Your Model - Navigate to the model with duplicate rows
- Add DISTINCT Operation - Click "Add Distinct" or enable unique rows
- Choose Columns - Select which columns to consider for uniqueness
- Preview Results - View the deduplicated data
- Verify Count - Check that the expected number of rows remain
- Save - Apply the changes to your model
All Columns DISTINCT
Remove rows where all columns are identical.
Use When: You want exact duplicate row removal.
Specific Columns DISTINCT
Consider only certain columns for uniqueness.
Use When: You want to keep the first occurrence based on specific fields.
Common Use Cases
Unique Customer List
DISTINCT on: customer_id
Result: One row per customer
Product Catalog
DISTINCT on: sku, product_name
Result: One row per unique product
Available Categories
DISTINCT on: category
Result: List of all categories
Unique Dates
DISTINCT on: order_date
Result: All dates with at least one order
DISTINCT vs. GROUP BY
| DISTINCT | GROUP BY |
|---|---|
| Removes duplicates | Groups and aggregates |
| Keeps first occurrence | Calculates across group |
| Simple deduplication | Summary statistics |
| No aggregation | Requires aggregation |
Example:
DISTINCT:
Returns: One row per customer
GROUP BY customer_id:
Returns: One row per customer WITH counts, sums, etc.
Best Practices
- Understand Your Data - Know why duplicates exist before removing them
- Keep Source Data - Create a new model rather than modifying source
- Document Reasoning - Explain why DISTINCT is needed
- Verify Results - Always check row counts before and after
- Consider Ordering - If keeping "first" occurrence matters, sort first
Performance Considerations
- Index Columns - Index columns used for DISTINCT checking
- Reduce Columns - Fewer columns make DISTINCT faster
- Early Filtering - Apply WHERE clauses before DISTINCT
- Avoid Large Rows - Large text/binary columns slow down DISTINCT
Strategies for Handling Duplicates
Keep First Occurrence
1. Sort by priority column
2. Apply DISTINCT
3. First matching row is kept
Keep Last Occurrence
1. Sort by timestamp DESC
2. Apply DISTINCT
3. Most recent row is kept
Best Record Selection
1. Sort by quality indicators
2. Apply DISTINCT
3. Best quality record is kept
Advanced Deduplication
Conditional Deduplication
Remove duplicates based on certain conditions:
1. Filter to specific subset
2. Apply DISTINCT
3. Union with remaining data
Fuzzy Deduplication
For near-duplicates (require formulas or external tools):
- Standardize text (UPPER, TRIM)
- Remove special characters
- Compare similarity scores
Time-Window Deduplication
Keep one record per time period:
1. Add time bucket column (e.g., date_trunc)
2. DISTINCT on ID + time bucket
Troubleshooting
Too Many Duplicates Removed
- Review which columns are included in DISTINCT
- Check if you need GROUP BY instead
- Verify all relevant columns are considered
Duplicates Still Present
- Check if hidden columns differ
- Verify all necessary columns included
- Review NULL handling (NULLs may be considered distinct)
Wrong Row Kept
- Sort data before applying DISTINCT
- Consider using GROUP BY with aggregation
- Add priority column for sorting
Performance Issues
- Reduce number of columns
- Add indexes
- Filter data first
- Consider partitioning large datasets
NULL Value Handling
DISTINCT treats NULL values as unique:
- Multiple NULL values in different rows may be considered distinct
- Or all NULLs may be treated as identical (database-dependent)
- Check your database's NULL handling behavior
Related Operations
- Joining Tables - Often needed before DISTINCT to combine data
- Adding Aggregations - Alternative to DISTINCT for summarization
- Sorting - Often used before DISTINCT to control which row is kept
Related Formulas
For handling duplicates at row level:
- COALESCE - Handle NULL values
- Conditional Logic - Mark duplicates
Detailed screenshots showing the DISTINCT interface will be added soon.