Skip to main content

Unique / Distinct

Removing duplicate rows ensures data quality and accuracy in your analysis. Visitran's No-Code UI provides easy tools to identify and eliminate duplicates.

Overview

The DISTINCT operation removes duplicate rows from your dataset, keeping only unique combinations of values. This is essential for data quality, accurate counting, and proper analysis.

When to Use DISTINCT

Data Quality

Remove unintentional duplicates caused by:

  • Data loading errors
  • Multiple source systems
  • Incorrect joins
  • System glitches

Accurate Counting

Ensure correct counts when:

  • Counting unique customers
  • Analyzing unique products sold
  • Tracking distinct events
  • Reporting unique values

List Generation

Create clean lists of:

  • Unique categories
  • Distinct product codes
  • Available options
  • Valid values

How DISTINCT Works

DISTINCT compares all visible columns in each row. Two rows are considered duplicates if all visible columns have identical values.

Example:

Before DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100 | 2024-01-15 | 250.00 |
| 100 | 2024-01-15 | 250.00 | ← Duplicate
| 101 | 2024-01-16 | 180.00 |

After DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100 | 2024-01-15 | 250.00 |
| 101 | 2024-01-16 | 180.00 |

How to Apply DISTINCT in Visitran

Step-by-Step Instructions

  1. Select Your Model - Navigate to the model with duplicate rows
  2. Add DISTINCT Operation - Click "Add Distinct" or enable unique rows
  3. Choose Columns - Select which columns to consider for uniqueness
  4. Preview Results - View the deduplicated data
  5. Verify Count - Check that the expected number of rows remain
  6. Save - Apply the changes to your model

All Columns DISTINCT

Remove rows where all columns are identical.

Use When: You want exact duplicate row removal.

Specific Columns DISTINCT

Consider only certain columns for uniqueness.

Use When: You want to keep the first occurrence based on specific fields.

Common Use Cases

Unique Customer List

DISTINCT on: customer_id
Result: One row per customer

Product Catalog

DISTINCT on: sku, product_name
Result: One row per unique product

Available Categories

DISTINCT on: category
Result: List of all categories

Unique Dates

DISTINCT on: order_date
Result: All dates with at least one order

DISTINCT vs. GROUP BY

DISTINCTGROUP BY
Removes duplicatesGroups and aggregates
Keeps first occurrenceCalculates across group
Simple deduplicationSummary statistics
No aggregationRequires aggregation

Example:

DISTINCT:
Returns: One row per customer

GROUP BY customer_id:
Returns: One row per customer WITH counts, sums, etc.

Best Practices

  1. Understand Your Data - Know why duplicates exist before removing them
  2. Keep Source Data - Create a new model rather than modifying source
  3. Document Reasoning - Explain why DISTINCT is needed
  4. Verify Results - Always check row counts before and after
  5. Consider Ordering - If keeping "first" occurrence matters, sort first

Performance Considerations

  • Index Columns - Index columns used for DISTINCT checking
  • Reduce Columns - Fewer columns make DISTINCT faster
  • Early Filtering - Apply WHERE clauses before DISTINCT
  • Avoid Large Rows - Large text/binary columns slow down DISTINCT

Strategies for Handling Duplicates

Keep First Occurrence

1. Sort by priority column
2. Apply DISTINCT
3. First matching row is kept

Keep Last Occurrence

1. Sort by timestamp DESC
2. Apply DISTINCT
3. Most recent row is kept

Best Record Selection

1. Sort by quality indicators
2. Apply DISTINCT
3. Best quality record is kept

Advanced Deduplication

Conditional Deduplication

Remove duplicates based on certain conditions:

1. Filter to specific subset
2. Apply DISTINCT
3. Union with remaining data

Fuzzy Deduplication

For near-duplicates (require formulas or external tools):

  • Standardize text (UPPER, TRIM)
  • Remove special characters
  • Compare similarity scores

Time-Window Deduplication

Keep one record per time period:

1. Add time bucket column (e.g., date_trunc)
2. DISTINCT on ID + time bucket

Troubleshooting

Too Many Duplicates Removed

  • Review which columns are included in DISTINCT
  • Check if you need GROUP BY instead
  • Verify all relevant columns are considered

Duplicates Still Present

  • Check if hidden columns differ
  • Verify all necessary columns included
  • Review NULL handling (NULLs may be considered distinct)

Wrong Row Kept

  • Sort data before applying DISTINCT
  • Consider using GROUP BY with aggregation
  • Add priority column for sorting

Performance Issues

  • Reduce number of columns
  • Add indexes
  • Filter data first
  • Consider partitioning large datasets

NULL Value Handling

DISTINCT treats NULL values as unique:

  • Multiple NULL values in different rows may be considered distinct
  • Or all NULLs may be treated as identical (database-dependent)
  • Check your database's NULL handling behavior

For handling duplicates at row level:


Detailed screenshots showing the DISTINCT interface will be added soon.