Unique / Distinct

Removing duplicate rows ensures data quality and accuracy in your analysis. Visitran's No-Code UI provides easy tools to identify and eliminate duplicates.

Overview

The DISTINCT operation removes duplicate rows from your dataset, keeping only unique combinations of values. This is essential for data quality, accurate counting, and proper analysis.

When to Use DISTINCT

Data Quality

Remove unintentional duplicates caused by:

Data loading errors
Multiple source systems
Incorrect joins
System glitches

Accurate Counting

Ensure correct counts when:

Counting unique customers
Analyzing unique products sold
Tracking distinct events
Reporting unique values

List Generation

Create clean lists of:

Unique categories
Distinct product codes
Available options
Valid values

How DISTINCT Works

DISTINCT compares all visible columns in each row. Two rows are considered duplicates if all visible columns have identical values.

Example:

Before DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100        | 2024-01-15 | 250.00 |
| 100        | 2024-01-15 | 250.00 | ← Duplicate
| 101        | 2024-01-16 | 180.00 |

After DISTINCT:
| customer_id | order_date | amount |
|------------|------------|--------|
| 100        | 2024-01-15 | 250.00 |
| 101        | 2024-01-16 | 180.00 |

How to Apply DISTINCT in Visitran

Step-by-Step Instructions

Select Your Model - Navigate to the model with duplicate rows
Add DISTINCT Operation - Click "Add Distinct" or enable unique rows
Choose Columns - Select which columns to consider for uniqueness
Preview Results - View the deduplicated data
Verify Count - Check that the expected number of rows remain
Save - Apply the changes to your model

All Columns DISTINCT

Remove rows where all columns are identical.

Use When: You want exact duplicate row removal.

Specific Columns DISTINCT

Consider only certain columns for uniqueness.

Use When: You want to keep the first occurrence based on specific fields.

Common Use Cases

Unique Customer List

DISTINCT on: customer_id
Result: One row per customer

Product Catalog

DISTINCT on: sku, product_name
Result: One row per unique product

Available Categories

DISTINCT on: category
Result: List of all categories

Unique Dates

DISTINCT on: order_date
Result: All dates with at least one order

DISTINCT vs. GROUP BY

DISTINCT	GROUP BY
Removes duplicates	Groups and aggregates
Keeps first occurrence	Calculates across group
Simple deduplication	Summary statistics
No aggregation	Requires aggregation

Example:

DISTINCT:
  Returns: One row per customer

GROUP BY customer_id:
  Returns: One row per customer WITH counts, sums, etc.

Best Practices

Understand Your Data - Know why duplicates exist before removing them
Keep Source Data - Create a new model rather than modifying source
Document Reasoning - Explain why DISTINCT is needed
Verify Results - Always check row counts before and after
Consider Ordering - If keeping "first" occurrence matters, sort first

Performance Considerations

Index Columns - Index columns used for DISTINCT checking
Reduce Columns - Fewer columns make DISTINCT faster
Early Filtering - Apply WHERE clauses before DISTINCT
Avoid Large Rows - Large text/binary columns slow down DISTINCT

Strategies for Handling Duplicates

Keep First Occurrence

Sort by priority column
Apply DISTINCT
First matching row is kept

Keep Last Occurrence

Sort by timestamp DESC
Apply DISTINCT
Most recent row is kept

Best Record Selection

Sort by quality indicators
Apply DISTINCT
Best quality record is kept

Advanced Deduplication

Conditional Deduplication

Remove duplicates based on certain conditions:

Filter to specific subset
Apply DISTINCT
Union with remaining data

Fuzzy Deduplication

For near-duplicates (require formulas or external tools):

Standardize text (UPPER, TRIM)
Remove special characters
Compare similarity scores

Time-Window Deduplication

Keep one record per time period:

1. Add time bucket column (e.g., date_trunc)
2. DISTINCT on ID + time bucket

Troubleshooting

Too Many Duplicates Removed

Review which columns are included in DISTINCT
Check if you need GROUP BY instead
Verify all relevant columns are considered

Duplicates Still Present

Check if hidden columns differ
Verify all necessary columns included
Review NULL handling (NULLs may be considered distinct)

Wrong Row Kept

Sort data before applying DISTINCT
Consider using GROUP BY with aggregation
Add priority column for sorting

Performance Issues

Reduce number of columns
Add indexes
Filter data first
Consider partitioning large datasets

NULL Value Handling

DISTINCT treats NULL values as unique:

Multiple NULL values in different rows may be considered distinct
Or all NULLs may be treated as identical (database-dependent)
Check your database's NULL handling behavior

Joining Tables - Often needed before DISTINCT to combine data
Adding Aggregations - Alternative to DISTINCT for summarization
Sorting - Often used before DISTINCT to control which row is kept

For handling duplicates at row level:

COALESCE - Handle NULL values
Conditional Logic - Mark duplicates

Detailed screenshots showing the DISTINCT interface will be added soon.

Overview​

When to Use DISTINCT​

Data Quality​

Accurate Counting​

List Generation​

How DISTINCT Works​

Example:​

How to Apply DISTINCT in Visitran​

Step-by-Step Instructions​

All Columns DISTINCT​

Specific Columns DISTINCT​

Common Use Cases​

Unique Customer List​

Product Catalog​

Available Categories​

Unique Dates​

DISTINCT vs. GROUP BY​

Example:​

Best Practices​

Performance Considerations​

Strategies for Handling Duplicates​

Keep First Occurrence​

Keep Last Occurrence​

Best Record Selection​

Advanced Deduplication​

Conditional Deduplication​

Fuzzy Deduplication​

Time-Window Deduplication​

Troubleshooting​

Too Many Duplicates Removed​

Duplicates Still Present​

Wrong Row Kept​

Performance Issues​

NULL Value Handling​

Related Operations​

Related Formulas​

Overview

When to Use DISTINCT

Data Quality

Accurate Counting

List Generation

How DISTINCT Works

Example:

How to Apply DISTINCT in Visitran

Step-by-Step Instructions

All Columns DISTINCT

Specific Columns DISTINCT

Common Use Cases

Unique Customer List

Product Catalog

Available Categories

Unique Dates

DISTINCT vs. GROUP BY

Example:

Best Practices

Performance Considerations

Strategies for Handling Duplicates

Keep First Occurrence

Keep Last Occurrence

Best Record Selection

Advanced Deduplication

Conditional Deduplication

Fuzzy Deduplication

Time-Window Deduplication

Troubleshooting

Too Many Duplicates Removed

Duplicates Still Present

Wrong Row Kept

Performance Issues

NULL Value Handling

Related Operations

Related Formulas