
Duplicate file finder/Remover using perl and SHA1

When you use a computing device (a laptop, PC, or tablet) for personal use, after some time (say, a few years) you will realise that your disk is full and much of the space is occupied by duplicate files (the same copy of a file stored in different locations).

For example, you might have a favourite music file in your "My Favourite" folder as well as in the "Album" folder. Finding such duplicates manually is a huge task, especially if the file names are different.

There are lots of free utilities that do this automatically, but if you are a programmer, you will always prefer to do it on your own.

Here are the steps we are going to follow. This is written for a Linux (Ubuntu) system; for Windows you might need to adjust the paths accordingly.
  • Getting SHA1 for all the files recursively in a given directory
  • Compare SHA1 with other files
  • Remove the duplicate file
Getting SHA1 of a file

Using the CPAN module Digest::SHA1, we can compute the SHA1 digest of a file's contents as follows:

use Digest::SHA1 'sha1_hex';
use File::Slurp;

my $fdata = read_file($file);   # $file holds the path to the file
my $hash  = sha1_hex($fdata);

In the above code I used the read_file function provided by the File::Slurp module, which reads the whole file into memory.
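One caveat: slurping the whole file is fine for documents and music, but can be slow for multi-gigabyte files. If that is a concern, the core Digest::SHA module (a sibling of Digest::SHA1 that exports the same functions) can hash a file incrementally with addfile; a minimal sketch, with `file_sha1` being a helper name of my own choosing:

```perl
use strict;
use warnings;
use Digest::SHA;

# Hash a file without loading it all into memory.
# The argument 1 selects the SHA-1 algorithm.
sub file_sha1 {
    my ($path) = @_;
    return Digest::SHA->new(1)->addfile($path)->hexdigest;
}
```

This reads the file in chunks internally, so memory usage stays flat regardless of file size.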

Next, we need the SHA1 of every file in a directory, recursively. There are many modules on www.cpan.org for iterating over a directory, but my favourite is the File::Find module, which works much like the Unix find command.

use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir = "./";

# Calls process_file subroutine for each file
find({ wanted => \&process_file, no_chdir => 1 }, $dir);

sub process_file {
    my $file = $_;    # with no_chdir, $_ holds the full path
    print "Taking file $file\n";

    # Skip anything that is not a regular file (directories, symlinks, ...)
    if( -f $file ){
        my $fdata = read_file($file);
        my $hash  = sha1_hex($fdata);
    }
}
Finding the duplicates

Our next step is to find duplicates based on the SHA1 values computed above. I am going to use a hash ref whose keys are SHA1 values and whose values are array refs holding the corresponding file paths. Once all files are processed, we can identify duplicates simply by looking for arrays that hold more than one path.
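The grouping idea can be sketched in isolation (the paths and hash values below are made up purely for illustration):

```perl
use strict;
use warnings;

my $file_list = {};

# Pretend we already computed these (path => sha1) pairs
my %sha1_of = (
    '/music/fav/song.mp3'   => 'aaa111',
    '/music/album/song.mp3' => 'aaa111',   # same contents, same hash
    '/docs/report.pdf'      => 'bbb222',
);

# Group paths by hash: key = SHA1, value = array ref of paths
while ( my ($path, $hash) = each %sha1_of ) {
    push @{ $file_list->{$hash} }, $path;
}

# Any hash whose array holds more than one path is a duplicate group
my @dup_hashes = grep { scalar @{ $file_list->{$_} } > 1 } keys %$file_list;
print "Duplicate group: @{ $file_list->{$_} }\n" for @dup_hashes;
```

Because identical contents always produce the identical digest, the two copies of song.mp3 land under the same key even though their paths and names differ.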

use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir = "./";
my $file_list;

# Calls process_file subroutine for each file
find({ wanted => \&process_file, no_chdir => 1 }, $dir);

sub process_file {
    my $file = $_;
    print "Taking file $file\n";
    if( -f $file ){
        my $fdata = read_file($file);
        my $hash  = sha1_hex($fdata);

        push @{ $file_list->{$hash} }, $file;
    }
}

Removing the duplicates

Now we have the list of duplicate files. All that is left is removing those files while keeping one copy of each. Perl has a built-in function called unlink which removes a file at the given path.

unlink $file;
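unlink returns the number of files it actually deleted, so it is worth checking the return value and reporting $! on failure before trusting that a file is gone. A small sketch (the /tmp path is just a throwaway example):

```perl
use strict;
use warnings;

# Create a throwaway file so there is something to delete
my $file = '/tmp/dupe_demo.txt';
open(my $fh, '>', $file) or die "cannot create $file: $!";
print $fh "duplicate data\n";
close $fh;

# unlink returns how many files it removed (1 on success here)
if ( unlink($file) == 1 ) {
    print "REMOVED: $file\n";
}
else {
    warn "could not remove $file: $!";
}
```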

Now combine everything, add some print statements and options, and you get a nice utility script to remove duplicate files.

#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use File::Slurp;
use Digest::SHA1 'sha1_hex';

my $dir            = shift || './';
my $count          = 0;
my $file_list      = {};
my $dup_file_count = 0;
my $removed_count  = 0;

find({ wanted => \&process_file, no_chdir => 1 }, $dir);

foreach my $sha_hash (keys %{$file_list}){
    if( scalar(@{$file_list->{$sha_hash}}) > 1 ){

        # Number of duplicate files (all copies except the one we keep)
        $dup_file_count += scalar(@{$file_list->{$sha_hash}}) - 1;
        my $first_file = 1;
        foreach my $file (@{$file_list->{$sha_hash}}){
            # Keep the first file, delete the rest
            if($first_file){
                $first_file = 0;
                next;
            }
            if( unlink($file) == 1 ){
                print "REMOVED: $file\n";
                $removed_count++;
            }
        }
    }
}

print "********************************************************\n";
print "$count files processed\n";
print "$dup_file_count duplicate files found\n";
print "$removed_count duplicate files removed\n";
print "********************************************************\n";

sub process_file {
    my $file = $_;    # with no_chdir, $_ holds the full path

    if( -f $file ){
        my $fdata = read_file($file);
        my $hash  = sha1_hex($fdata);

        push @{ $file_list->{$hash} }, $file;
        $count++;

        # Flush output so the progress counter updates in place
        local $| = 1;
        print "Processing file: $count\r";
    }
}


The above script removes duplicate files in a given directory based on the SHA1 of their contents. Keep in mind that audio or video files downloaded from different sources may have different SHA1 values even when they sound or look the same. The script only removes byte-for-byte identical files; it has no intelligence to recognise the same video, audio, or image across formats. A human can instantly tell that two images show the same picture, but to the computer they are different files if, say, one has been compressed or resized.
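To make "byte-for-byte identical" concrete: even a single-byte difference yields a completely different SHA1, which is why two re-encodings of the same song never match. A quick illustration (using the core Digest::SHA module, which exports the same sha1_hex as Digest::SHA1):

```perl
use strict;
use warnings;
use Digest::SHA 'sha1_hex';

my $a = "the same song";
my $b = "the same song!";   # one extra byte

# Identical bytes always hash the same; a tiny change flips the digest
print "identical bytes match\n"        if sha1_hex($a) eq sha1_hex($a);
print "one byte changed, no match\n"   if sha1_hex($a) ne sha1_hex($b);
```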

