DataHog

Rationale and background:

Understanding how your storage space is being used is a key step in managing data.

This app builds a database of files stored in iRODS collections (such as the CyVerse data store), Amazon S3 buckets, or directories on your device, and allows you to search, sort, and compare them. It provides information about file sizes, types, and duplicated files.




Usage:

The launch page offers five options for importing file data into DataHog:

  1. iRODS: Use the iRODS API to import data from a specific collection. The options for importing files from the CyVerse data store are prefilled.
  2. .datahog File: Upload a .datahog file containing file data. These can be generated by a Python script which you can download and run on any machine.
  3. CyVerse: Use the CyVerse file search API to import any data stored in the data store. This method currently does not support exact duplicate matching, and may be slower than iRODS in some cases.
  4. S3 Bucket: Use your AWS access keys to import an S3 bucket, or a specific directory from one.
  5. Restore Database: If you previously backed up a DataHog database, you can upload it to restore your data.
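
To give a sense of the kind of file data a local scan can collect, here is a minimal Python sketch that walks a directory and records each file's relative path, size, and modification time. It is not the DataHog script and does not produce a .datahog file; the function name scan_directory and the record fields are assumptions chosen for illustration.

```python
import os


def scan_directory(root):
    """Walk `root` and return one record (dict) per regular file found.

    Hypothetical helper for illustration only; the real .datahog
    format is not described in this document.
    """
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            try:
                info = os.stat(full_path)
            except OSError:
                # Skip files that vanish or cannot be read during the scan.
                continue
            records.append({
                "path": os.path.relpath(full_path, root),
                "size": info.st_size,          # bytes
                "modified": info.st_mtime,     # seconds since the epoch
            })
    return records
```

Calling scan_directory("/some/folder") returns a list of dictionaries that could then be sorted by size or filtered by name.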

Depending on how many files are being scanned, the import process can take a few minutes to complete. Extremely large directories (millions of files) may take much longer; you can close the tab and check back later.

Once the import process for your first file source is complete, you will have access to four tabs:

  1. Summary: View a summary of each of your file sources, including various file rankings and visualizations.
  2. Browse Files: Explore the folder structure for each of your file sources, or search your files using names, regular expressions, or date and size filters. Each column header can be clicked to sort the table by that value.
  3. Duplicated Files: View a list of files with identical contents. By default, this page uses checksums to compare files, but file sizes or names can also be used. Each column header can be clicked to sort the table by that value.
  4. Manage File Sources: Import a new file source, remove an existing one, or download a backup of the current file database.
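
The idea behind checksum-based duplicate matching can be sketched in a few lines of Python. This is an illustration only, not DataHog's implementation: the function names are hypothetical, and SHA-256 is an assumed choice of checksum, since this document does not say which algorithm DataHog uses. Files are grouped by size first so that checksums are computed only for files that could possibly match.

```python
import hashlib
import os
from collections import defaultdict


def file_checksum(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()  # assumed algorithm, for illustration
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of paths under `root` whose contents are identical."""
    # Step 1: group by size; files of different sizes cannot be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Step 2: within each size group, group by checksum.
    by_checksum = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_checksum[file_checksum(path)].append(path)

    # Keep only checksums shared by two or more files.
    return [sorted(group) for group in by_checksum.values() if len(group) > 1]
```

Comparing by size or name alone, as the Duplicated Files tab also allows, is faster but can report files as duplicates when their contents differ.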

Mandatory arguments:

None

Author:

Chris Klimowski (UA Data Science Institute: Data7)