
Junk Drawer

Introduction
analyst: "Is there something that just automatically imports files into a database?"
nerd: "No. Use the data import wizard."
The data analyst sighed, remembering the wizard...
Using the wizard to import a text file goes something like this:
  • Install SQL Server Management Studio.
  • Find Tasks and choose Import Data.
  • Select "Flat File Source."
  • Browse for the file.
  • Preview the data.
  • Specify the delimiter.
  • Specify if the first row is column names.
  • Preview the data (again).
  • Go to each column and choose the correct data type or use "Suggest Types" feature.
  • Choose if you want to save the SSIS package for later.
  • Execute it.
nerd: "The wizard helps you import any kind of file."
analyst: "That's great. But, it would be better if I didn't have to answer so many questions and the program just analyzed and imported the file for me."
nerd: "You're dreaming. You have to use the wizard. Go back to your cube. Press the keys. Click the mouse. Do this until the Wizard says it successfully imported the file. If you get the fail message, you did something wrong. Respond to the error message and go back and fix it."
analyst: "I get a lot of slightly different files. Using the wizard is repetitive. This wastes a lot of my time..."
At this, the nerd put the data analyst in a head lock. Balling up his fist, he pressed his knuckles hard against the analyst's head; rubbing back and forth, causing a great deal of painful friction.
nerd: "You come to my cube, without a ticket, complaining about YOUR time being wasted?"
The nerd pushed the analyst away from him. As the analyst fell forward, the nerd Tae Kwon Do'd him squarely in the buttocks.
nerd: "Now get out of here!"
Humiliated, and unable to match the nerd's mightiness, the data analyst sulked back to his cubicle.
---
Sadly, this scenario happens a lot in IT offices.
Just recently, while I was forcing an IT staff member to learn SQL, he asked me something similar:
staff: "How do I get the files into the database so I can query them?"
I began to explain the data import wizard (as described above), but it didn't feel right. I felt bad that he'd have to run the wizard every time. I knew it would take him forever. Also, I knew the conversation would most likely end in a destructive battle (as in the scenario depicted above). So, I thought there might be a better way.
---
I always tell people that if you can think of it, chances are it already exists. You just have to Google for it. I should have followed my own advice, but I didn't. See, I'm a programmer. I have these times where all I want to do is program. I'm not even a great programmer (as indicated by the comments below), but I just love it, and I don't care what anyone thinks because it makes me happy, happy, happy.
So, I decided to bypass the Google search and create an open source project called Junk Drawer. As it turns out, the name Junk Drawer is already used for a variety of things, but it's too late to change the name. Anyway, its goal is to make the majority of Excel or text file importing a one step, no-brainer process. It should also reduce the likelihood of terrible brawls between data analysts and nerds.

Requirements

I don't want staff members to actually need a copy of Junk Drawer installed on their computers. I just want them to drop a file on a network share and have it imported. For myself, since I'll obviously have the program, I'd like a right-click option to "Send to Junk Drawer." For both requirements, all I need is a .NET Console application. I can enable the file monitoring and action trigger with any capable job automation software (e.g. VisualCron), and I can do the right-click thing with a registry hack.
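For the curious, the right-click registry hack follows the standard Windows shell verb convention. Here's a rough sketch of the .reg file (the install path and verb name are hypothetical; point it at wherever your jd.exe actually lives):

```
Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\*\shell\Send to Junk Drawer\command]
@="\"C:\\JunkDrawer\\jd.exe\" \"%1\""
```

The `*` key applies the verb to all file types, and `%1` passes the right-clicked file's full path as the first argument to jd.exe.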

A Demo

Before you run Junk Drawer for the first time, make sure you create a "Junk" database for all the files you're going to import. Then, put a SQL Server connection string in the configuration file jd.exe.config.
<transformalize>
  <processes>
    <add name="JunkDrawer">
      <connections>
        <add name="output" connection-string="connection-string" />
      </connections>
    </add>
  </processes>
</transformalize>
Here are the contents of a sample text file:
Name,Birthday,Points
Dale,3/3/1981 9 AM,73
Tara,12/31/1990,1042
Grace,9/9/2000 11 PM,56
Gavin,7/3/2010,13
To import it, just run:
jd c:\sample.txt 
Junk Drawer (jd.exe) imports the file, and now I can go query it:
SELECT Name, Birthday, Points FROM sample;

Name  Birthday                Points
----- ----------------------- -----------
Dale  1981-03-03 09:00:00.000 73
Gavin 2010-07-03 00:00:00.000 13
Tara  1990-12-31 00:00:00.000 1042
Grace 2000-09-09 23:00:00.000 56
The table structure looks something like this:
CREATE TABLE sample(
    Birthday DATETIME,
    Name NVARCHAR(5),
    Points INT
);

How Does it Work?

To be able to import a text file into a database, JD has to figure out three things:
  1. the delimiter
  2. the column names (if available)
  3. the column data types
For Excel files, you can skip the first step.

Delimiters

First, a number of lines are loaded from the file. Then, popular delimiters are counted in each line. If a delimiter appears at least once, and the same number of times in every line, it's a strong candidate. That is to say, if the same number of commas is found in each line, then odds are this is a comma-delimited file.
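The idea can be sketched in a few lines of Python (this is just an illustration of the technique; the actual C# implementation lives in the Junk Drawer source, and its candidate list and tie-breaking may differ):

```python
# Delimiters Junk Drawer checks for (per the article: commas, pipes,
# semi-colons, and tabs).
DELIMITERS = [",", "|", ";", "\t"]

def detect_delimiter(lines):
    """Return the delimiter that appears a non-zero, identical number
    of times on every sampled line, or None if nothing qualifies."""
    candidates = []
    for d in DELIMITERS:
        counts = [line.count(d) for line in lines]
        if counts[0] > 0 and len(set(counts)) == 1:
            candidates.append((counts[0], d))
    # If more than one qualifies, prefer the one appearing most per line.
    return max(candidates)[1] if candidates else None
```

Running this against the sample file's first few lines picks out the comma, since every line contains exactly two of them.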

Column Names

To determine column names, the first row is split by the delimiter and tested. We don't know whether it contains column names or is merely the first record in the file. So, I run these tests:
  • Are there any duplicate field values?
  • Are there any empty values?
  • Are there any white space values?
  • Are there any numeric values?
  • Are there any date time values?
If any of the answers are "Yes," then the first line cannot be used as column names. If this happens, default column names are generated (i.e. A, B, C, etc.). In the example above, Name, Birthday, and Points answer "No" to all these questions and make good column names.
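The five tests above boil down to a simple disqualification check. Here's a Python sketch of the idea (illustrative only; the date formats checked here are a small assumed subset, and Junk Drawer's real C# checks are more thorough):

```python
from datetime import datetime

def is_date(value):
    """Crude date check against a couple of assumed formats."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            pass
    return False

def looks_like_header(first_row):
    """A row qualifies as column names only if it has no duplicates,
    no empty or whitespace-only values, no numbers, and no dates."""
    if len(set(first_row)) != len(first_row):   # duplicate field values
        return False
    for value in first_row:
        if value == "" or value.isspace():       # empty / white space
            return False
        try:
            float(value)                         # numeric value
            return False
        except ValueError:
            pass
        if is_date(value):                       # date time value
            return False
    return True
```

With the sample file, `["Name", "Birthday", "Points"]` passes every test, while the first data record fails on the numeric and date checks.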

Data Types

Rather than write new code to efficiently run through a file and check data types, I have elected to use a library called Transformalize. The guy who wrote it is AWESOME! Well, I'm just kidding, he's not that awesome. It's based on Rhino ETL, which was created by Ayende, who actually IS awesome! So, it inherits some degree of awesomeness.
Given that we know the file location, the delimiter, and the column names, I am able to "configure" (not code) Transformalize to load all the records and perform "type conversion" validation on every value.
If every value in the column passes a data type validation, then I use that data type for database storage. For example, if all values in a column respond "Yes" to the question "Are you an integer?", then I store it in an INTEGER data type. If there are mixed or failing validation results, I default to NVARCHAR(x). Anything can go into an NVARCHAR data type. It's just a variable number of Unicode characters, where x is set to the maximum length found in the column.
The data types are the final piece of information we need to import the file into a database. Again, instead of writing new code, I configure Transformalize to take care of importing the file into the database.
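To make the "every value must pass" rule concrete, here's a stripped-down Python sketch. It only tries integers and floats before falling back to a sized string; the real pipeline (via Transformalize) also validates booleans, dates, and the other types listed in the configuration:

```python
def infer_type(values):
    """Return the first type that every value converts to; otherwise
    fall back to NVARCHAR(x), where x is the longest value's length."""
    for name, convert in (("INT", int), ("REAL", float)):
        try:
            for v in values:
                convert(v)        # "Are you an integer?" etc.
            return name
        except ValueError:
            pass                  # mixed/failing results: try next type
    return "NVARCHAR(%d)" % max(len(v) for v in values)
```

Applied to the sample file, the Points column comes back as an integer type, while the Name column falls through to NVARCHAR(5), matching the table structure shown earlier.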

Some Code?

There's nothing spectacular in the source code. But this site is called CodeProject, so it's kind of frowned upon not to include code in your article. So, we'll just take a look at its entry point.
public class Program {

    static void Main(string[] args) {

        GlobalDiagnosticsContext.Set("process", Common.LogLength("JunkDrawer", 16));
        GlobalDiagnosticsContext.Set("entity", Common.LogLength("Request"));

        var request = new Request(args);

        if (!request.IsValid) {
            LogManager.GetLogger(string.Empty).Error(request.Message);
            Environment.Exit(1);
        }

        new FileImporter().Import(request.FileInfo);
    }
}
There are three things happening:
  1. I'm setting some global variables for NLog.
  2. I'm validating the Request:
    1. Did the user pass in a file name?
    2. Is the file there?
    3. Is it in use? I give up after 5 tries.
  3. I'm importing the file with FileImporter. Note: it's in its own library, so you can reference and use it in your own program if you want.
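The "is it in use?" check with five tries is a common retry pattern. A minimal sketch, in Python for illustration (the function name and delay are my own; the real Request validation is C#, and Python's open() won't observe Windows sharing locks the same way):

```python
import time

def wait_for_file(path, attempts=5, delay=1.0):
    """Try to open the file for reading; give up after `attempts` tries,
    sleeping `delay` seconds between each failed attempt."""
    for _ in range(attempts):
        try:
            with open(path, "rb"):
                return True       # file exists and is readable
        except OSError:
            time.sleep(delay)     # missing or locked; wait and retry
    return False
```

If all five attempts fail, the request is invalid and the program logs the error and exits, as in Main above.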
If you're curious, you're welcome to explore the source code. More likely, the only time you'll want to look at it is when you run Junk Drawer and it doesn't import your file correctly. For example, I'm only searching for commas, pipes, semi-colons, and tabs as delimiters... You might have files that are delimited by 6's or something, I don't know. So, you could create a fork on GitHub and add "6" to the list of delimiters to check for, or better yet, make it configurable!

Conclusion

Well, that's Junk Drawer in a nutshell. I called it Junk Drawer because importing files directly into a database can get messy. It ends up an uncontrolled staging area for data. Nevertheless, it empowers your trusted friends to put some data in there and run ad-hoc queries to their hearts' content.
Now, if you're the acting DBA and you put your "Junk" database on an important server, make sure you have monitors in place for disk space, CPU abuse, and excessive resource blocking. You should have all this set up regardless of whether you run Junk Drawer, but especially if you do. Remember to control access to both your network share and your Junk database with Active Directory groups. :)
Thanks for reading.

Updates

More Configuration

Since the original article, I added more configuration. Here's a sample:
<junkdrawer>
  <types sample="100" default="string" default-length="1024">
    <add type="boolean" />
    <add type="byte" />
    <add type="int16" />
    <add type="int32" />
    <add type="int64" />
    <add type="single" />
    <add type="double" />
    <add type="datetime" />
  </types>
</junkdrawer>
The types collection gives you control over which data types are checked for. You may remove them all. If you do, every field will assume the default type (e.g. string). This can be useful if you don't care about types, or are troubleshooting.
The sample attribute allows you to reduce the amount of data type inspection. The lower the percentage, the fewer records (and fields) get checked. You may set this to 100, or remove it altogether if you want to check everything (the original behavior).
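One simple way to implement percentage-based sampling is to keep every Nth row. This is only my guess at the mechanics for illustration; the sample attribute's exact math inside Junk Drawer/Transformalize may differ:

```python
def sample_rows(rows, sample_percent=100):
    """Keep roughly sample_percent of the rows for type inspection
    by taking every Nth row; 100 (or more) means check everything."""
    if sample_percent >= 100:
        return list(rows)
    step = max(1, round(100 / sample_percent))
    return rows[::step]
```

So sample="10" would inspect roughly one row in ten, trading type-detection accuracy for speed on large files.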