Need help with data cleaning

This is the place to discuss the episodes of the Comic Book Page podcast, the Comic Book Page website or pretty much anything else of interest to the Comic Book Page community...

Moderator: JohnMayo

JohnMayo
Host/Owner
Posts: 3294
Joined: Mon Mar 12, 2007 3:12 pm
Location: Texas

Need help with data cleaning

Post by JohnMayo »

Over the past few months, I've been rewriting the data cleaning process for the number crunching. I've simplified and sped up the workflow considerably. It has kept me very busy, since I rewrote the entire process while continuing to work on and use the old one. Based on the speed differences I'm seeing, it will be worth the time spent. I'm now at the point where I need to group the differences between the results of the old and the new systems into common cases and resolve each group.

My goal for this stage of the project is to cut over to the new process as soon as it produces results as good as or better than the existing process. Currently, there are 29,445 records with differing results out of a total of 1,049,285 records (with thousands more coming every month) split across eight or so different data sources. I am expecting the number of differences to change after doing one last pass of the old system. Hopefully the number of differences drops dramatically. Once all of the differences are accounted for and addressed, I can flip over to the new system, which would save me a ton of time every month on the number crunching.

What I need is help digging through the records with differing results and identifying what shortcomings there are in the new system and to help me define what "right" looks like in some cases. Don't worry about the programming side of things as I'll be doing all of the coding. I'm not looking for a huge time investment on this, just someone with an attention to detail willing to help me break things down to common cases and concrete actions for dealing with them.

I'd really like to be on the new system before the April 2013 sales data gets released in about four weeks. But I can't do that until the results of the new system are at least as good as the current system.

Is anybody willing to help me on this project?
Comic Book Page: Website || Podcast || RSS || Episodes Archive
Gilgabob
Special Reviewer
Posts: 356
Joined: Thu Jun 02, 2011 7:28 pm
Location: Chicago

Re: Need help with data cleaning

Post by Gilgabob »

I'd certainly be willing to help if I am qualified. I'm not exactly sure what you need done but if I can do something useful, I will.
Perry
Special Reviewer
Posts: 489
Joined: Sun Feb 13, 2011 7:02 am
Location: Virginia Beach, VA

Re: Need help with data cleaning

Post by Perry »

If I knew any way I could be of assistance I would offer it without delay. Sadly, my technical prowess and expertise lie in my ability to turn my computer on and then, when needed, put it to sleep. Sorry, John.
:(
JohnMayo
Host/Owner
Posts: 3294
Joined: Mon Mar 12, 2007 3:12 pm
Location: Texas

Re: Need help with data cleaning

Post by JohnMayo »

Here is an example of what I'm talking about:

[Image: example of a record where the old and new systems produce different results]

The data clean up takes in an ItemCode and Description from a Data Source, performs a series of data clean up steps on it, and then spits out a Publisher, Title, TitleAddendum, Format, VolumeNumber, SubTitle, IssueNumber, IssueText and Variation.

For nearly 30,000 records, the output of the two systems differs. I need to go through all of the cases, group them into common problems and then fix those problems. In some cases, the old system will be wrong. In some cases, the new system will be wrong. In some cases, both systems will be wrong. In this case, the old system got the Variation wrong.
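As a rough illustration of the grouping step, here is a minimal Python sketch that buckets differing records by which output fields disagree. It is not the actual cleaning code: the record layout (one dict per system per record), the function names and the item codes are all assumptions made up for the example. Only the field names come from the description above.

```python
from collections import defaultdict

# Output fields named in the post. Everything else here is hypothetical.
FIELDS = ["Publisher", "Title", "TitleAddendum", "Format", "VolumeNumber",
          "SubTitle", "IssueNumber", "IssueText", "Variation"]


def differing_fields(old, new):
    """Return a tuple of the field names whose values differ."""
    return tuple(f for f in FIELDS if old.get(f) != new.get(f))


def group_differences(records):
    """Group (item_code, old_output, new_output) triples by differing fields.

    Records where the two systems agree on every field are skipped.
    """
    groups = defaultdict(list)
    for item_code, old, new in records:
        diff = differing_fields(old, new)
        if diff:
            groups[diff].append(item_code)
    return dict(groups)


if __name__ == "__main__":
    # Made-up sample data, not real records.
    sample = [
        ("HYPO001", {"Title": "EXAMPLE", "Variation": ""},
                    {"Title": "EXAMPLE", "Variation": "VARIANT"}),
        ("HYPO002", {"Title": "EXAMPLE", "IssueNumber": "1"},
                    {"Title": "EXAMPLE", "IssueNumber": "1"}),
    ]
    # Show the largest groups first, since they are the most common cases.
    grouped = group_differences(sample)
    for fields, codes in sorted(grouped.items(), key=lambda kv: -len(kv[1])):
        print(", ".join(fields), "->", len(codes), "record(s)")
```

Grouping on the set of differing fields is only a first cut. Records in the same bucket can still differ for unrelated reasons, so each bucket would need a look by eye to split it into real common cases.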

The actual data will be in a flatter Excel file format. I re-arranged the data in that image to make it easy to see what I'm talking about.

So no technical skill is really needed beyond being able to open up an Excel file and navigate in it.
JohnMayo
Host/Owner
Posts: 3294
Joined: Mon Mar 12, 2007 3:12 pm
Location: Texas

Re: Need help with data cleaning

Post by JohnMayo »

Perry wrote: If I knew any way I could be of assistance I would offer it without delay. Sadly, my technical prowess and expertise lie in my ability to turn my computer on and then, when needed, put it to sleep. Sorry, John.
:(
Not a problem. I appreciate it just the same.
boshuda
Special Reviewer
Posts: 341
Joined: Mon Apr 04, 2011 8:59 am
Location: Western NY

Re: Need help with data cleaning

Post by boshuda »

Sorry I missed this post a few weeks ago, but if you still need some help with this, let me know. I see what you're trying to do in your example. I can't guarantee any set number of hours, but I can probably commit to at least a few hours per week to crank through some of this.
JohnMayo
Host/Owner
Posts: 3294
Joined: Mon Mar 12, 2007 3:12 pm
Location: Texas

Re: Need help with data cleaning

Post by JohnMayo »

I'm at the point where the first round of grouping has been done and now I need to implement those changes. Once I've done that, hopefully I can cut over to the new process and then start a pass of validating everything is coming out looking good. For that, extra sets of eyes would be extremely helpful.