Over the past few months, I've been rewriting the data cleaning process for the number crunching, and the workflow is now considerably simpler and faster. It has kept me very busy, since I rewrote the entire process while continuing to work on and use the old one. Based on the speed differences I'm seeing, it will be worth the time spent. I'm now at the point where I need to group the differences between the results of the old and new systems into common cases and resolve each group.
My goal for this stage of the project is to cut over to the new process as soon as it produces results as good as or better than the existing process. Currently, 29,445 records have differing results out of a total of 1,049,285 records (with thousands more coming every month), split across eight or so different data sources. I am expecting the number of differences to change after doing one last pass of the old system. Hopefully the number of differences drops dramatically. Once all of the differences are accounted for and addressed, I can flip over to the new system, which would save me a ton of time every month on the number crunching.
What I need is help digging through the records with differing results, identifying what shortcomings there are in the new system, and, in some cases, defining what "right" looks like. Don't worry about the programming side of things, as I'll be doing all of the coding. I'm not looking for a huge time investment on this, just someone with attention to detail who is willing to help me break things down into common cases and concrete actions for dealing with them.
I'd really like to be on the new system before the April 2013 sales data gets released in about four weeks. But I can't do that until the results of the new system are at least as good as the current system.
Is anybody willing to help me on this project?
Need help with data cleaning
Moderator: JohnMayo
Re: Need help with data cleaning
I'd certainly be willing to help if I am qualified. I'm not exactly sure what you need done but if I can do something useful, I will.
Re: Need help with data cleaning
If I knew any way I could be of assistance I would offer it without delay. Sadly, my technical prowess and expertise lie in my ability to turn my computer on and then, when needed, put it to sleep. Sorry, John.
Re: Need help with data cleaning
Here is an example of what I'm talking about:
The data clean up takes in an ItemCode and Description from a Data Source, performs a series of data clean up steps on them, and then spits out a Publisher, Title, TitleAddendum, Format, VolumeNumber, SubTitle, IssueNumber, IssueText and Variation.
For nearly 30,000 records, the output of the two systems differs. I need to go through all of the cases, group them into common problems and then fix those problems. In some cases, the old system will be wrong. In some cases, the new system will be wrong. In some cases, both systems will be wrong. In this case, the old system got the Variation wrong.
The actual data will be in a flatter Excel file format. I re-arranged the data in that image to make it easy to see what I'm talking about.
So no technical skill is really needed beyond being able to open up an Excel file and navigate in it.
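For anyone curious about the grouping step, here is a rough Python sketch of the idea: compare the two systems' outputs field by field and bucket the item codes by which fields disagree. This is only an illustration, not the actual code. The field names come from the list above, but the function names, the dictionary layout and the sample records are all made up.

```python
from collections import defaultdict

# Output fields listed in the post.
FIELDS = ["Publisher", "Title", "TitleAddendum", "Format", "VolumeNumber",
          "SubTitle", "IssueNumber", "IssueText", "Variation"]


def differing_fields(old, new):
    """Return the fields whose values differ between the old and new output."""
    return tuple(f for f in FIELDS if old.get(f) != new.get(f))


def group_differences(old_results, new_results):
    """Bucket ItemCodes by the set of fields on which the two systems disagree.

    Both arguments are assumed (hypothetically) to be dicts that map an
    ItemCode to a dict of output fields.
    """
    groups = defaultdict(list)
    for item_code, old in old_results.items():
        new = new_results.get(item_code)
        if new is None:
            # The record has no counterpart in the new system's output.
            groups[("<missing in new>",)].append(item_code)
            continue
        diff = differing_fields(old, new)
        if diff:  # identical records are skipped
            groups[diff].append(item_code)
    return dict(groups)


# Made-up sample records, purely for illustration.
old_results = {
    "A1": {"Title": "Foo", "IssueNumber": "1", "Variation": "Var A"},
    "A2": {"Title": "Bar", "IssueNumber": "2", "Variation": ""},
}
new_results = {
    "A1": {"Title": "Foo", "IssueNumber": "1", "Variation": "Variant A"},
    "A2": {"Title": "Bar", "IssueNumber": "2", "Variation": ""},
}

for fields, item_codes in group_differences(old_results, new_results).items():
    print(fields, len(item_codes), item_codes)
```

Each bucket is one candidate "common case" to look at: open a few of its records, decide which system is right, and write down the fix.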
Re: Need help with data cleaning
Perry wrote: If I knew any way I could be of assistance I would offer it without delay. Sadly, my technical prowess and expertise lie in my ability to turn my computer on and then, when needed, put it to sleep. Sorry, John.
Not a problem. I appreciate it just the same.
Re: Need help with data cleaning
Sorry I missed this post a few weeks ago, but if you still need some help with this, let me know. I see what you're trying to do in your example. I can't guarantee any set number of hours, but I can probably commit to at least a few hours per week to crank through some of this.
Re: Need help with data cleaning
I'm at the point where the first round of grouping has been done, and now I need to implement those changes. Once I've done that, hopefully I can cut over to the new process and then start a validation pass to check that everything is coming out looking good. For that, extra sets of eyes would be extremely helpful.