Q-Built Solutions Makes Quick, Quality-Built Software For Your Business Needs!
 

Data Cleanup

Place your ad here! This domain gets 400 visitors per day and has Google Page Rank of 3 or 4.  Get great ad rates!

Place your ad here! This domain gets 400 visitors per day and has Google Page Rank of 3 or 4.  Get great ad rates!


 
Please help us write more free software and tips for this site by donating.
Please donate today!


Q-Built Solutions Web Statistics
.

 
Statistics As Of:
26 Mar. '07

Number of unique visitors since 20 March '04:
204,793

Number of Web pages served since 20 March '04:
429,683

Our Most Popular Web Pages:

 1.  How-To Tips
 2. 
Technical Articles
 3. 
Gem Tips
 4. 
VBA
 5. 
Free Stuff
 6. 
Links
 7. 
Free Downloads
 8. 
Our Custom Microsoft Access Products
 9.
Forms
10. 
FAQ's
 

 

 

 

 

 

 

How To Clean Up Data

Part 1:  Web Pages And Text File Data Sources

By D. C. Conlin

Page 1 of 2

(Go to next page)

Database developers are often given data that needs to be "cleaned up" before it can be useful in a relational database.  This article is the beginning of a series of articles on how to quickly clean up "dirty" data plagued by common problems encountered while developing Microsoft Access databases. You may download the sample files, DataCleanup_Web.zip, for this example.

(Please note that to understand this article, a thorough knowledge of beginning SQL and VBA functions for string manipulation is required. Some intermediate level SQL will be helpful as well. If your experience with queries has been solely with the Microsoft Access QBE query grid, then you need to pick up a good book on SQL or take a beginning SQL course to help you understand the queries presented in this article.)

(Thousand Oaks, CA -- Feb. 1, 2005) -- When I began working as a database developer for a large manufacturer, I inherited a number of database applications that had been built by my predecessors, only one of whom was a database developer by vocation. The rest of my predecessors were experts at industrial manufacturing processes, tool making, shipping and receiving, and managing and organizing people, but were not experts at organizing relational data. As a result, I had a full-time job fixing almost-functioning applications and redesigning database applications so that they could be enhanced when further needs were identified for the applications. If there was time left over, I could design and create new database applications to help automate the data gathering and reporting processes that many departments had been tracking with spreadsheets.

At one point, a manager approached me and requested that a few simple reports be added to one Access database application that his department relied upon to track certain manufacturing processes of custom parts for his department's monthly reports. He was particularly concerned with being able to see three items for the current month:

    1.)  Whether any manufactured parts didn't conform to required specifications.

    2.)  Which employee was responsible for repairing the part and returning it to the production line (or starting over with new materials if the part couldn't be fixed).

    3.)  How long the part had been waiting to be fixed and returned to manufacturing.

I suggested adding functionality to the reports that would enable the manager to view any month at a glance, not just the current month, so that current productivity could be compared with past productivity and to help identify any trends that may be affecting the productivity of the manufacturing processes.

Of course, this database application had been in use for two and a half years before I arrived, so the technicians had been entering this data all along.  If they had known how to use the built-in filter capabilities of Access to filter the records in the data entry forms, they could easily have retrieved the part numbers of any items that were awaiting repairs and then follow up on the information that the manager was requesting in the reports. It would be easy to produce these reports, including the one that showed this data with the time difference between discovery of a "non-conformance" issue and the repair, as well as any parts that were still awaiting repair during any given time period.

Or so I thought.

After examining the database and the application, I discovered that it was difficult to determine exactly which parts didn't meet specifications because information on manufactured parts was stored in multiple tables, depending upon which process was used in manufacturing the part.  Part numbers were not unique across tables and not unique within each table.  Each record in these tables was identified by an AutoNumber primary key starting with 1.  The non-conformance information was saved in a separate table, and each record in this table cross-referenced the affected part in a multivalued field.  This multivalued field was an acronym for the relevant table name along with the AutoNumber primary key in that table.

Users trying to trace which table contained relevant information about the part had to guess which form to open to view the record.  Even when users guessed correctly and opened the right form, they often filtered by the part number instead of the primary key, thereby identifying the first record for that part number but not necessarily the record for the part with the non-conformance issue.  This led to confusion, data integrity issues and the practice of opening the tables directly to "fix" the data afterwards.

It was also difficult to determine which employees were responsible for fixing any of the "non-conformance" issues because database users had been allowed to type other employees' names into text boxes in the application.  As a result of two and a half years of such entries, there was an average of five different ways to spell each employee's name.  Whenever an employee attempted to check whether any parts had been assigned to him but only filtered on the correct spelling of his name, the filter left out manufactured parts that had been assigned to that employee under all of the other versions of the employee's name. This allowed the opportunity for non-conformance items to sit on the shelf with the responsible employee unaware that the parts needed his immediate attention.

After redesigning the table structures, redesigning the application to accommodate the new table structures, cleaning up the data, and replacing the text boxes with combo boxes listing employees' names for users to select (thereby preventing the users from mistyping any employee's name into the forms), the requested reports were created.  These reports clearly revealed the fact that a number of manufactured parts had been sitting on the shelf awaiting repairs for more than six months, most of which had only been fixed in the two days preceding the manager's request for these reports to track the non-conformance issues.

 

Consequences

No wonder the manager had wanted these reports!  He needed to know whether or not any other parts had slipped through the cracks. This was a major concern, because the company was in the middle of laying off about 80% of the employees at the facility. As each group of employees left, there was a danger that job responsibilities might not be passed on to whomever was still employed.

Further investigation revealed that these employees, even though they knew they were about to lose their jobs, were true professionals.  Every task that needed to be completed was either finished or passed on to the remaining employees.  The manufactured parts with non-conformance issues that had been left idle for months were actually parts that the customers had put on hold, due to pending cancellation or suspected future cancellation of contracts.  The parts that had been awaiting repairs for months and were suddenly repaired within the two days preceding the request for the reports were repaired because these tasks had just been approved by the customer, who had finally released funding after months in limbo.

Not one of these manufactured parts had slipped through the cracks nor waited on a shelf for an employee to discover that it needed to be fixed. Had the employees been less conscientious of their responsibilities in the chaotic environment of massive layoffs, the outcome certainly could have been otherwise.

It's a good thing that these employees were so diligent about their duties and we had an opportunity to clean up the data and redesign the database to produce these accurate reports, because this database was later scrutinized in a fashion that few database developers ever get a chance to experience.  (There's nothing like having an authority even higher than your boss's boss's boss's boss coming behind you and checking your work.)  This database application, along with many other databases, files, documents, and E-mails were part of one of the most extensive accident investigations in recent history. You see, many of the parts tracked by this database had been installed on NASA's Space Shuttle Columbia.  STS-107 disintegrated during re-entry into the Earth's atmosphere on February 1, 2003, the morning that the world lost seven heroes, Rick Husband, Willie McCool, Michael Anderson, David Brown, Kalpana Chawla, Laurel Clark, and Ilan Ramon, and a nation lost a national treasure, the Columbia. The investigation revealed, among many things, that the parts and equipment installed on the Space Shuttle Columbia did not contribute to the accident in any way.

Make sure that the data in every database you work on is cleaned up well enough to prove whatever the information the database application was designed to track.  You never know when it's going to count -- or come back to haunt you and your organization.

 

Gathering "dirty" data

Data sources can be gathered from many internal and external organizations.  Each organization has different standards for entering data. Ideas about what data is important differ widely from one organization to the next. The database developer is required to make sense out of the disparate data, clean it up, then store it properly so that reports and forms show accurate, meaningful information. Tools to clean up the data are provided by data warehousing software that will extract, transform, and load (ETL) the data into the data warehouse, but no such tools are available for the smaller databases on the market. Nonetheless, with observation, forethought and a little creativity, Access can extract, transform, and load the raw data into a table where the cleaned up data can be used to produce accurate reports.

For this article, we'll examine an easily obtained example of "dirty" data that has many of the common problems that database developers encounter and need to fix quickly and easily.  Suppose we wanted to know how many Microsoft Access MVP's (Most Valued Professionals) were based in Germany.  Where would we find such information?  Microsoft keeps a current list on the Internet:

http://mvp.support.microsoft.com/default.aspx?scid=fh;EN-US;mvpaward&style= TOC

Follow the link and scroll down the Web page until you see the Microsoft Office Access section.  Notice that the names are alphabetized by first name, not organized by country.  To determine how many MVP's are from Germany, we'd have to manually count them. Fortunately, this doesn't take much effort.  But what if we wanted to know which country boasted the most MVP's and the names of these MVP's?  To start with, a count of MVP's from each country would be necessary in order to compare the countries. Suddenly this becomes a more tedious task.  It's only a matter of names and countries -- which most will admit isn't at all complex -- but it won't be a simple task to assemble the information we want from this particular data source.

This scenario often happens when data is stored in a spreadsheet for a single purpose, but it's discovered that the data would be much more valuable if the data were presented in a different layout or additional associated data were added to the original layout. While this can be difficult or even impossible to accomplish with a spreadsheet, a relational database such as Microsoft Access provides the tools to store and quickly and easily retrieve and organize the data in a much more flexible manner.

For this example, copy the Access MVP names and their countries, and then paste this data into Notepad.  Save the file as MVPList.txt.

 

Examine the data for patterns and look for discrepancies that don't fit the patterns

Take a close look at the data you've saved.  Do you see any problems with separating the individual names from the countries?  We could use the space character as a delimiter between fields, but most of these fields contain people's names, and handling names can be problematic.  Some names are Spanish and have multiple surnames, and some are Asian, which commonly put the surname before the given name, the reverse of Western cultures.  In the latter case, the surnames may have already been reversed before being placed on the Web page. (For our example, we'll assume that the Asian surnames have been placed after the given name.) In addition, some MVP's have middle names and some do not.

Which column should be identified as the "Countries" column if we import the data as-is into a table?  Sometimes the country is in the third column, other times the fourth or fifth column -- and even the sixth column in one instance. We could import the records and use a query to identify the last field as the name of the country, but that leaves New Zealand without the word "New" in the "Country" column in several records.

A few of the names include nicknames.  Do we include these or drop them? And what about those question marks and parentheses in the last record?  Is that mistranslated from Unicode or someone's marks that it was a guess when the name was typed?  Or is it an Anglicized version of the name above it?  If you are familiar with some of the MVP's names, you'll notice that there are some repeats in a few of the names, which need to be removed.

Clearly this data needs some work before it will be usable in a relational database for accurate reports.  Decisions need to be made on how to ultimately store the data in the database.  This is where domain experts are invaluable.  Domain experts are people who handle the data, often daily, and know how the data will be used in the organization.  Domain experts can look at the raw data and often explain the anomolies we found in our inspection, although not always. For example, a domain expert can make a decision on whether the last record is merely another version of the name in the record before it, or what corrections need to be made in the last record.

We can easily import the data into a database with the Import Text Wizard by using the space as a delimiter, then manipulate each record from there.  Or can we?  If you try this method you'll find that the 6th field isn't imported, which cuts off data before the end of one of the records. This often happens when importing data from a non-database, because the data doesn't always fit into neat little boxes. One solution is to start over and manipulate the choices in the Import Text Wizard until we get it right, then build queries to parse the data into separate fields in order to clean up and normalize the data.  Another approach, which is often faster, is to just create a new "generic" table manually and import the data from the data file with a query, then build the queries to parse the data into separate fields. This approach has the benefit of retrieving all of the data in each record without cutting any of it off arbitrarily.

 

Import the data

First, create a new table. Add the ID field, which will be an AutoNumber and the primary key of the table.  Next, add the TextFld field, which will be Text and hold 255 characters.  Save the table as tblImport and close it.

Since we have a fairly simple data file which has no headers for the field names, and we have no schema.ini file to govern how the data file is formatted when being inserted into the table, we can either take the time to rectify this or we can build a query that manages these things for us.  We'll do the latter for this example.

Create a new query.  In the SQL View pane, paste the following SQL statement, but change the directory listed in the FROM clause (C:\Test\) to the directory where you saved your MVPList.txt file:

  INSERT INTO tblImport ( TextFld )
  SELECT F1 AS TextFld
  FROM [TEXT;HDR=NO;DATABASE=C:\Test\].MVPList.txt;

In this SQL statement, the F1 is a placeholder for the field name, or header, of the first field in the file. Since we want this to be inserted into the TextFld field in the table, we'll alias it with the TextFld name in the query so that the field names match. The FROM clause may seem curious, since it's not a table name. Nonetheless, Jet will recognize it as a text file data source when using the above syntax.  Jet also recognizes the directory where the file is located, the name of the file, and the lack of field names in the first row of the file.

Next, save the new query.  Run this query to import the data into the tblImport table.

(Go to next page)

 

Copyright © 2005 - 2006 Q-Built Solutions. All rights reserved.

 Top

Visitors since 10 Mar. '05: 

 

[MS Access] [Free Stuff] [Articles] [Gem Tips] [How-To Tips] [Links] [Products] [Scorecard] [About Us] [Search]

Sign up for PayPal and start accepting credit card payments instantly.