Q-Built Solutions Makes Quick, Quality-Built Software For Your Business Needs!
 

Data Cleanup

Place your ad here! This domain gets 400 visitors per day and has Google Page Rank of 3 or 4.  Get great ad rates!

Place your ad here! This domain gets 400 visitors per day and has Google Page Rank of 3 or 4.  Get great ad rates!


 
Please help us write more free software and tips for this site by donating.
Please donate today!


Q-Built Solutions Web Statistics
.

 
Statistics As Of:
26 Mar. '07

Number of unique visitors since 20 March '04:
204,793

Number of Web pages served since 20 March '04:
429,683

Our Most Popular Web Pages:

 1.  How-To Tips
 2. 
Technical Articles
 3. 
Gem Tips
 4. 
VBA
 5. 
Free Stuff
 6. 
Links
 7. 
Free Downloads
 8. 
Our Custom Microsoft Access Products
 9.
Forms
10. 
FAQ's
 

 

 

 

 

 

 

How To Clean Up Data

Part 1:  Web Pages And Text File Data Sources, Continued

By D. C. Conlin

Page 2 of 2

(Go to previous page)

Prepare for data transformation

Several steps need to be taken in order to facilitate the transformation of this raw data into "clean" data.  We'll need to fix the discrepancies we found earlier when examining the raw data.
 

  Identify multi-worded values that need to stay together as one field

First, we'll modify all multi-worded countries into a single word so that these words stay together during the transformation process. We'll do so by inserting an underscore between words temporarily, then remove the underscore at the end of our data cleanup. In our example, New Zealand is the only multi-worded country, so we only need to be concerned with one multi-worded value in the query.

Create a new query and paste the following SQL statement into it:

  UPDATE tblImport
  SET TextFld = Replace(TextFld,"New Zealand","New_Zealand",1,1)
  WHERE (Instr(TextFld,"New Zealand") > 0);

Save the query, then run it.
 

  Identify multi-worded values that need to be placed in more than one field

We also need to differentiate the initials of Mr. Scofield so that the first initial will be saved in the first name field and the second initial will be saved in the middle name field.  We can do this with a query that adds a space after the period after the first initial.

Create a new query and paste the following SQL statement into it:

  UPDATE tblImport
  SET TextFld = Replace(TextFld, ".", ". ", 1, 1)
  WHERE (Mid(TextFld, (Instr(TextFld, ".") + 1), 1) <> " ");

Save the query, then run it.
 

  Identify oddballs

Oddball records contain data that is unique or difficult to use an algorithm to find and make appropriate corrections.

Remove excess characters -- We'll need to remove the question marks and parentheses in the last record so that we can parse it correctly.  We don't know whether the last name is Long or Zhang, so we'll let the domain experts decide.  For now, we'll remove the troublesome pieces and flag this record so that it can be easily identified later by the domain experts who can make a correction. We can create multiple update queries to remove the question marks or we can use a Replace( ) function recursively. We'll use the latter.

Create a new query and paste the following SQL statement into it:

  UPDATE tblImport
  SET TextFld = Replace(Replace(Replace(TextFld, "?", "", 1, -1),
     " (", "", 1, 1), ") ", "", 1, 1)
  WHERE (Instr(TextFld, "?") > 0);

Save the query, then run it.

Find duplicates -- Part of Mr. Andrada's name is repeated and the initials of Mr. Afan de Ribera are repeated. Fortunately, there are only two records like this and we can manually delete one twin of each repeat and the space delimiter after it.

Miscellaneous -- A few of the people have nicknames, but these nicknames are placed inconsistently within the data. Mr. Scofield has his nickname after his first two initials.  Mr. Andrada has his nickname after his last name, and it's in brackets. For our example, we'll let the domain experts decide on how to handle the nicknames, so we won't concern ourselves here. But they'll need to decide how to handle the nicknames, such as whether they will be stored in a new field in the table or dropped entirely.

 

Transform the data

To transform the data, we'll create a new table and place first names in the FirstName field, middle names in the MiddleName Field, last names in the LastName field, and countries in the Country field. We'll do this in two steps:

    1.)  Create an interim query that marks the delimiters (spaces in our example) between data to differentiate each field we want to separate, organize the simplest records with only three fields in their respective fields (first name, last name, and country), then place the data that needs further transformation into its own field.

    2.)  Create a Make-Table Query based on this interim query that fixes the rest of the data that needs separation into the appropriate fields and replaces the modified "New_Zealand" with the original "New Zealand."
     

  Create the interim query

Mark the delimiter between fields -- If we were using pure VBA, we could identify where each space delimiter was located in the string and save each of these values in separate variables to be used in calculating where in the string each new field begins. In Jet SQL, we can create calculated fields, then use these calculated fields in calculations for other fields, just as if they were variables.  Consider the following SQL statement:

  SELECT InStr(1,TextFld," ") AS Mk1,
   Mid(TextFld,1,(Mk1-1)) AS FName
  FROM tblImport;

To determine the first name of each record, we need to take each character from the beginning of the string to the character prior to the first space. By using the VBA function InStr( ), we can locate the first occurrence of a space in each record. By using this calculated field (aliased as Mk1 for the first mark we are interested in), we can substitute the values passed to the VBA function Mid( ) to calculate which part of the string to use in determining the first name.

To determine the last name of each record, we need to take each character from the first character after the first space in the string to the character prior to the second space.  Consider the following SQL statement:

  SELECT InStr(1,TextFld," ") AS Mk1,
   InStr((Mk1+1),TextFld," ") AS Mk2,
   Mid(TextFld,(Mk1+1),(Mk2-Mk1-1)) AS LName
  FROM tblImport;

This SQL statement calculates where the first and second space delimiters are located in the string, then uses these marks to pass to the Mid( ) function to determine which characters should be used for the last name field, which starts with the first character after Mk1.

We can also determine two other space delimiters that are important for parsing this data: the third space delimiter in the string, which denotes the end of the "simple" type of record (first name, last name, and country), and the space delimiter immediately before the last field in each record, which is the delimiter before the name of the country.  We can do this with the following SQL statement:

  SELECT InStr(1,TextFld," ") AS Mk1,
   InStr((Mk1+1),TextFld," ") AS Mk2,
   Instr((Mk2+1),TextFld," ") AS Mk3,
   InStrRev(TextFld," ",-1) AS Mk4
  FROM tblImport;

Identify the field that needs additional fixes -- We need to fix the records that are not the "simple" type, where there are more than three fields with data (first name, last name, and country).  This third named field will contain the country name if the record is the "simple" type. Otherwise the rest of the string, which may contain multiple last names and the country, will be placed in the third named field. We can easily identify the records which are not the "simple" type, because Mk3 will not be equal to zero, since at least one additional space will be found after the Mk2 mark.  We can place the rest of the string in the FixThis field with the following SQL statement:

  SELECT InStr(1,TextFld," ") AS Mk1,
   InStr((Mk1+1),TextFld," ") AS Mk2,
   Mid(TextFld,(Mk2+1)) AS FixThis
  FROM tblImport;

Put the components together in one SQL statement -- We can combine all of the above SQL statements to include the FName, LName, FixThis named fields, each of the four important deliminator marks, and the original ID and TextFld fields in the following SQL statement:

  SELECT ID, TextFld, InStr(1,TextFld," ") AS Mk1,
   Mid(TextFld,1,(Mk1-1)) AS FName,
   InStr((Mk1+1),TextFld," ") AS Mk2,
   Mid(TextFld,(Mk1+1),(Mk2-Mk1-1)) AS LName,
   Instr((Mk2+1),TextFld," ") AS Mk3,
   Mid(TextFld,(Mk2+1)) AS FixThis,
   InStrRev(TextFld," ",-1) AS Mk4
  FROM tblImport;

Create a new query and paste the above SQL statement into it.  Save this query as qryParseTblImport.
 

  Create the Make-Table Query

Next, we need to create the Make-Table Query based upon this interim query, which will become the table where the cleaned up data is stored.

Identify the first name -- The first field of every record in our interim query can be placed into the FirstName field of the new table without any intervention.

Identify the middle name -- For those records with middle names or middle initials, these need to be moved to their own field, the MiddleName field.  We can determine which records have a middle name or initial, because there will be a space delimiter within the FixThis field between the last name and the country.  We can do this with the following SQL statement:

  SELECT IIf((InStr(1, FixThis, " ") <> 0), LName, Null) AS MiddleName
  FROM qryParseTblImport;

Identify the last name -- For the "simple" type of record, the LName field already has the last name identified.  We can determine which records have the last names in the LName field, because there won't be a space delimiter within the FixThis field between the last name and the country.

However, the other records have the last names in the FixThis field, along with the country. We can determine which records have the last names in the FixThis field, because there will be a space delimiter within the FixThis field between the last name and the country. We can place the last name of either of the two types of records into the LastName field with the following SQL statement:

  SELECT IIf((InStr(1, FixThis, " ") = 0), LName, Left(FixThis, (InStrRev(FixThis," ",-1)-1))) AS LastName
  FROM qryParseTblImport;

Identify the country -- The country is easy to determine because we marked the last space delimiter as Mk4, so every character after that in the TextFld field is the name of the country. We also need to replace the modified "New_Zealand" with the original. We can do both of these tasks with the following SQL statement:

  SELECT Replace(Mid(TextFld, (Mk4 + 1)),"New_Zealand","New Zealand",1,1) AS Country
  FROM qryParseTblImport;

Put the components together in one SQL statement -- We can combine the above SQL statements to create the Make-Table query to insert the data into a new table:

  SELECT ID, FName AS FirstName,
   IIf((InStr(1, FixThis, " ") <> 0),LName,Null) AS MiddleName,
   IIf((InStr(1, FixThis, " ") = 0), LName, Left(FixThis, (InStrRev(FixThis, " ", -1)-1))) AS LastName,
   Replace(Mid(TextFld, (Mk4 + 1)), "New_Zealand", "New Zealand",1,1) AS Country
  INTO tblMVPs
  FROM qryParseTblImport;

Create a new query and paste the above SQL statement into it.  Save this query and run it to create the new table, tblMVPs.  Open the tblMVPs table in Design View and assign the ID field as the primary key. Right-click the title bar of the Design View to open the Table Properties dialog window.  Change the Subdatasheet Name to [None]. Save the table and close the Table Properties dialog window and change to the table's Datasheet View window. You'll notice that each record is neatly parsed into appropriate fields in the tblMVPs table.

Fix the Spanish surnames -- We still need to move the Spanish surnames located in the MiddleName field into the LastName field, but leave the middle initials in the MiddleName field.  We can do this with the following SQL statement:

  UPDATE tblMVPs
  SET LastName = MiddleName & " " & LastName
  WHERE ((Country = "Spain") AND ((Len(MiddleName) > 2) OR (MiddleName = "MĒ")));

Create a new query and paste the above SQL statement into it.  Save this query, then run it.

Now that we've moved the first surname to the LastName field in these records, we need to remove the values in the MiddleName field for each of these records.  Create a new query and paste the following SQL statement into it.

  UPDATE tblMVPs
  SET MiddleName = NULL
  WHERE ((Country = "Spain") AND ((Len(MiddleName) > 2) OR (MiddleName = "MĒ")));

Save this query, then run it.  This completes the cleanup of the records with Spanish surnames.

Other fixes TBD -- We still need to consult our domain experts to confirm how several of the names should be stored, including the issue with the nicknames. However, we can still use the data available to answer our questions:  which country boasts the most Access MVP's and what are the names of these MVP's?

 

Create the queries to answer the questions

We can use several methods to determine the country with the highest number of Access MVP's and the names of these MVP's. We'll use one of the simplest methods for this example.  We can determine the country with the highest number of MVP's with the aggregate function, COUNT( ) in a Top-N query.  We need to ensure that the sort is descending, rather than the default ascending, so that we select the highest number, not the lowest number.  The ORDER BY clause can use either the expression used to calculate a field or the ordinal column number to identify the complete expression to sort on. For this example, we'll use the ordinal column number, because it requires less typing and may show a few peole who have never seen it how to use it.

Create a new query and paste the following SQL statement into it:

  SELECT TOP 1 COUNT(Country) AS NumMVPs, Country
  FROM tblMVPs
  GROUP BY Country
  ORDER BY 1 DESC;

Save this query as qryHighestNumMVPs, then run it.

Notice that this query only returns one record.  It shows that USA has 24 Access MVP's, the highest number of any country.  We can use this query in a subquery in the WHERE clause of another query to determine the names of all of these Access MVP's, sorted by last name, first name, and middle name.

Create a new query and paste the following SQL statement into it:

  SELECT FirstName, MiddleName, LastName, Country
  FROM tblMVPs
  WHERE (Country =
     (SELECT Country
       FROM qryHighestNumMVPs))
  ORDER BY LastName, FirstName, MiddleName;

Save the query, then run it. You should see the following records in your query:

FirstName

MiddleName

LastName

Country

Steve

 

Arbaugh

USA

Dev

 

Ashish

USA

F.

Scott

Barker

USA

Marshall

 

Barton

USA

Jeff

 

Boyce

USA

Steve

 

Clark

USA

Sandra

 

Daigle

USA

Victor

M.

Delgadillo

USA

Tom

 

Ellison

USA

Joseph

E.

Fallon

USA

Dirk

 

Goldgar

USA

Fred

 

Gutkind

USA

Duane

 

Hookom

USA

Larry

 

Linson

USA

Arvin

 

Meyer

USA

Wayne

 

Morgan

USA

Rebecca

M.

Riordan

USA

M.

L.

Sco Scofield

USA

Kenneth

D.

Snell

USA

John

 

Spencer

USA

Bruce

M.

Thompson

USA

Lynn

 

Trapp

USA

John

L.

Viescas

USA

John

Wilmot

Vinson

USA

(As you can see, Mr. Scofield's nickname still needs to be handled after a decision is made on what to do with the nicknames in the table.)

 

Summary

As we've seen, it's fairly easy to come across data that can answer our questions. However, if it remains in its original format, that data may not be as useful as it could be.  Importing that data into a relational database, creating queries to manipulate the raw data into a normalized structure, and making necessary corrections to ensure the data is accurate can increase the value of the data immensely.

I've demonstrated some techniques that can be used to clean up several of the problems commonly found with data gathered from myriad sources, including Web pages and plain text files.  These techniques can help you the next time you find a need to import troublesome data (by preventing data truncation during import), keep multi-worded values together during the transformation into "clean" data, split multivalues into separate fields, and simplify the SELECT clauses that use accumulative string manipulation functions for each subsequent field.

For more articles on data cleanup, please see the Articles table of contents.

(Go to previous page)

 

You can read about your favorite expert contributors.

 

Copyright © 2005 - 2006 Q-Built Solutions. All rights reserved.

 Top

Visitors since 10 Mar. '05: 

 

[MS Access] [Free Stuff] [Articles] [Gem Tips] [How-To Tips] [Links] [Products] [Scorecard] [About Us] [Search]

Sign up for PayPal and start accepting credit card payments instantly.