Recently, for a project I'm working on, we had to find a way to decide whether a data record had changed since the last time we saved it to the database. For this project we receive data from a number of customer databases periodically (every 6 hours) and need to do some ETL work on it before storing it in an Oracle database. We all know that reading a record back from Oracle can be slow (relatively speaking), so we wanted something much faster. Since there isn't a reasonable way to determine from the source databases themselves which records have changed, we are forced to dump entire tables for processing. To save time, we wanted to do the minimum amount of processing necessary for records that haven't changed since we last wrote them to the database. Typically, for most of these tables, only a very small number of rows have actually changed.

The solution we came up with was to calculate a hash value from the concatenated column values of each record when we write that record to Oracle. These hash values are then stored in a Redis hash keyed by the primary key of the table. When we receive a table dump from the remote database, we calculate the same hash on each incoming record and compare it to the hash value we have stored in Redis. If they match, we know the record hasn't changed and we don't need to process it. The whole thing is very fast and saves a great deal of unnecessary processing downstream. So far Redis has proven up to the task of quickly storing and retrieving millions of key/hash values.

Here's what it looks like. Let's say we have a customer record in our database with id, name, address, city and state, like this:

id: 1000, name: Joe Smith, address: 123 Main St., city: New York, state: NY

The first time we see this record and check Redis for the key 1000, we get nil back, so we store the record in the database, calculate the hash of the value "1000:Joe Smith:123 Main St.:New York:NY" to get something like "0a243fec5ab" (we are using 128-bit hash values in production), and then store that hash value in Redis with set "1000" "0a243fec5ab". The next time we get this record we calculate the same hash; if the record is unchanged the hashes match and we know we don't need to process it again. If the hash values differ, we know the record has changed and needs to go through the rest of the processing chain. A rough code sketch of this check appears at the end of the post.

We are currently using Storm to process millions of records at a time, and this hash check sits very early in the pipeline so that unchanged records don't consume resources later on. We chose the 128-bit Murmur3 hash included in the Google Guava library for its speed. Have you ever had a similar requirement? How did you solve it?
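
To make that concrete, here is a minimal sketch of the hash check in Java, assuming Guava's Murmur3 128-bit hash and the Jedis client for Redis. The class and method names (ChangeDetector, recordHash, hasChanged) are just for illustration, and the key layout simply mirrors the set "1000" "0a243fec5ab" example above rather than our production schema.

```java
import com.google.common.base.Joiner;
import com.google.common.hash.Hashing;
import redis.clients.jedis.Jedis;

import java.nio.charset.StandardCharsets;

public class ChangeDetector {

    private final Jedis redis;

    public ChangeDetector(Jedis redis) {
        this.redis = redis;
    }

    // Concatenate the column values with ":" and hash them with Murmur3 128-bit.
    static String recordHash(String... columns) {
        String concatenated = Joiner.on(':').join(columns);
        return Hashing.murmur3_128()
                      .hashString(concatenated, StandardCharsets.UTF_8)
                      .toString();   // hex form of the 128-bit hash
    }

    // Returns true if the record is new or has changed since we last saw it,
    // and remembers the latest hash in Redis when that is the case.
    boolean hasChanged(String primaryKey, String... columns) {
        String incomingHash = recordHash(columns);
        String storedHash = redis.get(primaryKey);   // nil -> null for an unseen key
        if (incomingHash.equals(storedHash)) {
            return false;                            // unchanged, skip further processing
        }
        redis.set(primaryKey, incomingHash);
        return true;
    }

    public static void main(String[] args) {
        ChangeDetector detector = new ChangeDetector(new Jedis("localhost"));
        boolean changed = detector.hasChanged(
                "1000", "1000", "Joe Smith", "123 Main St.", "New York", "NY");
        System.out.println(changed ? "process record" : "skip record");
    }
}
```

Writing the record to Oracle and the rest of the Storm pipeline are left out here; the only point of the sketch is the cheap hash-and-compare step that lets us drop unchanged records early.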

Published 16 May 2012