Update operation in Hive (pseudo)

As any Hadoop systems, Hive is a read-only system it only allows insert and insert overwrite the entire table or partitions. Our backend system is MS SQL we use Hive to mash the data and move the data to Elasticsearch and ES serves the analytics requests. But when rows are updated in SQL we have to pass the update all the way to ES, this means we have to update the hive data. This post describes how we achieved pseudo updates in Hive. To start with the SQL tables for example is

Id  Value
1     "hi"
2     "hellp"

We sqooped the data into Hive tables. And now we updated the hellp to hello in SQL. To update the same in Hive we have to delete the hive table/partition and re-insert the data in hive.

To solve this problem we started by adding a modified date column to the SQL table which is updated by a SQL trigger. The incremental sqoop job appends the modified rows to the Hive table. Now the hive table has both hello and hellp. Hello with a more recent modified date.

We started by group the rows by id and sorting them by modified date

select id, value, modifieddatetime
   from table1
      distribute by id 
      sort by id, modifieddatetime desc

Next, we start by ranking all the rows using a Hive UDF.

select
  id, value, modifieddatetime,
   rank(id) krank
from(
select id, value, modifieddatetime
   from table
      distribute by id
      sort by id, modifieddatetime desc) t1)

The Rank UDF can be downloaded from Github. To add the UDF permanently to Hive follow this post. And we select the most updated row by using where rank = 1

select id
from
(select
  id, value, modifieddatetime,
   rank(id) krank
from(
select id, value, modifieddatetime
   from table
      distribute by id
      sort by id, modifieddatetime desc) t1) t2 
where t2.krank = 1 group by id;

To insert the data into Elasticsearch, data is inserted into Hive external tables with elasticsearch storage handler. And the updated data is passed along to ES. We could use the same to save the data to a new partition or run MR jobs.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s