Like most Hadoop systems, Hive is effectively append-only: it supports INSERT and INSERT OVERWRITE on an entire table or partition, but not row-level updates. Our backend system is MS SQL Server; we use Hive to mash up the data and move it to Elasticsearch, and ES serves the analytics requests. But when rows are updated in SQL Server we have to propagate the update all the way to ES, which means we have to update the Hive data. This post describes how we achieved pseudo updates in Hive. To start with, the SQL table looks like this:
Id | Value
---+--------
1  | "hi"
2  | "hellp"
We sqooped the data into a Hive table. Now suppose we update "hellp" to "hello" in SQL Server. To make the same change in Hive, we would have to delete the Hive table or partition and re-insert the data.
To solve this problem we started by adding a modified-date column to the SQL table, which is kept up to date by a SQL trigger. An incremental Sqoop job appends the modified rows to the Hive table. The Hive table now contains both "hellp" and "hello", with "hello" carrying the more recent modified date.
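As a sketch, the SQL Server side can look like the following. The table and column names (table1, Id, modifieddatetime) are illustrative, not the actual schema:

```sql
-- Illustrative T-SQL: add a last-modified column and keep it current
-- on every update. Names here are assumptions for the example.
ALTER TABLE table1 ADD modifieddatetime DATETIME NOT NULL DEFAULT GETDATE();
GO

CREATE TRIGGER trg_table1_modified ON table1
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET t.modifieddatetime = GETDATE()
    FROM table1 t
    INNER JOIN inserted i ON t.Id = i.Id;
END;
```

The incremental Sqoop job can then run with `--incremental lastmodified --check-column modifieddatetime --last-value <timestamp of the previous run>` so that only rows changed since the last import are appended to the Hive table.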
We started by grouping the rows by id and sorting them by modified date:
select id, value, modifieddatetime
from table1
distribute by id
sort by id, modifieddatetime desc
Next, we rank the rows within each id group using a Hive UDF. (rank here is a custom UDF that returns a row's position within its id group; built-in windowing functions such as rank() over (...) only arrived in Hive 0.11.)
select id, value, modifieddatetime, rank(id) krank
from (
  select id, value, modifieddatetime
  from table1
  distribute by id
  sort by id, modifieddatetime desc
) t1
Finally, we keep only the first-ranked row per id, i.e. the latest version of each row:

select t2.id, t2.value, t2.modifieddatetime
from (
  select id, value, modifieddatetime, rank(id) krank
  from (
    select id, value, modifieddatetime
    from table1
    distribute by id
    sort by id, modifieddatetime desc
  ) t1
) t2
where t2.krank = 1;
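The ranked query can also feed an INSERT OVERWRITE to materialize the deduplicated rows, for example into a dated partition. A sketch, where table1_latest and the ds partition column are hypothetical names:

```sql
-- Sketch: materialize only the latest row per id into a partition.
-- table1_latest and ds are illustrative, not from the original post.
INSERT OVERWRITE TABLE table1_latest PARTITION (ds = '2013-01-01')
select t2.id, t2.value, t2.modifieddatetime
from (
  select id, value, modifieddatetime, rank(id) krank
  from (
    select id, value, modifieddatetime
    from table1
    distribute by id
    sort by id, modifieddatetime desc
  ) t1
) t2
where t2.krank = 1;
```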
To push the data into Elasticsearch, the deduplicated rows are inserted into a Hive external table backed by the Elasticsearch storage handler, and the updated data flows through to ES. We could use the same query to save the data to a new partition or to feed other MR jobs.
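For reference, the ES-backed external table uses the elasticsearch-hadoop storage handler. A minimal sketch, assuming the elasticsearch-hadoop jar is on Hive's classpath and an index/type of analytics/table1 (illustrative; the exact handler class name depends on the es-hadoop version):

```sql
-- Requires the elasticsearch-hadoop jar on Hive's classpath.
-- Index/type and node address are illustrative assumptions.
CREATE EXTERNAL TABLE table1_es (
  id INT,
  value STRING,
  modifieddatetime TIMESTAMP
)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.resource' = 'analytics/table1',
  'es.nodes'    = 'localhost:9200'
);

-- Writing to the table indexes the rows as documents in Elasticsearch;
-- in practice the select would be the deduplication query from above.
INSERT OVERWRITE TABLE table1_es
select id, value, modifieddatetime from table1;
```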