需求
例如一张A表(id, userid, info);其中id为自增长主键。存在userid重复记录;并一直有新数据增加,现需要根据userid去重并实时插入表B。
用SQL语法来解释:
- -- 获取表B中记录最大id
- select max(id) from B
- -- 根据上面获取的id;提取最新的A表记录;根据userid去重;并插入中间表TMP_B;
- INSERT INTO TMP_B
- SELECT id, userid, info
- FROM (select id,
- userid,
- info,
- ROW_NUMBER() OVER(PARTITION BY userid ORDER BY id) RN
- FROM A
- WHERE id > ?)
- WHERE RN = 1;
- -- 把上面获取的记录;不存在则插入表B
- insert into B
- select id, userid, info
- from TMP_B
- where userid not in (select userid from B);
-
- -- 或者用not exists
- insert into B
- select id, userid, info
- from TMP_B tb
- where not exists (select 1 from B b where b.userid = tb.userid);
其中需要改写的语句是
- select id, userid, info
- from TMP_B
- where userid not in (select userid from B);
-- 插入实验数据
-- 在表B插入百万条记录
insert into B select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 1000000;
-- 在表TMP_B插入十万条记录
insert into TMP_B select level, dbms_random.string('x',8), 'lottu'||level from dual connect by level <= 100000;
通过执行计划;会发现not in/ not exists 效率相差不大。
- insert into B
- select id, userid, info from TMP_B tb
- where not exists (select 1 from B b where b.userid = tb.userid);
- 100000 rows created.
- Elapsed: 00:00:00.52
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 3462170537
- ------------------------------------------------------------------------------------------
- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)|Time |
- ------------------------------------------------------------------------------------------
- | 0 | INSERT STATEMENT | | 115K| 4389K| | 2994 (1)|00:00:36 |
- | 1 | LOAD TABLE CONVENTIONAL | B | | | | | |
- |* 2 | HASH JOIN ANTI | | 115K| 4389K| 4960K| 2994 (1)|00:00:36 |
- | 3 | TABLE ACCESS FULL | TMP_B | 115K| 3601K| | 137 (1)|00:00:02 |
- | 4 | TABLE ACCESS FULL | B | 1375K| 9403K| | 1372 (1)|00:00:17 |
- -----------------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 2 - access("B"."USERID"="TB"."USERID")
- Note
- -----
- - dynamic sampling used for this statement (level=2)
- Statistics
- ----------------------------------------------------------
- 0 recursive calls
- 3221 db block gets
- 6320 consistent gets
- 35 physical reads
- 3649980 redo size
- 845 bytes sent via SQL*Net to client
- 866 bytes received via SQL*Net from client
- 3 SQL*Net roundtrips to/from client
- 1 sorts (memory)
- 0 sorts (disk)
- 100000 rows processed
替换1: 通过merge into改写语句
- merge into B b
- using TMP_B tb
- on (tb.userid = b.userid)
- WHEN NOT MATCHED THEN
- INSERT (b.id, b.userid, b.info) values (tb.id, tb.userid, tb.info);
通过执行计划;该SQL会占用内存用于排序。效率会有大大的提升。 特别是在B数据量很大的情况。优势更明显。
- Elapsed: 00:00:00.18
- Execution Plan
- ----------------------------------------------------------
- Plan hash value: 2722554344
- --------------------------------------------------------------------------------------
-
- | Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time|
- --------------------------------------------------------------------------------------
- | 0 | MERGE STATEMENT | | 115K| 7203K| | 5298 (1)| 00:01:04 |
- | 1 | MERGE | B | | | | ||
- | 2 | VIEW | | | | | ||
- |* 3 | HASH JOIN OUTER | | 115K| 8553K| 4960K| 5298 (1)| 00:01:04 |
- | 4 | TABLE ACCESS FULL| TMP_B | 115K| 3601K| | 137 (1)| 00:00:02 |
- | 5 | TABLE ACCESS FULL| B | 1336K| 56M| | 1373 (1)| 00:00:17 |
- --------------------------------------------------------------------------------------
- Predicate Information (identified by operation id):
- ---------------------------------------------------
- 3 - access("TB"."USERID"="B"."USERID"(+))
- Note
- -----
- - dynamic sampling used for this statement (level=2)
- Statistics
- ----------------------------------------------------------
- 0 recursive calls
- 0 db block gets
- 5496 consistent gets
- 0 physical reads
- 0 redo size
- 844 bytes sent via SQL*Net to client
- 896 bytes received via SQL*Net from client
- 3 SQL*Net roundtrips to/from client
- 1 sorts (memory)
- 0 sorts (disk)
- 0 rows processed
替换2: 通过kettle工具,设置转换;通过shell命令;用cron或者其他调度系统调用。该操作相对用户是透明的;至于效率方面有待验证。

最后该需求若在PostgreSQL中;有更简洁的的写法.简单粗暴。
- INSERT INTO B
- SELECT *
- FROM A on conflict (userid) do nothing;