오라클 텍스트 유사도
DB CAFE
- DBA { Oracle DBA 명령어 > DBA 초급 과정 > DBA 고급 과정 }
- 튜닝 { 오라클 튜닝 목록 }
- 모델링 { 데이터 모델링 가이드 }
목차
1 UTL_MATCH (텍스트 유사성 검증)[편집]
- 두개의 컬럼값에 대한 텍스트 유사성을 보여주는 함수입니다.
- 불일치 : 0, 일치 : 100 그외 각 텍스트의 일치여부를 퍼센트(%)로 보여줍니다.
https://oracle-base.com/articles/11g/utl_match-string-matching-in-oracle
UTL_MATCH : String Matching by Testing Levels of Similarity/Difference
Use the UTL_MATCH package to determine the similarity between two strings.
oracle-base.com
UTL_MATCH : String Matching by Testing Levels of Similarity/Difference
The UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented (and therefore supported) in Oracle 11g Release 2. It contains a variety of functions that are helpful for testing the level of similarity/difference between strings.
- Setup
- EDIT_DISTANCE
- EDIT_DISTANCE_SIMILARITY
- JARO_WINKLER
- JARO_WINKLER_SIMILARITY
1.1 Setup[편집]
The examples in this article require the following table definition.
DROP TABLE match_tab;
CREATE TABLE match_tab ( id NUMBER, col1 VARCHAR2(15), col2 VARCHAR2(15)
, CONSTRAINT match_tab_pk
PRIMARY KEY (id) );
INSERT INTO match_tab VALUES (1, 'Peter Parker', 'Pete Parker');
INSERT INTO match_tab VALUES (2, 'Peter Parker', 'peter parker');
INSERT INTO match_tab VALUES (3, 'Clark Kent', 'Claire Kent');
INSERT INTO match_tab VALUES (4, 'Wonder Woman', 'Ponder Woman');
INSERT INTO match_tab VALUES (5, 'Superman', 'Superman');
INSERT INTO match_tab VALUES (6, 'The Hulk', 'Iron Man'); COMMIT;
1.2 EDIT_DISTANCE[편집]
The "Edit Distance", or "Levenshtein Distance", test measures the similarity between two strings by counting the number of character changes (inserts, updates, deletes) required to transform the first string into the second. The number of changes required is know as the distance.
SELECT col1, col2
, UTL_MATCH.edit_distance(col1, col2) AS ed FROM match_tab ORDER BY id;
1.3 EDIT_DISTANCE_SIMILARITY[편집]
The EDIT_DISTANCE_SIMILARITY function uses the same method as the EDIT_DISTANCE function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).
SELECT col1, col2, UTL_MATCH.edit_distance_similarity(col1, col2) AS eds
FROM match_tab
ORDER BY id;
SELECT id, col1, col2, UTL_MATCH.edit_distance_similarity(col1, col2) AS eds
FROM match_tab
WHERE UTL_MATCH.edit_distance_similarity(col1, col2) > 90
ORDER BY id;
The "Jaro-Winkler Algorithm" provides a different method for finding the distance between two strings.
SELECT col1, col2, UTL_MATCH.jaro_winkler(col1, col2) AS jw
FROM match_tab
ORDER BY id;
1.4 JARO_WINKLER_SIMILARITY[편집]
The JARO_WINKLER_SIMILARITY function uses the same method as the JARO_WINKLER function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).
SELECT col1
, col2
, UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
FROM match_tab
ORDER BY id;
SELECT col1
, col2
, UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
FROM match_tab
WHERE UTL_MATCH.jaro_winkler_similarity(col1, col2) > 90
ORDER BY id;
- For more information see:
UTL_MATCH : String Matching
UTL_MATCH
Levenshtein distance
Jaro–Winkler distance
1.5 유사도 활용[편집]
SELECT *
FROM CYKIM.VD_VENDOR_HOTEL_MASTER;
-- 테스트 마스터 테이블 생성
CREATE TABLE CYKIM.XX_HOTEL_MASTER AS
SELECT T.HOTEL_CODE
,T.VENDOR_HOTEL_NAME_EN AS HOTEL_NAME_EN
,T.VENDOR_HOTEL_NAME_KO AS HOTEL_NAME_KO
,T.PHONE_NO
,T.ZIP_CODE
,T.ADDRESS_EN
,T.ADDRESS_KO
,T.LATITUDE
,T.LONGITUDE
,T.DELETE_YN
FROM CYKIM.VD_VENDOR_HOTEL_MASTER T
WHERE T.VENDOR_COMP_CODE = '910001'
AND T.VENDOR_HOTEL_CODE = T.HOTEL_CODE;
-- 마스터 테이블의 주소에 대한 토큰 테이블 생성 및 인덱싱
CREATE TABLE CYKIM.XX_HOTEL_ADDRESS_TOKENS AS
SELECT HOTEL_CODE
,LVL
,TOKEN
FROM (SELECT S.LVL
,T.HOTEL_CODE
,T.ADDRESS_EN
,REGEXP_SUBSTR(T.ADDRESS_EN, '[^ ]+', 1, S.LVL) AS TOKEN
FROM CYKIM.XX_HOTEL_MASTER T
INNER JOIN (SELECT LEVEL AS LVL
FROM DUAL
CONNECT BY LEVEL < 20) S ON 1 = 1
ORDER BY T.HOTEL_CODE
,S.LVL) X
WHERE X.TOKEN IS NOT NULL;
-- 테스트 - API 로부터 받은 데이터
CREATE TABLE CYKIM.XX_HOTEL_FROM_API AS
SELECT T.VENDOR_COMP_CODE
,T.VENDOR_HOTEL_CODE
,T.VENDOR_HOTEL_NAME_EN AS HOTEL_NAME_EN
,T.VENDOR_HOTEL_NAME_KO AS HOTEL_NAME_KO
,T.PHONE_NO
,T.ZIP_CODE
,T.ADDRESS_EN
,T.ADDRESS_KO
,T.LATITUDE
,T.LONGITUDE
,T.DELETE_YN
,T.HOTEL_CODE
FROM CYKIM.VD_VENDOR_HOTEL_MASTER T
WHERE T.VENDOR_COMP_CODE <> '910001';
-- YKIM.XX_HOTEL_FROM_API 테이블에 HOTEL_CODE_NEW 컬럼 추가함
-- HOTEL_CODE_NEW 컬럼 업데이트 작업.
DECLARE
CURSOR C_MAIN IS
SELECT *
FROM CYKIM.XX_HOTEL_FROM_API T;
BEGIN
FOR R_MAIN IN C_MAIN
LOOP
DECLARE
V_HOTEL_CODE VARCHAR2(10);
BEGIN
DBMS_OUTPUT.PUT_LINE(R_MAIN.VENDOR_COMP_CODE || ':' || R_MAIN.VENDOR_HOTEL_CODE || ':' || R_MAIN.HOTEL_NAME_EN);
WITH SOURCE_HOTEL_TOKEN AS
(SELECT X.*
FROM (SELECT T.*
,REGEXP_SUBSTR(T.ADDRESS_EN, '[^ ]+', 1, S.LVL) AS TOKEN
FROM CYKIM.XX_HOTEL_FROM_API T
INNER JOIN (SELECT LEVEL AS LVL
FROM DUAL
CONNECT BY LEVEL < 20) S ON 1 = 1
WHERE T.VENDOR_COMP_CODE = R_MAIN.VENDOR_COMP_CODE
AND T.VENDOR_HOTEL_CODE = R_MAIN.VENDOR_HOTEL_CODE) X
WHERE X.TOKEN IS NOT NULL)
SELECT Y.HOTEL_CODE
INTO V_HOTEL_CODE
FROM (SELECT TKN.HOTEL_CODE
,COUNT(*) AS CNT
FROM SOURCE_HOTEL_TOKEN ST
INNER JOIN CYKIM.XX_HOTEL_ADDRESS_TOKENS TKN ON (TKN.TOKEN = ST.TOKEN)
GROUP BY TKN.HOTEL_CODE
HAVING COUNT(*) > 1
ORDER BY CNT DESC) Y
WHERE ROWNUM = 1;
UPDATE CYKIM.XX_HOTEL_FROM_API T
SET T.HOTEL_CODE_NEW = V_HOTEL_CODE
WHERE T.VENDOR_COMP_CODE = R_MAIN.VENDOR_COMP_CODE
AND T.VENDOR_HOTEL_CODE = R_MAIN.VENDOR_HOTEL_CODE;
COMMIT;
EXCEPTION
WHEN NO_DATA_FOUND THEN
DBMS_OUTPUT.PUT_LINE(SQLERRM);
END;
END LOOP;
END;