행위

오라클 텍스트 유사도

DB CAFE

thumb_up 추천메뉴 바로가기


1 UTL_MATCH (텍스트 유사성 검증)[편집]

두개의 컬럼값에 대한 텍스트 유사성을 보여주는 함수입니다.
불일치 : 0, 일치 : 100 그외 각 텍스트의 일치여부를 퍼센트(%)로 보여줍니다.

https://oracle-base.com/articles/11g/utl_match-string-matching-in-oracle


UTL_MATCH : String Matching by Testing Levels of Similarity/Difference Use the UTL_MATCH package to determine the similarity between two strings. oracle-base.com

UTL_MATCH : String Matching by Testing Levels of Similarity/Difference

The UTL_MATCH package was introduced in Oracle 10g Release 2, but first documented (and therefore supported) in Oracle 11g Release 2. It contains a variety of functions that are helpful for testing the level of similarity/difference between strings.

  1. Setup
  2. EDIT_DISTANCE
  3. EDIT_DISTANCE_SIMILARITY
  4. JARO_WINKLER
  5. JARO_WINKLER_SIMILARITY

1.1 Setup[편집]

The examples in this article require the following table definition.

DROP TABLE match_tab;
 
CREATE TABLE match_tab ( id NUMBER, col1 VARCHAR2(15), col2 VARCHAR2(15)
, CONSTRAINT match_tab_pk 
PRIMARY KEY (id) ); 

INSERT INTO match_tab VALUES (1, 'Peter Parker', 'Pete Parker'); 
INSERT INTO match_tab VALUES (2, 'Peter Parker', 'peter parker'); 
INSERT INTO match_tab VALUES (3, 'Clark Kent', 'Claire Kent'); 
INSERT INTO match_tab VALUES (4, 'Wonder Woman', 'Ponder Woman'); 
INSERT INTO match_tab VALUES (5, 'Superman', 'Superman'); 
INSERT INTO match_tab VALUES (6, 'The Hulk', 'Iron Man'); COMMIT;

1.2 EDIT_DISTANCE[편집]

The "Edit Distance", or "Levenshtein Distance", test measures the similarity between two strings by counting the number of character changes (inserts, updates, deletes) required to transform the first string into the second. The number of changes required is know as the distance.

SELECT col1, col2
, UTL_MATCH.edit_distance(col1, col2) AS ed FROM match_tab ORDER BY id;

1.3 EDIT_DISTANCE_SIMILARITY[편집]

The EDIT_DISTANCE_SIMILARITY function uses the same method as the EDIT_DISTANCE function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).

SELECT col1, col2, UTL_MATCH.edit_distance_similarity(col1, col2) AS eds 
FROM match_tab 
ORDER BY id;
SELECT id, col1, col2, UTL_MATCH.edit_distance_similarity(col1, col2) AS eds 
  FROM match_tab 
 WHERE UTL_MATCH.edit_distance_similarity(col1, col2) > 90 
ORDER BY id;

The "Jaro-Winkler Algorithm" provides a different method for finding the distance between two strings.

SELECT col1, col2, UTL_MATCH.jaro_winkler(col1, col2) AS jw 
  FROM match_tab 
 ORDER BY id;

1.4 JARO_WINKLER_SIMILARITY[편집]

​The JARO_WINKLER_SIMILARITY function uses the same method as the JARO_WINKLER function to determine the similarity of the strings, but it returns a normalized result ranging from 0 (no match) to 100 (complete match).

SELECT col1
     , col2
     , UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
  FROM match_tab
 ORDER BY id;
SELECT col1
     , col2
     , UTL_MATCH.jaro_winkler_similarity(col1, col2) AS jws
  FROM match_tab
 WHERE UTL_MATCH.jaro_winkler_similarity(col1, col2) > 90
 ORDER BY id;

  • For more information see:

UTL_MATCH : String Matching

UTL_MATCH

Levenshtein distance

Jaro–Winkler distance

1.5 유사도 활용[편집]

SELECT *
  FROM CYKIM.VD_VENDOR_HOTEL_MASTER;

-- 테스트 마스터 테이블 생성
CREATE TABLE CYKIM.XX_HOTEL_MASTER AS
    SELECT T.HOTEL_CODE
          ,T.VENDOR_HOTEL_NAME_EN AS HOTEL_NAME_EN
          ,T.VENDOR_HOTEL_NAME_KO AS HOTEL_NAME_KO
          ,T.PHONE_NO
          ,T.ZIP_CODE
          ,T.ADDRESS_EN
          ,T.ADDRESS_KO
          ,T.LATITUDE
          ,T.LONGITUDE
          ,T.DELETE_YN
      FROM CYKIM.VD_VENDOR_HOTEL_MASTER T
     WHERE T.VENDOR_COMP_CODE = '910001'
       AND T.VENDOR_HOTEL_CODE = T.HOTEL_CODE;

-- 마스터 테이블의 주소에 대한 토큰 테이블 생성 및 인덱싱
CREATE TABLE CYKIM.XX_HOTEL_ADDRESS_TOKENS AS
    SELECT HOTEL_CODE
          ,LVL
          ,TOKEN
      FROM (SELECT S.LVL
                  ,T.HOTEL_CODE
                  ,T.ADDRESS_EN
                  ,REGEXP_SUBSTR(T.ADDRESS_EN, '[^ ]+', 1, S.LVL) AS TOKEN
              FROM CYKIM.XX_HOTEL_MASTER T
             INNER JOIN (SELECT LEVEL AS LVL
                          FROM DUAL
                        CONNECT BY LEVEL < 20) S ON 1 = 1
             ORDER BY T.HOTEL_CODE
                     ,S.LVL) X
     WHERE X.TOKEN IS NOT NULL;

-- 테스트 - API 로부터 받은 데이터
CREATE TABLE CYKIM.XX_HOTEL_FROM_API AS
    SELECT T.VENDOR_COMP_CODE
          ,T.VENDOR_HOTEL_CODE
          ,T.VENDOR_HOTEL_NAME_EN AS HOTEL_NAME_EN
          ,T.VENDOR_HOTEL_NAME_KO AS HOTEL_NAME_KO
          ,T.PHONE_NO
          ,T.ZIP_CODE
          ,T.ADDRESS_EN
          ,T.ADDRESS_KO
          ,T.LATITUDE
          ,T.LONGITUDE
          ,T.DELETE_YN
          ,T.HOTEL_CODE
      FROM CYKIM.VD_VENDOR_HOTEL_MASTER T
     WHERE T.VENDOR_COMP_CODE <> '910001';

-- YKIM.XX_HOTEL_FROM_API 테이블에 HOTEL_CODE_NEW 컬럼 추가함


-- HOTEL_CODE_NEW 컬럼 업데이트 작업.
DECLARE

    CURSOR C_MAIN IS
        SELECT *
          FROM CYKIM.XX_HOTEL_FROM_API T;

BEGIN

    FOR R_MAIN IN C_MAIN
    LOOP
    
        DECLARE
            V_HOTEL_CODE VARCHAR2(10);
        BEGIN
            DBMS_OUTPUT.PUT_LINE(R_MAIN.VENDOR_COMP_CODE || ':' || R_MAIN.VENDOR_HOTEL_CODE || ':' || R_MAIN.HOTEL_NAME_EN);
            WITH SOURCE_HOTEL_TOKEN AS
             (SELECT X.*
                FROM (SELECT T.*
                            ,REGEXP_SUBSTR(T.ADDRESS_EN, '[^ ]+', 1, S.LVL) AS TOKEN
                        FROM CYKIM.XX_HOTEL_FROM_API T
                       INNER JOIN (SELECT LEVEL AS LVL
                                    FROM DUAL
                                  CONNECT BY LEVEL < 20) S ON 1 = 1
                       WHERE T.VENDOR_COMP_CODE = R_MAIN.VENDOR_COMP_CODE
                         AND T.VENDOR_HOTEL_CODE = R_MAIN.VENDOR_HOTEL_CODE) X
               WHERE X.TOKEN IS NOT NULL)
            SELECT Y.HOTEL_CODE
              INTO V_HOTEL_CODE
              FROM (SELECT TKN.HOTEL_CODE
                          ,COUNT(*) AS CNT
                      FROM SOURCE_HOTEL_TOKEN ST
                     INNER JOIN CYKIM.XX_HOTEL_ADDRESS_TOKENS TKN ON (TKN.TOKEN = ST.TOKEN)
                     GROUP BY TKN.HOTEL_CODE
                     HAVING COUNT(*) > 1
                     ORDER BY CNT DESC) Y
             WHERE ROWNUM = 1;
        
            UPDATE CYKIM.XX_HOTEL_FROM_API T
               SET T.HOTEL_CODE_NEW = V_HOTEL_CODE
             WHERE T.VENDOR_COMP_CODE = R_MAIN.VENDOR_COMP_CODE
               AND T.VENDOR_HOTEL_CODE = R_MAIN.VENDOR_HOTEL_CODE;
            
            COMMIT;
            
        EXCEPTION
            WHEN NO_DATA_FOUND THEN
                DBMS_OUTPUT.PUT_LINE(SQLERRM);
        END;
    
    END LOOP;

END;