Big SQL’s LOAD HADOOP statement supports a number of options, including options that describe how to interpret the data contained in a source file. These options are referred to as source properties. Note that each source property and its associated value must be enclosed in single quotes when specified in the LOAD HADOOP statement.
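
For orientation, the general shape of such a statement is shown below; the file path, property names, and table name are placeholders for this sketch rather than a complete, runnable example:

LOAD HADOOP USING FILE URL '/path/to/source_file'
WITH SOURCE PROPERTIES ('property.name'='value','another.property'='value')
INTO TABLE target_table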

Two of the source properties, quote.char and escape.char, will be discussed here.

'quote.char'='char'

Double quotation marks ("...") are, by default, expected to be used to enclose a text field value in a source file that contains the field delimiter character. For example, let’s say we have a field value of January, February, March. If a comma (,) is used as the field delimiter, then this string value must be enclosed in double quotation marks as follows: "January, February, March".
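
To make this concrete, a record in such a source file might look like the following under the default settings (an illustrative line only, not part of the example data used later):

1,"January, February, March",2024

Because the second field is enclosed in double quotation marks, the embedded commas are treated as part of the value rather than as field delimiters.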

If there are field values in the source file that include double quotation marks as well as field delimiter characters, then a different character must be used to enclose those field values. The quote.char source property can be used to override the default double quotation mark.

For example, let’s say we have the following table:

CREATE HADOOP TABLE employees
(
   empno INTEGER,
   fullname VARCHAR(30),
   dept INTEGER
)

Source file /my_hdfs/data/staff_data1.csv contains the data below. A comma (,) is used as the field delimiter.

(*** empno, fullname, dept ***)
100,'Plotz, James "Jim"',850
110,'Ngan, John',750
120,'Naughton, Rebecca "Becky"',650
130,'Yamaguchi, Kaito',750
140,'Fraye, Charles',650
150,'Williams, Katherine "Kathy"',750
160,'Molinare, Mary',550
170,'Kermisch, Joanne',650
180,'Abrahams, Stephen',850
190,'Sneider, William "Bill"',650
200,'Sanders, Allison',550

Note that the values for the second field include a comma, and some of them also include a double-quoted string. As indicated earlier, double quotation marks are, by default, expected to be used to enclose a field value that includes the field delimiter character; the data above instead uses single quotation marks.

The following LOAD HADOOP statement would be used to load the data into the table:

LOAD HADOOP USING FILE URL '/my_hdfs/data/staff_data1.csv'
WITH SOURCE PROPERTIES ('field.delimiter'=',','quote.char'='''')
INTO TABLE employees

Note that since the new value for quote.char is a single quote that needs to be enclosed by single quotes, the embedded single quote must be escaped by another single quote (four single quote characters in total).
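
For comparison, if the data had used a quote character that is not itself a single quote, say a vertical bar (|), no such escaping would be needed. The clause would simply be (an illustrative variation, not part of this example):

WITH SOURCE PROPERTIES ('field.delimiter'=',','quote.char'='|')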

By querying the table after the load has completed, we can verify that the data was loaded correctly:

SELECT empno, fullname, dept FROM employees ORDER BY empno

EMPNO       FULLNAME                       DEPT       
----------- ------------------------------ -----------
        100 Plotz, James "Jim"                     850
        110 Ngan, John                             750
        120 Naughton, Rebecca "Becky"              650
        130 Yamaguchi, Kaito                       750
        140 Fraye, Charles                         650
        150 Williams, Katherine "Kathy"            750
        160 Molinare, Mary                         550
        170 Kermisch, Joanne                       650
        180 Abrahams, Stephen                      850
        190 Sneider, William "Bill"                650
        200 Sanders, Allison                       550

  11 record(s) selected.

'escape.char'='char'

An alternative to the quote.char source property is the escape.char source property. It specifies an escape character that allows a double quotation mark to appear as part of a field value that is itself enclosed by double quotation marks.

For example, let’s say we have the same table as in the previous example:

CREATE HADOOP TABLE employees
(
   empno INTEGER,
   fullname VARCHAR(30),
   dept INTEGER
)

Source file /my_hdfs/data/staff_data2.csv contains the data below. A comma (,) is used as the field delimiter.

(*** empno, fullname, dept ***)
100,"Plotz, James !"Jim!"",850
110,"Ngan, John",750
120,"Naughton, Rebecca !"Becky!"",650
130,"Yamaguchi, Kaito",750
140,"Fraye, Charles",650
150,"Williams, Katherine !"Kathy!"",750
160,"Molinare, Mary",550
170,"Kermisch, Joanne",650
180,"Abrahams, Stephen",850
190,"Sneider, William !"Bill!"",650
200,"Sanders, Allison",550

Note that the values for the second field include a comma, and some of them also include a double-quoted string. Placing the escape character (!) in front of each embedded double quotation mark allows it to be recognized as part of the double-quoted field value.

The following LOAD HADOOP statement would be used to load the data into the table:

LOAD HADOOP USING FILE URL '/my_hdfs/data/staff_data2.csv'
WITH SOURCE PROPERTIES ('field.delimiter'=',','escape.char'='!')
INTO TABLE employees

By querying the table after the load has completed, we can verify that the data was loaded correctly:

SELECT empno, fullname, dept FROM employees ORDER BY empno

EMPNO       FULLNAME                       DEPT       
----------- ------------------------------ -----------
        100 Plotz, James "Jim"                     850
        110 Ngan, John                             750
        120 Naughton, Rebecca "Becky"              650
        130 Yamaguchi, Kaito                       750
        140 Fraye, Charles                         650
        150 Williams, Katherine "Kathy"            750
        160 Molinare, Mary                         550
        170 Kermisch, Joanne                       650
        180 Abrahams, Stephen                      850
        190 Sneider, William "Bill"                650
        200 Sanders, Allison                       550

  11 record(s) selected.
