With the release of Apache Spark V1.5, there are three string-trimming functions in Spark SQL: TRIM, LTRIM and RTRIM. These three trim functions can trim the spaces from a string, although the user can’t specify a character or a character string for the trim function to trim.

It’s extremely useful functionality for the TRIM function, as outlined in the merged Apache Spark issue JIRA-14878 in Spark 2.2.

The use of the new enhancement to the TRIM function is outlined below. Following is the syntax of the new feature in the Spark SQL interface.

TRIM

TRIM ([LEADING | TRAILING | BOTH] [trim characters] FROM )

  • trim characters: The characters to trim off the source string, the default value is a single space.
  • BOTH: Keyword to specify the trimming characters from both ends of the source string.
  • LEADING: Keyword to specify the trimming characters from the left end of the source string.
  • TRAILING: Keyword to specify the trimming characters from the right end of the source string.
  • trim source: A string expression.

Sample Use case: Use the TRIM function with BOTH keywords

scala>   spark.sql("select   TRIM(BOTH   'abc'   FROM   'aaaddddcccc')").show 
+----------------------+
|trim(aaaddddcccc, abc)|
+----------------------+
|                  dddd|
+----------------------+ 

Sample use case: Use the TRIM function with LEADING keywords

scala>   spark.sql("select   TRIM(LEADING   'abc'   FROM   'aaaaabbbddddcccc')").show 
+----------------------------+
|ltrim(aaaaabbbddddcccc, abc)|
+----------------------------+
|                    ddddcccc|
+----------------------------+ 

Sample use case: Use the TRIM function with TRAILING keywords

scala>   spark.sql("select   TRIM(TRAILING   'abc'   FROM   'aaaaabbbddddcccc')").show 
+----------------------------+
|rtrim(aaaaabbbddddcccc, abc)|
+----------------------------+
|                aaaaabbbdddd|
+----------------------------+ 

Sample use case: Use the TRIM function to trim multiple-bytes characters with BOTH keywords

Note: In this use case, one Chinese character requires 3 bytes in UTF-8 encoding.

sql("select   TRIM(BOTH   '砖头数数'   FROM   '数数数据砖头数数’)”).show 
+————————————----------------—-+ 
|trim(数数数据砖头数数,   砖头数数)| 
+------------------------------+ 
|                            据|
+------------------------------+ 

Sample use case: Use the TRIM function to trim multiple-byte characters with LEADING keywords

sql("select   TRIM(LEADING   '砖头数数'   FROM   '数数a数据砖头数数')").show 
+------------------------------+ 
|ltrim(数数a数据砖头数数, 砖头数数)| 
+------------------------------+ 
|                   a数据砖头数数| 
+------------------------------+ 

Sample use case: Use the TRIM function to trim multiple-bytes characters with TRAILING keywords

sql("select   TRIM(TRAILING   '砖头数数'   FROM   '数数a数据砖   头数数')").show 
+---------------------------------+
|rtrim(数数a数据砖  头数数,  砖头数数)| 
+---------------------------------+ 
|                        数数a数据砖| 
+---------------------------------+ 

Sample use case: Use the TRIM function to trim the column

scala>   spark.sql("select   *   from   t2").show 
+------------------+
|               str| 
+------------------+ 
|  ex      &[]ample| 
|           example| 
+------------------+ 
 
scala>   spark.sql("select   TRIM(BOTH   'xe   ]&[   $'   FROM   str)   from   t2").show 
+-------------------------+ 
|trim(str,   xe   ]&[   $)| 
+-------------------------+ 
|                     ampl| 
|                     ampl| 
+-------------------------+ 

LTRIM

The LTRIM function trims the characters from the left end of a given string.

LTRIM ([trim characters], trim source)

  • trim characters: The trim string characters to trim, the default value is a single space
  • trim source: A string expression

Note: In the Spark SQL interface, the first parameter is trim characters, the second is the trim source. If you use this LTRIM in the data set API, the trim source will be the first parameter.

Sample use case: Use the LTRIM function in the SQL interface to remove numbers, special characters from the left end of the source string

scala>   spark.sql("select   LTRIM('321',   '123DEFG123'),   LTRIM('<{[   ]$',   '   $${{]]] 12DEFG123')").show
+-------------------------------+-------------------------------------------+
|       ltrim(123DEFG123,   321)|  ltrim(   $${{]]]   12DEFG123,   <{[   ]$)| 
+-------------------------------+-------------------------------------------+
|                        DEFG123|                                  12DEFG123| 
+-------------------------------+-------------------------------------------+ 

Sample use case: Use the LTRIM function to remove multiple-byte characters

Note: In this case, one Chinese character needs 3 bytes in UTF-8 encoding.

sql("select   ltrim('砖头a数数   砖   据',   '数数a数据砖   $头数数'   )").show 
+-----------------------------------------------+ 
| ltrim(数数a数据砖   $头数数,   砖头a数数   砖   据)| 
+-----------------------------------------------+ 
|                                         $头数数| 
+-----------------------------------------------+
 
scala>   spark.sql("select   LTRIM('x   e&[',   str)   FROM   t2").show 
+---------------------+
|ltrim(str,   x   e&[)|
+---------------------+ 
|               ]ample| 
|                ample| 
+---------------------+ 

RTRIM

A function that trims the characters from right end of a given string.

RTRIM ([trim characters], trim source)

  • trim characters: The trim string characters to trim, the default value is a single space
  • trim source: A string expression

Note: trim source is the second parameter if there are trim characters. If it is used in the dataset API, trim source will be the first parameter.

Sample use case: Use the RTRIM function in the sql interface to remove numbers, special characters from the right end of the source string

scala>   spark.sql("select   RTRIM('321',   '123DEFG123'),   RTRIM('<{[   ]$',   '   $${{]]]   12DEFG123$$ [[[]]]<')").show
+--------------------------------+------------------------------------------------------+
|        rtrim(123DEFG123,   321)| rtrim(   $${{]]]   12DEFG123$$   [[[]]]<,   <{[   ]$)| 
+--------------------------------+------------------------------------------------------+ 
|                         123DEFG|                                   $${{]]]   12DEFG123| 
+--------------------------------+------------------------------------------------------+ 

Sample use case: Use the RTRIM function to remove multiple-byte characters

Note: In this case, one Chinese character needs 3 bytes in UTF-8 encoding.

sql("select   RTRIM('砖头a数数   砖   据',   '数数a数据砖   $头数数'   )").show 
+-----------------------------------------------+ 
|rtrim(数数a数据砖   $头数数,   砖头a数数   砖   据)| 
+-----------------------------------------------+ 
|                                数数a数据砖    $| 
+-----------------------------------------------+ 

Sample use case: Use the RTRIM function to remove trim characters from a column

scala>   spark.sql("select   *   from   t2").show
+-------------------+
|                str| 
+-------------------+ 
|   ex      &[]ample| 
|            example| 
+-------------------+ 
 
 scala>   spark.sql("select   RTRIM('x   e&l数p',   str)   FROM   t2").show 
+------------------------+
|rtrim(str,   x   e&l数p)|
+------------------------+ 
|            ex     &[]am| 
|                    exam| 
+------------------------+ 

Here are the examples for using the TRIM functions in the dataset interface.

Note: The order of parameters for the TRIM function in the dataset API is slightly different. The first parameter is the trim source string and the second parameter is the trim characters.

scala>   df.show 
+---------------+-------------------+------------------------------------+ 
|              a|                  b|                                   c| 
+---------------+-------------------+------------------------------------+ 
|        example|    <>$&*e   xample|             <&$   example数???&&&$%| 
+---------------+-------------------+------------------------------------+ 
 
scala>   df.select(TRIM($"a",   "exl"),   LTRIM($"b",   "*   e><   &$x"),   RTRIM($"c",   "数?e$%")).show 
+--------------+-------------------------+----------------------------------+ 
|trim(a,   exl)|ltrim(b,   *   e><   &$x)|               rtrim(c,   数?&e$%)| 
+--------------+-------------------------+----------------------------------+ 
|           amp|                    ample|                       <&$  exampl|
+--------------+-------------------------+----------------------------------+ 

With these enhancements, the TRIM function is more powerful than ever for Apache Spark applications. Don't hesitate to leave me your feedback or questions.

Join The Discussion

Your email address will not be published. Required fields are marked *