1. Introduction

To analyze large volumes of data, many of us used to build a Hadoop cluster in a local environment and run Spark on top of it to analyze large log data and similar sources. Sometimes building the Hadoop environment and preparing the Spark analysis environment took longer than the analysis itself. Today, however, you can set up an analysis environment easily on IBM Cloud, and the environment and scripts you build can run for as long and as often as you like. Above all, you do not need to keep your own laptop or computer switched on. One caveat is that handling 'files' during data analysis in the cloud differs somewhat from doing so on a user's PC, so this post focuses on that part.

2. Creating the Notebook practice environment

Python is a scripting language, and because writing code in it takes relatively less time than in other languages, it is widely used alongside R. In this post, the analysis is carried out in a Jupyter Notebook in order to use DSX Spark on IBM Cloud.

(1) Go to IBM Data Science Experience (DSX) and log in.

(2) Select [Get start] – [New Project] – [Add to Project] – [Notebook] in that order. On the screen that follows, choose the development environment you want.

(3) The familiar Notebook interface appears.

Python code written here works the same way as in a regular Jupyter Notebook, which is convenient.

Most of the keyboard shortcuts you used in Jupyter Notebook are also available.


[Figure 1] IBM DSX Spark architecture

The IBM DSX Spark architecture of the Notebook created through the steps so far is shown in [Figure 1][1]. As [Figure 1] indicates, if natural language processing is needed later, you can also call services such as NLU through the Watson API. Files are uploaded to IBM Cloud and then loaded in the Jupyter Notebook, where they can be read with Pandas, a Spark Session, and so on.

3. ํŒŒ์ผ ์—…๋กœ๋“œ/๋‹ค์šด๋กœ๋“œ

(1) ํŒŒ์ผ ์—…๋กœ๋“œ
๊ธฐ์กด Jupyter Notebook์˜ ๊ฒฝ์šฐ ํŒŒ์ผ์„ ์ฒ˜๋ฆฌํ•˜๊ธฐ ์œ„ํ•ด ์ผ๋ฐ˜์ ์œผ๋กœ ํŒŒ์ผ ๊ฒฝ๋กœ๋ฅผ ์ง์ ‘ ์ง€์ •ํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ๋Œ€๋ถ€๋ถ„์ด์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜์ง€๋งŒ, DSX ์ƒ์—์„œ ํŒŒ์ผ์„ ๋‹ค๋ฃจ๋ ค๋ฉด IBM Cloud์— ํŒŒ์ผ์„ ์—…๋กœ๋“œ ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ์•„๋ž˜์˜ ์Šคํฌ๋ฆฐ ์ƒท์— ๋‚˜ํƒ€๋‚œ [Data] ์•„์ด์ฝ˜์„ ํด๋ฆญํ•ฉ๋‹ˆ๋‹ค.

ํ•ด๋‹น ์œ„์น˜์— ํŒŒ์ผ์„ ์ง์ ‘ ๋งˆ์šฐ์Šค๋กœ ์˜ฌ๋ฆฌ๊ฑฐ๋‚˜ [browse] ๋ฒ„ํŠผ์„ ํด๋ฆญํ•ฉ๋‹ˆ๋‹ค. ํŒŒ์ผ์ด ์˜ฌ๋ผ๊ฐ„ ๊ฒƒ์ด ํ™•์ธ๋˜๋ฉด ๋‹ค์Œ ๋‹จ๊ณ„๋กœ ์ด๋™ํ•ฉ๋‹ˆ๋‹ค.

(2) ์—…๋กœ๋“œํ•œ ํŒŒ์ผ์„ Notebook์—์„œ ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

์—…๋กœ๋“œํ•œ ํŒŒ์ผ์— [Insert to code] ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด csvํŒŒ์ผ์ธ ๊ฒฝ์šฐ ์•„๋ž˜์™€ ๊ฐ™์€ ๋ฉ”๋‰ด๊ฐ€ ๋‚˜์˜ต๋‹ˆ๋‹ค. ์—ฌ๊ธฐ์—์„œ [SparkSession Dataframe]์„ ์„ ํƒํ•ฉ๋‹ˆ๋‹ค.

์•„๋ž˜์™€ ๊ฐ™์€ ์ฝ”๋“œ๊ฐ€ ์‚ฝ์ž…๋ฉ๋‹ˆ๋‹ค.

import ibmos2spark

# @hidden_cell
# Object Storage credentials (masked here); keep this cell hidden when sharing.
credentials = {
    'auth_url': 'https://identity.open.softlayer.com',
    'project_id': '********************************',
    'region': 'dallas',
    'user_id': '***************************',
    'username': 'member_****************************************',
    'password': '**********'
}

configuration_name = 'os_********************_configs'
# `sc` is the SparkContext that the notebook provides.
bmos = ibmos2spark.bluemix(sc, credentials, configuration_name)

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the uploaded CSV into a Spark DataFrame, treating the first row as the header.
df_data_1 = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .load(bmos.url('DefaultProjectyunhomaengkribmcom', 'testdata_devloperworks.csv'))
df_data_1.take(5)

The same applies to [Pandas Dataframe].

from io import StringIO
import requests
import json
import pandas as pd

# @hidden_cell
# This function accesses a file in your Object Storage. The definition contains your credentials.
# You might want to remove those credentials before you share your notebook.
def get_object_storage_file_with_credentials_****************************************(container, filename):
    """This function returns a StringIO object containing
    the file content from Bluemix Object Storage."""

    # Request an auth token from the identity service.
    url1 = ''.join(['https://identity.open.softlayer.com', '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': 'member_****************************************', 'domain': {'id': '****************************************'},
            'password': '***************'}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    # Find the public object-store endpoint for the dallas region.
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == 'dallas':
                    url2 = ''.join([e2['url'], '/', container, '/', filename])
    # Fetch the file using the token returned in the response header.
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO(resp2.text)

df_data_2 = pd.read_csv(get_object_storage_file_with_credentials_****************************************('DefaultProject***************', 'testdata_devloperworks.csv'))
df_data_2.head()
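The nested loop in the generated function is what locates the download URL: it walks the service catalog returned with the token and keeps the public object-store endpoint for the chosen region. The sketch below isolates that step so it can be tried without credentials. The helper name `find_object_store_url`, the sample catalog, and the `example.invalid` URLs are all made up for illustration. Unlike the generated code, this version returns the first match and returns None when nothing matches.

```python
def find_object_store_url(catalog, region, interface="public"):
    """Return the first object-store endpoint URL for the region/interface, or None."""
    for service in catalog:
        if service.get("type") != "object-store":
            continue
        for endpoint in service.get("endpoints", []):
            if endpoint.get("interface") == interface and endpoint.get("region") == region:
                return endpoint["url"]
    return None


# Made-up catalog with the same shape the generated code iterates over.
sample_catalog = [
    {"type": "identity", "endpoints": [
        {"interface": "public", "region": "dallas",
         "url": "https://identity.example.invalid"}]},
    {"type": "object-store", "endpoints": [
        {"interface": "internal", "region": "dallas",
         "url": "https://internal.example.invalid/v1/AUTH_x"},
        {"interface": "public", "region": "london",
         "url": "https://lon.example.invalid/v1/AUTH_x"},
        {"interface": "public", "region": "dallas",
         "url": "https://dal.example.invalid/v1/AUTH_x"}]},
]

base_url = find_object_store_url(sample_catalog, "dallas")
# Same URL layout as the generated code: <endpoint>/<container>/<filename>
file_url = "/".join([base_url, "my-container", "my-file.csv"])
missing = find_object_store_url(sample_catalog, "tokyo")
print(file_url)
```

Returning None for an unknown region makes a wrong region setting easier to spot than the unbound `url2` the generated code would hit in that case.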

(3) ํŒŒ์ผ ์ฝ๊ธฐ

[SparkSession Dataframe]์˜ ๊ฒฝ์šฐ ์œ„์˜ ์ฝ”๋“œ๊ฐ€ ์ œ๋Œ€๋กœ ์‹คํ–‰๋˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ํ™”๋ฉด์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค.

[Pandas Dataframe]์˜ ๊ฒฝ์šฐ ์œ„์˜ ์ฝ”๋“œ๊ฐ€ ์ œ๋Œ€๋กœ ์‹คํ–‰๋˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™์€ ํ™”๋ฉด์ด ์ถœ๋ ฅ๋ฉ๋‹ˆ๋‹ค.

(4) ํŒŒ์ผ ๋‹ค์šด๋กœ๋“œ ํ•˜๊ธฐ

IBM DSX๋ฅผ ํ™œ์šฉํ•˜๋ฉด์„œ ๊ฐ€์žฅ ์–ด๋ ค์›€์„ ๊ฒช๋Š” ๋ถ€๋ถ„์ด ๋ถ„์„์„ ๋๋‚ธ ํŒŒ์ผ์„ ๋‹ค์šด ๋ฐ›๋Š” ๋ถ€๋ถ„์ž…๋‹ˆ๋‹ค. Notebook์— ์™ธ๋ถ€ ์Šคํฌ๋ฆฝํŠธ๋ฅผ ํ˜ธ์ถœํ•ด์„œ ๋‹ค์šด๋กœ๋“œ ๋ฒ„ํŠผ์„ ๋งŒ๋“ค์–ด๋ณด๊ธฐ๋„ ํ•˜๊ณ , Pixiedust๋กœ ๊ทธ๋ž˜ํ”„๋ฅผ ๊ทธ๋ฆฌ๊ณ  Export ๊ธฐ๋Šฅ์„ ํ™œ์šฉ์„ ํ•ด๋ณด๊ธฐ๋„ ํ•˜์ง€๋งŒ, csvํŒŒ์ผ ์šฉ๋Ÿ‰์ด ์ปค์ง€๊ฑฐ๋‚˜ ํ•œ๊ธ€์ด ๋“ค์–ด๊ฐ€๋ฉด ์ œ๋Œ€๋กœ ๋‹ค์šด๋˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

from pixiedust.display import *
display(dataframe.to_string())

For that reason, you need to use a CURL command. It requires two pieces of information: first, the credentials, and second, the path where the file is stored. To find these, run the code below.

# Check the file path
path = ! pwd
print (path[0])
path2 = path[0].replace("notebook/work", "")
print (path2)
# filename1 is the output file name and is assumed to be defined earlier in the notebook
full_path1 = path2 + "data/" + filename1

# save dataframe on ibm cloud
dataframe.to_csv(full_path1, encoding='utf-8')

์œ„์˜ ์ฝ”๋“œ๋Š” ํ˜„์žฌ IBM Cloud์ƒ์˜ ํŒŒ์ผ ๊ฒฝ๋กœ๋ฅผ ํ™•์ธํ•˜๊ณ , ์™ธ๋ถ€ ๋‹ค์šด๋กœ๋“œ๊ฐ€ ๊ฐ€๋Šฅํ•œ ๊ฒฝ๋กœ๋กœ csvํŒŒ์ผ์„ ์ €์žฅํ•˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

!ls -l /gpfs/fs01/user/s63b-*************-*************/data

This lists the directory to check that the files saved in the previous step are in place.

To retrieve the credentials, go to the IBM Cloud site.

On the site, look up the credentials of the Spark service that DSX is using. Note the tenant_id, tenant_secret, and instance_id values from those credentials, then write the curl command below.

curl -O -X GET -u ****-tenant_id-**********:tenant_secret-****-****-****-*********** -H 'X-Spark-service-instance-id: instance_id-****-****-****-**********' https://spark.bluemix.net/tenant/data/dataframe.csv

Now run the CURL command on your own PC. As shown below, the analysis output is downloaded to your PC.
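For readers who would rather script the download, the sketch below shows roughly what the curl options translate to in Python: `-u` becomes an HTTP Basic `Authorization` header and `-H` becomes a custom header. It only builds the request and does not send it. The function name `build_download_request` and the `my-...` values are placeholders, not real credentials, and whether the endpoint accepts a request built this way has not been verified here.

```python
import base64
import urllib.request


def build_download_request(url, tenant_id, tenant_secret, instance_id):
    """Build (but do not send) a GET request equivalent to the curl command."""
    pair = f"{tenant_id}:{tenant_secret}".encode("utf-8")
    token = base64.b64encode(pair).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    # curl -u tenant_id:tenant_secret
    request.add_header("Authorization", "Basic " + token)
    # curl -H 'X-Spark-service-instance-id: ...'
    request.add_header("X-Spark-service-instance-id", instance_id)
    return request


req = build_download_request(
    "https://spark.bluemix.net/tenant/data/dataframe.csv",
    "my-tenant-id", "my-tenant-secret", "my-instance-id",
)
# urllib normalizes header-name capitalization, so compare names in lowercase.
headers = {name.lower(): value for name, value in req.header_items()}
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` and writing the response body to a local file would correspond to curl's `-O` option.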

(5) When using only Object Storage

If you use only Object Storage, you can download the file you want to your PC directly from the IBM Cloud Console with [Select Action] – [File Download], as shown below.

4. Closing remarks

Whether you are handling large volumes of data or running crawling scripts that must execute periodically, DSX on IBM Cloud makes the analysis easier. At times you may need more computing power and storage. In that case, IBM Cloud lets you select higher tiers of computing power and storage, so you can build a scalable analysis environment while writing your code only once.

Reference

[1] Extend Watson text classification (Watson 텍스트 분류 확장하기), https://developer.ibm.com/kr/journey/extend-watson-text-classification/

Author: 맹윤호 (Yunho Maeng)
