
Checking spark-submit hardware options

Joon09 2022. 2. 7. 14:44

Reference (source of the calculation formulas)

https://aws.amazon.com/ko/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/


Options calculated by the script

executor-cores

executor-memory

spark.driver.cores

driver-memory

num-executors

spark.default.parallelism

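For reference, these are derived with the sizing formulas described in the AWS post above:

executors per instance = (total cores per instance - 1) / cores per executor
    (one core is left for the OS and Hadoop daemons)
executor memory = (memory per instance / executors per instance) * (1 - memory overhead fraction)
num-executors = executors per instance * node count - 1
    (one executor slot is left for the driver)
spark.default.parallelism = num-executors * executor-cores * 2
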

Checking CPU and memory on Linux

# Check the number of CPU cores
lscpu

# Check the total memory
free -h
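
Instead of hard-coding these values in the script below, they can also be read from the machine directly. A minimal sketch, assuming nproc (coreutils) and free (procps) are available:

# Total number of CPU cores visible to the OS
CPU_CORE=$(nproc)
# Total memory in whole GB (free -g rounds down)
MEMORY=$(free -g | awk '/^Mem:/ {print $2}')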

Source (Shell)

#!/usr/bin/env bash

# Total number of CPU cores on the server
CPU_CORE=64
# Total memory on the server, in GB
MEMORY=40
# Fraction of memory Spark will actually use, excluding the YARN memory overhead (1.0 - overhead)
SPARK_MEMORY_FRACTION=0.9
# Number of Spark nodes
SPARK_NODE_COUNT=1
# Number of CPU cores per executor; a constant of 5 is commonly cited as optimal
SPARK_EXECUTORS_CORES=5

# Calculations
# One core is reserved for the OS and Hadoop daemons
NUMBER_OF_EXECUTORS_PER_INSTANCE=$(( (CPU_CORE - 1) / SPARK_EXECUTORS_CORES ))
# Total memory available to each executor, in whole GB
SPARK_EXECUTOR_MEMORY_1=$(( MEMORY / NUMBER_OF_EXECUTORS_PER_INSTANCE ))
# Apply the Spark memory fraction and round to the nearest GB
SPARK_EXECUTOR_MEMORY=$(printf "%.0f" "$(echo "${SPARK_EXECUTOR_MEMORY_1} * ${SPARK_MEMORY_FRACTION}" | bc -l)")
# The driver gets the same memory and cores as an executor
SPARK_DRIVER_MEMORY=${SPARK_EXECUTOR_MEMORY}
SPARK_DRIVER_CORES=${SPARK_EXECUTORS_CORES}
# One executor slot is reserved for the driver
SPARK_INSTANCE=$(( NUMBER_OF_EXECUTORS_PER_INSTANCE * SPARK_NODE_COUNT - 1 ))
SPARK_DEFAULT_PARALLELISM=$(( SPARK_INSTANCE * SPARK_EXECUTORS_CORES * 2 ))

echo "------- spark-submit Options -------"
echo "executor-cores = ${SPARK_EXECUTORS_CORES}"
echo "executor-memory = ${SPARK_EXECUTOR_MEMORY}"
echo "spark.driver.cores = ${SPARK_DRIVER_CORES}"
echo "driver-memory = ${SPARK_DRIVER_MEMORY}"
echo "num_executors = ${SPARK_INSTANCE}"
echo "spark.default.parallelism = ${SPARK_DEFAULT_PARALLELISM}"
echo "------- spark-submit Options -------"