Calculating spark-submit hardware options
Joon09
2022. 2. 7. 14:44
Reference (calculation formulas)
https://aws.amazon.com/ko/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/
List of computed options
executor-cores
executor-memory
spark.driver.cores
driver-memory
num-executors
spark.default.parallelism
How to check CPU and memory on Linux
# Check CPU cores
lscpu
# Check memory
free -h
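
If you would rather not hard-code these values in the script below, they can also be read programmatically. A small sketch using `nproc` and `free` (note `free -g` truncates to whole GB; adjust the unit flag to your environment):

```shell
#!/usr/bin/env bash
# Read total logical cores and total memory (in GB) from the system
CPU_CORE=$(nproc)
MEMORY=$(free -g | awk '/^Mem:/ {print $2}')
echo "cores=${CPU_CORE} memory_gb=${MEMORY}"
```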
Source (Shell)
#!/usr/bin/env bash
# Total number of CPU cores on the server
CPU_CORE=64
# Total server memory, in GB
MEMORY=40
# Fraction of memory Spark may use, excluding YARN overhead memory (1.0 - overhead)
SPARK_MEMORY_FRACTION=0.9
# Number of Spark worker nodes
SPARK_NODE_COUNT=1
# CPU cores per executor; 5 is commonly cited as the optimal constant
SPARK_EXECUTORS_CORES=5

# Calculations
# Leave 1 core per node for the OS and Hadoop/YARN daemons
NUMBER_OF_EXECUTORS_PER_INSTANCE=$(( (CPU_CORE - 1) / SPARK_EXECUTORS_CORES ))
SPARK_EXECUTOR_MEMORY_1=$(echo "${MEMORY}/${NUMBER_OF_EXECUTORS_PER_INSTANCE}" | bc)
SPARK_EXECUTOR_MEMORY=$(printf "%.0f" "$(echo "${SPARK_EXECUTOR_MEMORY_1}*${SPARK_MEMORY_FRACTION}" | bc -l)")
SPARK_DRIVER_MEMORY=${SPARK_EXECUTOR_MEMORY}
SPARK_DRIVER_CORES=${SPARK_EXECUTORS_CORES}
# Reserve 1 executor slot for the driver
SPARK_INSTANCE=$(( NUMBER_OF_EXECUTORS_PER_INSTANCE * SPARK_NODE_COUNT - 1 ))
SPARK_DEFAULT_PARALLELISM=$(( SPARK_INSTANCE * SPARK_EXECUTORS_CORES * 2 ))
echo "------- spark-submit Options -------"
echo "executor-cores = ${SPARK_EXECUTORS_CORES}"
echo "executor-memory = ${SPARK_EXECUTOR_MEMORY}"
echo "spark.driver.cores = ${SPARK_DRIVER_CORES}"
echo "driver-memory = ${SPARK_DRIVER_MEMORY}"
echo "num-executors = ${SPARK_INSTANCE}"
echo "spark.default.parallelism = ${SPARK_DEFAULT_PARALLELISM}"
echo "------- spark-submit Options -------"
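
For reference, here is how the computed values for the sample numbers above (64 cores, 40 GB, 1 node) would look in an actual spark-submit invocation. The master, class name, and jar path are placeholders, not from this post:

```shell
#!/usr/bin/env bash
# Worked example with the sample inputs:
#   executors per node = (64 - 1) / 5          = 12
#   executor memory    = round((40 / 12) * 0.9) = 3 GB
#   num-executors      = 12 * 1 - 1             = 11
#   parallelism        = 11 * 5 * 2             = 110
# com.example.MyApp and my-app.jar are placeholders.
CMD="spark-submit \
  --master yarn \
  --executor-cores 5 \
  --executor-memory 3g \
  --driver-memory 3g \
  --num-executors 11 \
  --conf spark.driver.cores=5 \
  --conf spark.default.parallelism=110 \
  --class com.example.MyApp my-app.jar"
echo "$CMD"
```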