This story was originally published on HackerNoon at:
https://hackernoon.com/optimizing-distributed-data-processing-for-ml-at-scale.
A practitioner's guide to ML data pipeline performance: read the query plan first, eliminate shuffle, fix file layout, handle skew, and prune columns.
Check more stories related to data-science at:
https://hackernoon.com/c/data-science.
You can also check exclusive content about
#spark,
#pyspark,
#machine-learning,
#data-engineering,
#performance-optimization,
#distributed-systems,
#distributed-data-processing,
#optimizing-distributed-data, and more.
This story was written by:
@seshendranath. Learn more about this writer by checking
@seshendranath's about page,
and for more stories, please visit
hackernoon.com.
Stop tuning knobs on a broken foundation: shuffle, file layout, skew, and column pruning do more for ML pipeline performance than any clever algorithm.