
Job Role: Data Engineer

Location: Dallas

Salary Range: 60K to 70K

Job Description

Description of the Role and Key Responsibilities:

Associates should be proficient in data/ETL development and testing, with hands-on PySpark experience and exposure to big data tools. Responsibilities include:

· Develop and execute PySpark test scripts for ETL pipelines, data transformations, and quality validations
· Design and develop a PySpark test framework, focusing on reusable modules for batch processing
· Prepare basic PySpark test scripts for ETL validations (e.g., row counts, null checks)
· Run data validation queries on Hive and check Hadoop data loads
· Create and update test cases in Zephyr and log defects in Jira/ServiceNow
· Execute automated PySpark regression tests in CI/CD pipelines
· Perform basic data quality checks (completeness, duplicates) using PySpark, Hive scripts, or SQL queries
· Perform root-cause analysis for Hadoop job failures in AutoSys scheduling and collaborate on fixes with stakeholders
· Set up small-scale test data (<100 GB) in Hadoop
· Develop Unix shell scripts for PySpark framework setup and scheduling


Qualification and Specialization:

Bachelor of Science/Technology


Unique Experience from this Role:

The role offers hands-on exposure to building enterprise-scale PySpark solutions, spanning development, job automation, scheduling, governance, and production-ready data engineering practices.


Learning outcomes for the Trainee:

· Develop and optimize PySpark DataFrame-based ETL pipelines using structured and semi-structured data sources for large-scale processing.
· Design reusable PySpark frameworks with effective use of transformations, actions, joins, window functions, and performance tuning techniques.
· Build end-to-end batch data pipelines that implement business functionality and data enrichment logic.
· Create Unix shell scripts to drive PySpark executions with runtime arguments, logging, and failure controls.
· Operate and manage scheduled Spark workloads through AutoSys with job sequencing and dependency awareness.
· Ensure code quality, scalability, and operational stability through configuration management, debugging, and adherence to enterprise standards.