The vendor should manage the existing data pipelines built for data ingestion.
Create and manage new data pipelines for newly ingested data, following best practices.
Continuously monitor incremental loads ingested through Change Data Capture (CDC).
Analyse and fix any failed scheduled batch job so that the missed data is captured.
Maintain and continuously update the technical documentation of the ingested data, and maintain the centralized data dictionary with the necessary data classifications.
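As an illustration of the incremental-load and failed-batch requirements above, the following is a minimal sketch of applying CDC events idempotently, so that a batch can be rerun after a failure without duplicating changes. The event layout (`seq`, `op`, `key`, `row`) and the function name `apply_changes` are assumptions made for this example and are not part of any specific CDC tool.

```python
def apply_changes(target, events, last_seq=0):
    """Apply CDC events to `target` (a dict keyed by primary key).

    Only events with a sequence number greater than `last_seq` are applied,
    so rerunning a failed batch does not apply the same change twice.
    Returns the new high-water mark (the last sequence number applied).
    """
    for event in sorted(events, key=lambda e: e["seq"]):
        if event["seq"] <= last_seq:
            continue  # already applied in an earlier run
        if event["op"] in ("insert", "update"):
            target[event["key"]] = event["row"]
        elif event["op"] == "delete":
            target.pop(event["key"], None)
        else:
            raise ValueError(f"unknown operation: {event['op']!r}")
        last_seq = event["seq"]
    return last_seq


# Hypothetical sample batch, for illustration only.
sample_events = [
    {"seq": 1, "op": "insert", "key": "a", "row": {"v": 1}},
    {"seq": 2, "op": "insert", "key": "b", "row": {"v": 1}},
    {"seq": 3, "op": "update", "key": "a", "row": {"v": 2}},
    {"seq": 4, "op": "delete", "key": "b", "row": None},
]
```

Persisting the returned high-water mark alongside the target is what would let a rescheduled job resume from the point of failure; where that value is stored is left open here.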
2 Data Extraction and Cleaning
Data extracted from the data sources is to be cleaned and ingested into the big data platform.
Automated data cleaning has to be defined before ingestion.
Data cleaning should handle missing data, remove outliers, and resolve inconsistencies.
Data quality checks have to be performed in terms of accuracy, completeness, consistency, timeliness, believability, and interpretability.
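The cleaning steps and the completeness check listed above could be sketched as follows. This is only one possible approach, under stated assumptions: missing numeric values are filled with the median, outliers are dropped using the 1.5 x IQR rule, and inconsistencies are limited to whitespace and letter case in a text field. The function and field names are hypothetical.

```python
from statistics import median, quantiles


def clean_records(records, numeric_field, text_field):
    """Return cleaned copies of `records` (a list of dicts)."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        value = rec.get(text_field)
        if isinstance(value, str):
            # Resolve inconsistencies: trim whitespace, normalise case.
            rec[text_field] = value.strip().upper()
        cleaned.append(rec)

    # Handle missing data: fill gaps with the median of the known values.
    known = [r[numeric_field] for r in cleaned if r.get(numeric_field) is not None]
    fill = median(known)
    for rec in cleaned:
        if rec.get(numeric_field) is None:
            rec[numeric_field] = fill

    # Remove outliers: keep values within 1.5 * IQR of the quartiles.
    q1, _, q3 = quantiles(known, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in cleaned if low <= r[numeric_field] <= high]


def completeness(records, field):
    """Fraction of records in which `field` is present and not None."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get(field) is not None) / len(records)


# Hypothetical sample data, for illustration only.
raw = [
    {"city": " dubai", "amount": 10},
    {"city": "Dubai", "amount": 11},
    {"city": "DUBAI ", "amount": 12},
    {"city": "dubai", "amount": None},
    {"city": "Dubai", "amount": 13},
    {"city": "dubai", "amount": 1000},
]
```

Measuring completeness on the raw data before cleaning, and again afterwards, gives a simple before/after quality indicator. The other dimensions (accuracy, timeliness, believability, interpretability) need reference data or business rules and are not covered by this sketch.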
3 Data Integration, Aggregation and Representation
Expose data views or data models to reporting and source systems using Hive, Impala, or similar tools provided by us.
Expose cleansed data to the Artificial Intelligence team for building data science models.
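As a rough illustration of exposing a view, the sketch below builds a `CREATE VIEW` statement that selects only an approved list of columns, which is one way the data classifications in the data dictionary might be enforced. The helper `build_view_ddl` and all database, table, and column names are hypothetical. The generated statement uses the `CREATE VIEW IF NOT EXISTS ... AS SELECT` form accepted by Hive and Impala, and executing it against a cluster is left out.

```python
import re

# Accept only plain identifiers so names cannot inject extra SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_view_ddl(database, view, table, columns):
    """Build a CREATE VIEW statement exposing only the given columns."""
    if not columns:
        raise ValueError("at least one column is required")
    for name in (database, view, table, *columns):
        if not _IDENT.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"CREATE VIEW IF NOT EXISTS `{database}`.`{view}` AS "
        f"SELECT {column_list} FROM `{database}`.`{table}`"
    )


# Hypothetical names, for illustration only.
ddl = build_view_ddl("analytics", "v_orders_public", "orders", ["order_id", "amount"])
```

Keeping the approved column list in the data dictionary, rather than in the code, would let the same helper regenerate views whenever a classification changes.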
The required skills include, but are not limited to, the following.
Skill Set:
Expertise in big data querying tools such as Hive, HBase, and Impala.
Expertise in SQL, including writing complex queries and views, partitioning, and bucketing.
Strong experience in Spark and Storm.
Expertise in monitoring tools such as Grafana, Prometheus, and Zabbix.
Hands-on experience in managing Hadoop and Pentaho.
Expertise in storage layers and databases such as HDFS, Cassandra, and OpenTSDB.
Expertise in event-driven architecture.