Python Code: Apply Preprocessing Techniques
import os
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import KBinsDiscretizer
# Get the current working directory
current_directory = os.getcwd()
# Construct the file path
file_name = "exam-results-list-excel-table.xlsx"
file_path = os.path.join(current_directory, file_name)
# Load the dataset from the Excel file
data = pd.read_excel(file_path)
# Save a copy of the original dataset
data.to_csv(os.path.join(current_directory, "original_dataset.csv"), index=False)
# Aggregation (Example: aggregate a numeric column per group)
aggregated_data = data.groupby("Exam Name").agg({"Points": "mean"})
aggregated_data.to_csv(os.path.join(current_directory, "aggregated_dataset.csv"))
# Sampling (Example: random sampling)
sampled_data = data.sample(n=5, random_state=42)  # Randomly sample 5 rows
sampled_data.to_csv(os.path.join(current_directory, "sampled_dataset.csv"), index=False)
# Dimensionality Reduction (Example: PCA on the numeric columns)
# PCA cannot produce more components than there are input features,
# so fit it on all numeric columns and cap n_components accordingly.
numeric_data = data.select_dtypes(include="number")
pca = PCA(n_components=min(2, numeric_data.shape[1]))
reduced_data = pca.fit_transform(numeric_data)
pca_columns = [f"PCA_Component_{i + 1}" for i in range(reduced_data.shape[1])]
pd.DataFrame(reduced_data, columns=pca_columns).to_csv(os.path.join(current_directory, "reduced_dataset.csv"), index=False)
# Feature subset selection (Example: SelectKBest)
# f_regression only accepts numeric features, so keep the numeric columns
# and exclude the target 'Points' from the candidate feature set.
X = data.select_dtypes(include="number").drop(columns=["Points"])
y = data["Points"]
selector = SelectKBest(score_func=f_regression, k=min(2, X.shape[1]))
selected_features = selector.fit_transform(X, y)
pd.DataFrame(selected_features, columns=X.columns[selector.get_support()]).to_csv(os.path.join(current_directory, "selected_features.csv"), index=False)
# Feature creation (Example: text data to a bag-of-words representation)
text_data = data["Student Name"]
vectorizer = CountVectorizer()
text_features = vectorizer.fit_transform(text_data)
pd.DataFrame(text_features.toarray(), columns=vectorizer.get_feature_names_out()).to_csv(os.path.join(current_directory, "text_features.csv"), index=False)
# Discretization and Binarization (Example: KBinsDiscretizer)
discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform")
discretized_data = discretizer.fit_transform(data[["Points"]])
pd.DataFrame(discretized_data, columns=["Discretized_Points"]).to_csv(os.path.join(current_directory, "discretized_dataset.csv"), index=False)
# Attribute Transformation (Example: standardization)
scaler = StandardScaler()
transformed_data = scaler.fit_transform(data[["Points"]])
pd.DataFrame(transformed_data, columns=["Standardized_Points"]).to_csv(os.path.join(current_directory, "transformed_dataset.csv"), index=False)
Write a documented report that describes the dataset and how it changes after each pre-processing technique is applied; a sketch for generating such a report follows below.
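A starting point for that report can be produced directly from the script's own outputs. The sketch below is one possible approach, not part of the original script: it reuses the data frame and the CSV files written above, and the report file name "preprocessing_report.txt" is a hypothetical choice.
# Minimal report sketch (assumption): summarize the original dataset and each saved output file.
report_lines = [
    "Pre-processing report",
    f"Original dataset: {data.shape[0]} rows x {data.shape[1]} columns",
    f"Columns: {', '.join(data.columns)}",
    "",
]
output_files = [
    "aggregated_dataset.csv",
    "sampled_dataset.csv",
    "reduced_dataset.csv",
    "selected_features.csv",
    "text_features.csv",
    "discretized_dataset.csv",
    "transformed_dataset.csv",
]
for name in output_files:
    frame = pd.read_csv(os.path.join(current_directory, name))
    report_lines.append(f"{name}: {frame.shape[0]} rows x {frame.shape[1]} columns ({', '.join(frame.columns)})")
# The report file name below is an arbitrary, illustrative choice.
with open(os.path.join(current_directory, "preprocessing_report.txt"), "w") as report_file:
    report_file.write("\n".join(report_lines))
This only records row and column counts per output; the written report should still explain, in prose, what each technique did to the data.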