Concepts of Data Preprocessing

Topic: Distance Learning and E-Learning
Published July 4, 2020

Data pre-processing is a data mining technique used to transform raw data into a useful format.

Steps Involved in Data Pre-processing:

1. Data Cleaning

“The idea of imputation is both seductive and dangerous” (R.J.A Little & D.B. Rubin)

One of the most common problems I have faced in exploratory analysis is handling missing values. I feel that there is no single good way to deal with missing data. There are many imputation methods, suited to different kinds of problems (time series analysis, ML, regression and so on), which makes it hard to choose between them. So let's explore the most commonly used methods and try to find some that fit our needs.

Data can have many irrelevant and missing parts. Data cleaning handles these; it involves dealing with missing data, noisy data and so on.

Missing Data

This situation arises when some data is missing from our dataset. Before jumping to the various methods of handling missing data, we have to understand why data goes missing.

Data Goes Missing Randomly: The reason a data point is missing is unrelated to the dataset. It depends neither on the value that is missing nor on the values of other variables. (Little and Rubin call this condition "missing completely at random".)

Data Goes Missing Not Randomly: There are two possible cases. The missingness depends either on the hypothetical (unobserved) value itself or on some other variable's values. People with high salaries generally do not want to reveal their incomes in surveys. This is an example of the first case: we can suppose that the missing value was actually quite large and fill it with some hypothetical value. An example of the second case is a survey in which women are less willing than men to reveal their age. Here, missingness in the age column depends on the gender column.

If data goes missing randomly, it is safe to remove the tuples that contain missing values, while in the other case removing observations with missing values can bias the model.
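Before choosing between deletion and imputation, it can help to look at how the missing values are distributed. The following is a minimal sketch using pandas; the `survey` table, its column names and its values are made up for illustration and are not from the original article.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; column names and values are made up for illustration.
survey = pd.DataFrame({
    "gender": ["F", "F", "F", "M", "M", "M"],
    "age": [np.nan, 31.0, np.nan, 45.0, 28.0, 39.0],
    "salary": [52000.0, 48000.0, 61000.0, np.nan, 55000.0, 47000.0],
})

# Number of missing values in each column.
missing_per_column = survey.isna().sum()

# Share of missing ages within each gender group. A large gap between groups
# suggests that missingness in "age" depends on "gender", i.e. it is not random.
missing_age_by_gender = survey["age"].isna().groupby(survey["gender"]).mean()

print(missing_per_column)
print(missing_age_by_gender)
```

A check like this can only reveal dependence on observed columns. Whether missingness depends on the unobserved value itself (as in the salary example) cannot be read from the data alone.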
So we have to be quite careful when removing tuples.

P.S. Data imputation does not guarantee better results.

Dropping Observations

Tuple deletion removes all data for an observation that has one or more missing values. If the missing data is limited to a small number of observations, you may just opt to eliminate those cases from the dataset. However, in most cases this can bias the analysis, because we can never be totally sure that the data has gone missing randomly.

    mydata.dropna(inplace=True)

Dropping Variables

Keeping data is generally better than discarding it. Sometimes you can drop variables (columns) if more than 60% of the rows are missing a value for that column, but only if the column is insignificant. Even so, dropping tuples is usually preferable to dropping columns.

    # Either of these removes the column (use one, not both)
    del mydata['column_name']
    mydata.drop('column_name', axis=1, inplace=True)

Fill the Missing Values

There are various ways to do this. You can fill the missing values manually, using the mean, mode or median.

Using the overall mean, median or mode is a very straightforward imputation method. It is fast to perform, but has clear disadvantages, one of them being that mean imputation reduces the variance in the dataset.

    import numpy as np
    # SimpleImputer replaces sklearn.preprocessing.Imputer, which was removed in scikit-learn 0.22
    from sklearn.impute import SimpleImputer

    values = mydata.values
    # strategy can be changed to 'median' or 'most_frequent'
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    transformed_values = imputer.fit_transform(values)

Regression

Data can be made smooth by fitting it to a regression function. The regression can be linear (one independent variable) or multiple (several independent variables).

To start, the most significant variables are identified using a correlation matrix. They are used as the independent variables in a regression equation. The dependent variable is the one that has missing values.
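The regression approach can be sketched as follows: fit a model on the rows where the dependent variable is observed, then predict it for the rows where it is missing. This is only an illustration under assumptions. The `data` table and its columns are hypothetical, the single predictor is chosen by hand rather than from a correlation matrix, and scikit-learn's `LinearRegression` stands in for whatever regression model suits the data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: "income" has missing values, "experience" is complete.
data = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "income": [30.0, 35.0, np.nan, 45.0, np.nan, 55.0],
})

# Boolean mask of rows where the dependent variable is observed.
observed = data["income"].notna()

# Fit the regression on rows where the dependent variable is observed.
model = LinearRegression()
model.fit(data.loc[observed, ["experience"]], data.loc[observed, "income"])

# Predict the dependent variable for rows where it is missing.
imputed = data.copy()
imputed.loc[~observed, "income"] = model.predict(data.loc[~observed, ["experience"]])

print(imputed)
```

Because the filled-in values lie exactly on the fitted line, they carry no residual noise, which is one reason imputed data can look more consistent than it really is.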
Tuples with complete data are used to fit the regression equation; the equation is then used to predict the missing values of the dependent variable.

This provides good estimates for missing values. However, the method has several disadvantages that tend to overshadow the advantages. First, because the inserted values were predicted from other variables, they fit together too well, so standard errors are biased downward. Second, we assume a linear relationship between the variables in the regression equation when there may not be one.

KNN (K Nearest Neighbours)

In this method, k neighbours are chosen based on some distance measure, and their average is used as a hypothetical value to fill in the missing data. KNN can predict both discrete values (the most frequent value among the k nearest neighbours) and continuous values (the mean among the k nearest neighbours).

Different distance measures are used depending on the type of data:

Continuous Data: The most commonly used distance measures are Euclidean, Manhattan and cosine.

Categorical Data: Hamming distance is generally used for categorical imputation. It iterates over all the categorical attributes and, for each one, counts one if the value differs between the two tuples. The Hamming distance is the number of attributes for which the values differ.

One drawback of the KNN algorithm is that it becomes time-consuming on large datasets, because it searches the entire dataset for similar instances.
Moreover, with high-dimensional data KNN's accuracy can degrade severely, because there is little difference between the nearest and the farthest neighbour in many dimensions.

    from fancyimpute import KNN

    # Use the 5 nearest rows which have a feature to fill in each row's missing features
    # (recent fancyimpute releases use fit_transform(); older ones used complete())
    knnOutput = KNN(k=5).fit_transform(mydata)

Read Full Article Here: https://brain-mentors.com/concepts-of-data-pre-processing/
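The same k-nearest-neighbours idea can also be sketched with scikit-learn's `KNNImputer` (available from scikit-learn 0.22), for readers who prefer not to install fancyimpute. The array `X` and the choice of `n_neighbors=2` are assumptions made for illustration; this is not the method used in the original article.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical numeric data with one missing entry (np.nan).
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])

# Each missing entry is replaced by the mean of that feature over the
# n_neighbors nearest rows, using a Euclidean distance that skips NaNs.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
X_filled = imputer.fit_transform(X)

print(X_filled)
```

Here the two rows nearest to the incomplete one are the first and third, so the missing entry becomes the mean of 2.0 and 6.0. `KNNImputer` handles numeric features only; categorical columns would need encoding or a different distance such as Hamming.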
