The Complete Beginner's Guide to Data Wrangling for Machine Learning in Python

Data wrangling is a broad term used, often informally, to describe the process of converting raw data to a clean and organized format ready for use. For us, data wrangling is only one step in preprocessing our data, but it is a big step. The most basic data structure used to “wrangle” data is the data frame, which can be both intuitive and incredibly versatile. Data frames are tabular, meaning that they are based on rows and columns like you would see in a spreadsheet. Here is a data frame generated from data about passengers on the Titanic.

There are three major things to notice in this data frame. First, in a data frame every row corresponds to one observation (e.g., a passenger) and every column corresponds to one feature (gender, age, etc.). For example, by looking at the first observation we can see that Miss Elisabeth Walton Allen stayed in first class, was 29 years old, was female, and survived the disaster.

Second, each column has a name (e.g., Name, PClass, Age) and each row has an index number (e.g., 0 for the lucky Miss Elisabeth Walton Allen). We will use these to select and manipulate observations and features.

Third, two columns, Sex and SexCode, contain the same data in different formats. In Sex, a woman is indicated by the string female, while in SexCode, a woman is indicated by the integer 1. We will want all our features to be unique, and therefore we will need to remove one of these columns.

In this tutorial, we will cover a wide variety of techniques to manipulate data frames using the pandas library, with the goal of creating a clean, well-structured set of observations for further preprocessing.


≡ Creating a Data Frame


pandas has many ways of creating a new DataFrame object. One simple method is to create an empty data frame using DataFrame and then define each column separately:
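
A minimal sketch of that approach follows. The column names and values are made up purely for illustration and do not come from the Titanic data.

```python
import pandas as pd

# Start with an empty DataFrame
dataframe = pd.DataFrame()

# Define each column separately (hypothetical values, for illustration only)
dataframe['Name'] = ['Passenger A', 'Passenger B']
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]

# Show the DataFrame
print(dataframe)
```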


≡ Analysis


pandas offers what can feel like an infinite number of ways to create a DataFrame. In the real world, creating an empty DataFrame and then populating it will almost never happen. Instead, our DataFrames will be created from real data we have loaded from other sources (e.g., a CSV file or database).
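
As a rough sketch of the more realistic case, here is how loading from a CSV source might look. To keep the example self-contained, an in-memory string stands in for a file on disk; the rows are invented for illustration.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk (made-up rows, for illustration only)
csv_text = """Name,Age,Driver
Passenger A,38,True
Passenger B,25,False
"""

# read_csv accepts a path or any file-like object
dataframe = pd.read_csv(io.StringIO(csv_text))

# Show the DataFrame
print(dataframe)
```

With a real file you would pass its path to read_csv instead of the StringIO object.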


≡ Describing the Data



≡ Analysis


After we load some data, it is a good idea to understand how it is structured and what kind of information it contains. Ideally, we would view the full data directly. But with most real-world problems, the data could have thousands to hundreds of thousands to millions of rows and columns.

Instead, we have to rely on pulling samples to view small slices and on calculating summary statistics of the data. Here we are working with a toy dataset of the passengers of the Titanic on her last voyage. Using head we can take a look at the first few rows (five by default) of the data.

Alternatively, we can use tail to view the last few rows. With shape, we can see how many rows and columns our DataFrame contains. And lastly, with describe we can see some basic descriptive statistics for any numerical column. It is worth noting that summary statistics do not always tell the full story.

For example, pandas treats the columns Survived and SexCode as numeric columns because they contain 1s and 0s. But, in this case, the numerical values represent categories. For example, if Survived equals 1, it means that the passenger survived the disaster. For this reason, some of the summary statistics given don’t make sense, such as the standard deviation of the SexCode column (an indicator of the passenger’s gender).
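
The calls discussed above can be sketched as follows. The DataFrame here is a small stand-in, not the real Titanic file: only the first row reflects the passenger described earlier, and the remaining rows are made up.

```python
import pandas as pd

# Toy stand-in for the Titanic data (rows after the first are hypothetical)
dataframe = pd.DataFrame({
    'Name': ['Allen, Miss Elisabeth Walton', 'Passenger B', 'Passenger C',
             'Passenger D', 'Passenger E', 'Passenger F'],
    'PClass': ['1st', '1st', '2nd', '3rd', '3rd', '3rd'],
    'Age': [29.0, 2.0, 30.0, 25.0, 40.0, 18.0],
    'Sex': ['female', 'female', 'male', 'female', 'male', 'male'],
    'Survived': [1, 0, 0, 1, 0, 1],
    'SexCode': [1, 1, 0, 1, 0, 0],
})

# Show the first two rows
first_rows = dataframe.head(2)

# Show the last two rows
last_rows = dataframe.tail(2)

# Show dimensions as (rows, columns)
dimensions = dataframe.shape

# Show descriptive statistics for the numeric columns
summary = dataframe.describe()

print(first_rows, last_rows, dimensions, summary, sep='\n\n')
```

Note that describe reports a mean and standard deviation for Survived and SexCode even though both are really category codes.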


≡ Navigating DataFrames



≡ Analysis


Every row in a pandas DataFrame has a unique index value. By default, this index is an integer indicating the row position in the DataFrame; however, it does not have to be. DataFrame indexes can be set to be unique alphanumeric strings or customer numbers. To select individual rows and slices of rows, pandas provides two methods:

>> loc is useful when the index of the DataFrame is a label (e.g., a string).
>> iloc works by looking for the position in the DataFrame. For example, iloc[0] will return the first row regardless of whether the index is an integer or a label.

It is helpful to be comfortable with both loc and iloc since they will come up a lot during data cleaning.
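
A short sketch of the difference, using a made-up three-row DataFrame (only the first name comes from the Titanic example above):

```python
import pandas as pd

# Toy data (rows after the first are hypothetical)
dataframe = pd.DataFrame({
    'Name': ['Allen, Miss Elisabeth Walton', 'Passenger B', 'Passenger C'],
    'Age': [29.0, 30.0, 25.0],
})

# Select the first row by position
first_row = dataframe.iloc[0]

# Select the rows at positions 0 and 1 (the end position is excluded)
first_two = dataframe.iloc[0:2]

# Use the passenger names as the index
by_name = dataframe.set_index('Name')

# Select a row by its label
row_by_label = by_name.loc['Allen, Miss Elisabeth Walton']

# iloc still selects by position, whatever the index looks like
row_by_position = by_name.iloc[0]

print(row_by_label)
```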


≡ Selecting Rows Based on Conditionals



≡ Analysis


Conditionally selecting and filtering data is one of the most common tasks in data wrangling. You rarely want all the raw data from the source; instead, you are interested in only some subsection of it. For example, you might only be interested in stores in certain states or the records of patients over a certain age.
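
As a sketch, conditional selection in pandas is usually written with a boolean mask. The data below is a made-up stand-in for the Titanic columns.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C', 'Passenger D'],
    'Sex': ['female', 'female', 'male', 'female'],
    'Age': [29.0, 2.0, 30.0, 25.0],
})

# Keep only the rows where Sex is 'female'
females = dataframe[dataframe['Sex'] == 'female']

# Combine conditions with & (and) or | (or); each condition needs parentheses
older_females = dataframe[(dataframe['Sex'] == 'female') & (dataframe['Age'] >= 28)]

print(older_females)
```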


≡ Replacing Values



≡ Analysis


replace is the tool we use to replace values. It is simple to use, yet it also has the powerful ability to accept regular expressions.
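
A brief sketch of the three common forms, on made-up data. The replacement strings ('Woman', 'Man', 'First') are arbitrary choices for the example.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Sex': ['female', 'male', 'female', 'male'],
    'PClass': ['1st', '2nd', '3rd', '1st'],
})

# Replace a single value
one_replaced = dataframe['Sex'].replace('female', 'Woman')

# Replace several values at once: the lists are matched up pairwise
both_replaced = dataframe['Sex'].replace(['female', 'male'], ['Woman', 'Man'])

# Replace using a regular expression
regex_replaced = dataframe['PClass'].replace(r'1st', 'First', regex=True)

print(regex_replaced)
```

Each call returns a new Series; the original dataframe is left untouched.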


≡ Renaming Columns



≡ Analysis


Using rename with a dictionary as an argument to the columns parameter is my preferred way to rename columns because it works with any number of columns. If we want to rename all columns at once, this handy snippet of code creates a dictionary with the old column names as keys and empty strings as values:

# Load library
import collections

# Create dictionary
column_names = collections.defaultdict(str)

# Create keys
for name in dataframe.columns:
    column_names[name]

# Show dictionary
column_names

defaultdict(str,
            {'Age': '',
             'Name': '',
             'PClass': '',
             'Sex': '',
             'SexCode': '',
             'Survived': ''})
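
For completeness, here is a minimal sketch of the rename call itself. The new column names ('Passenger Class', 'Gender') are arbitrary choices for the example, and the data is made up.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B'],
    'PClass': ['1st', '3rd'],
    'Sex': ['female', 'male'],
})

# Rename two columns; keys are old names, values are new names
renamed = dataframe.rename(columns={'PClass': 'Passenger Class', 'Sex': 'Gender'})

print(renamed.columns.tolist())
```

Columns not mentioned in the dictionary keep their names, and the original dataframe is unchanged.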

≡ Finding the Minimum, Maximum, Sum, Average, and Count



≡ Analysis


In addition to the statistics used in the solution, pandas offers variance (var), standard deviation (std), kurtosis (kurt), skewness (skew), standard error of the mean (sem), mode (mode), median (median), and a number of others.
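
A small sketch of these methods on a made-up Age column:

```python
import pandas as pd

# Toy ages (hypothetical values, for illustration only)
ages = pd.Series([29.0, 2.0, 30.0, 25.0], name='Age')

# Basic statistics named in the section title
stats = {
    'max': ages.max(),
    'min': ages.min(),
    'mean': ages.mean(),
    'sum': ages.sum(),
    'count': ages.count(),
}

# A few of the additional statistics mentioned above
extra = {
    'median': ages.median(),
    'var': ages.var(),
    'std': ages.std(),
}

print(stats)
print(extra)
```

The same methods can be called on a whole DataFrame, in which case they are computed column by column.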


≡ Finding Unique Values



≡ Analysis


Both unique and value_counts are helpful for manipulating and exploring categorical columns. Very often in categorical columns, there will be classes that need to be handled in the data wrangling phase. For example, in the Titanic dataset, PClass is a column recording the class of a passenger’s ticket. There were three classes on the Titanic; however, if we use value_counts we can see a problem:

# Show counts
dataframe['PClass'].value_counts()
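
To show what these methods return in general, here is a sketch on a made-up PClass column. It is only a toy stand-in and does not reproduce the counts from the real dataset.

```python
import pandas as pd

# Toy ticket classes (hypothetical values, for illustration only)
dataframe = pd.DataFrame({
    'PClass': ['1st', '1st', '2nd', '3rd', '3rd', '3rd'],
})

# Array of the distinct values, in order of first appearance
unique_classes = dataframe['PClass'].unique()

# Number of rows per distinct value, most frequent first
class_counts = dataframe['PClass'].value_counts()

# Number of distinct values
number_of_classes = dataframe['PClass'].nunique()

print(class_counts)
```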

≡ Handling Missing Values



≡ Analysis


Missing values are a ubiquitous problem in data wrangling, yet many underestimate the difficulty of working with missing data. pandas uses NumPy’s NaN value to denote missing values, but it is important to note that NaN is not fully implemented natively in pandas.
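
One example of that difficulty, sketched on made-up data: because NaN never compares equal to anything (including itself), an ordinary equality test silently finds nothing, so isnull is needed instead.

```python
import numpy as np
import pandas as pd

# Toy data with one missing age (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C'],
    'Age': [29.0, np.nan, 30.0],
})

# Select rows where Age is missing
missing_age = dataframe[dataframe['Age'].isnull()]

# A naive equality test matches no rows, because NaN != NaN
naive_match = dataframe[dataframe['Age'] == np.nan]

print(missing_age)
print(len(naive_match))
```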


≡ Deleting a Column



≡ Analysis


drop is the idiomatic method of deleting a column. An alternative method is del dataframe['Age'], which works most of the time but is not recommended because of how it is called within pandas. One habit I recommend learning is to never use pandas’ inplace=True argument. Many pandas methods include an inplace parameter, which if True edits the DataFrame directly. This can lead to problems in more complex data processing pipelines because we are treating the DataFrames as mutable objects (which they technically are). I recommend treating DataFrames as immutable objects. For example:

# Create a new DataFrame
dataframe_name_dropped = dataframe.drop(dataframe.columns[0], axis=1)

In this example, we are not mutating the DataFrame dataframe but instead are making a new DataFrame, called dataframe_name_dropped, that is an altered version of dataframe. If you treat your DataFrames as immutable objects, you will save yourself a lot of headaches down the road.
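
A sketch of dropping columns by name in the same non-mutating style, on made-up data:

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B'],
    'Age': [29.0, 30.0],
    'Sex': ['female', 'male'],
})

# Drop one column; axis=1 means "columns"
age_dropped = dataframe.drop('Age', axis=1)

# Drop several columns at once by passing a list
age_sex_dropped = dataframe.drop(['Age', 'Sex'], axis=1)

# The original DataFrame still has all three columns
print(dataframe.columns.tolist())
print(age_sex_dropped.columns.tolist())
```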


≡ Dropping Duplicate Rows



≡ Grouping Rows by Values



≡ Analysis


groupby is where data wrangling really starts to take shape. It is very common to have a DataFrame where each row is a person or an event and we want to group them according to some criterion and then calculate a statistic. For example, you can imagine a DataFrame where every row is an individual sale at a national restaurant chain and we want the total sales per restaurant. We can achieve this by grouping rows by individual restaurants and then calculating the sum of each group.

Users new to groupby often write a line like this and are confused by what is returned:

# Group rows
dataframe.groupby('Sex')

Why didn’t it return something more useful? The reason is that groupby needs to be paired with some operation we want to apply to each group, such as calculating an aggregate statistic (e.g., mean, median, sum). When talking about grouping we usually use shorthand and say “group by gender,” but that is incomplete. For grouping to be useful, we need to group by something and then apply a function to each of those groups:

# Group rows, count rows
dataframe.groupby('Survived')['Name'].count()

Notice Name added after the groupby? That is because particular summary statistics are only meaningful for certain types of data. For example, while calculating the average age by gender makes sense, calculating the total age by gender does not. In this case we group the data into survived or not, then count the number of names (i.e., passengers) in each group. We can also group by a first column, then group that grouping by a second column:

# Group rows, calculate mean
dataframe.groupby(['Sex','Survived'])['Age'].mean()
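
The three groupby patterns above can be tried end to end with the following sketch. The rows are a made-up stand-in for the Titanic data, so the numbers are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C',
             'Passenger D', 'Passenger E', 'Passenger F'],
    'Sex': ['female', 'female', 'male', 'female', 'male', 'male'],
    'Age': [29.0, 2.0, 30.0, 25.0, 40.0, 18.0],
    'Survived': [1, 0, 0, 1, 0, 1],
})

# On its own, groupby only returns a grouping object, not a result
grouped = dataframe.groupby('Sex')

# Group by survival, then count the names in each group
survival_counts = dataframe.groupby('Survived')['Name'].count()

# Group by two columns, then take the mean age of each combination
mean_age = dataframe.groupby(['Sex', 'Survived'])['Age'].mean()

print(type(grouped).__name__)
print(survival_counts)
print(mean_age)
```

The two-column result is indexed by (Sex, Survived) pairs, so a single value can be read with something like mean_age.loc[('female', 1)].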

≡ Grouping Rows by Time



≡ Analysis


Our standard Titanic dataset does not contain a datetime column, so for this example we have created a simple DataFrame where each row is an individual sale. For every sale, we know its date and time and its dollar amount. The raw data looks like this:

# Show three rows
dataframe.head(3)

Notice that the date and time of each sale is the index of the DataFrame; this is because resample requires the index to be datetime-like values. Using resample we can group the rows by a wide array of time periods (offsets) and then we can calculate some statistic on each time group:

# Group by two weeks, calculate mean
dataframe.resample('2W').mean()
# Group by month, count rows
dataframe.resample('M').count()

You might notice that in the two outputs the datetime index is a date despite the fact that we are grouping by weeks and months, respectively. The reason is that by default resample returns the label of the right “edge” (the last label) of the time group. We can control this behavior using the label parameter:

# Group by month, count rows
dataframe.resample('M', label='left').count()
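
Since the sales DataFrame itself is not shown here, the following is a self-contained sketch with made-up data: one sale per day for two weeks, with a hypothetical Sale_Amount column. It uses weekly rather than monthly groups to keep the example small.

```python
import pandas as pd

# Fourteen consecutive days starting on Monday, 2017-06-05
time_index = pd.date_range('2017-06-05', periods=14, freq='D')

# Toy sales data (hypothetical amounts 1..14), indexed by date
dataframe = pd.DataFrame({'Sale_Amount': range(1, 15)}, index=time_index)

# Group by week, sum the sales; each group is labeled by its right edge
weekly_total = dataframe.resample('W').sum()

# Same groups, but labeled by the left edge instead
weekly_total_left = dataframe.resample('W', label='left').sum()

print(weekly_total)
print(weekly_total_left)
```

The totals are identical in both results; only the dates used to label each week differ.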

≡ Looping Over a Column



≡ Analysis


In addition to loops (often called for loops), we can also use list comprehensions:

# Show first two names uppercased
[name.upper() for name in dataframe['Name'][0:2]]

Despite the temptation to fall back on for loops, a more Pythonic solution would use pandas’ apply method, described in the next section.
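
For comparison, here is a sketch of the plain for loop next to the equivalent list comprehension, using made-up names:

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C'],
})

# A column can be treated like any other Python sequence in a for loop
uppercased_loop = []
for name in dataframe['Name'][0:2]:
    uppercased_loop.append(name.upper())

# The same result with a list comprehension
uppercased_comprehension = [name.upper() for name in dataframe['Name'][0:2]]

print(uppercased_loop)
```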


≡ Applying a Function Over All Elements in a Column



≡ Analysis


apply is an excellent way to do data cleansing and wrangling. It is common to write a function to perform some useful operation (separate first and last names, convert strings to floats, etc.) and then map that function to each element in a column.
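
A minimal sketch of that pattern, with a hypothetical uppercase helper and made-up names:

```python
import pandas as pd

# Toy data (hypothetical rows, for illustration only)
dataframe = pd.DataFrame({
    'Name': ['Passenger A', 'Passenger B', 'Passenger C'],
})


def uppercase(text):
    """Return the given string in upper case (illustrative helper)."""
    return text.upper()


# Apply the function to every element in the column
uppercased = dataframe['Name'].apply(uppercase)

print(uppercased)
```

apply returns a new Series, so the result can be assigned to a new column or to a new DataFrame without mutating the original.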
