[雪峰磁针石博客] Python module introduction — pandas for data analysis: data cleaning and preparation


Note: this article is reposted from https://my.oschina.net/u/1433482/blog/1623037 in order to share information, for learning and exchange only. If there is any infringement, please contact me and I will remove it promptly.

Data Cleaning and Preparation

During data analysis and modeling, a large share of the time (80% or more) goes into data preparation: loading, cleaning, transforming, and rearranging. Sometimes the way data is stored in a file or database is simply not the right format for a particular task.

This chapter discusses missing data, duplicate data, string operations, and some other analytical data transformations.

Handling Missing Data

For numeric data, pandas uses the floating-point value NaN (Not a Number) to represent missing values.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]: string_data
Out[4]: 
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]: string_data.isnull()
Out[5]: 
0    False
1    False
2     True
3    False
dtype: bool

In [6]: string_data[0] = None

In [7]: string_data.isnull()
Out[7]: 
0     True
1    False
2     True
3    False
dtype: bool

NA handling methods

Missing data is referred to as NA (not available); Python's built-in None is also treated as NA.

Argument  Description
dropna    Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna    Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull    Return boolean values indicating which values are missing/NA.
notnull   Negation of isnull.
In [8]: from numpy import nan as NA

In [9]: data = pd.Series([1, NA, 3.5, NA, 7])

In [10]: data.dropna()
Out[10]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [11]: data[data.notnull()]
Out[11]: 
0    1.0
2    3.5
4    7.0
dtype: float64

In [12]: data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
   ....: [NA, NA, NA], [NA, 6.5, 3.]])

In [13]: cleaned = data.dropna()

In [14]: data
Out[14]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

In [15]: cleaned
Out[15]: 
     0    1    2
0  1.0  6.5  3.0

In [16]: data.dropna(how='all')
Out[16]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0

In [17]: data[4] = NA

In [18]: data
Out[18]: 
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN

In [19]: data.dropna(axis='columns', how='all')
Out[19]: 
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

how='all' drops a row only when every value in it is NaN. The thresh parameter keeps only rows containing at least the given number of non-NA values.

In [21]: df = pd.DataFrame(np.random.randn(7, 3))

In [22]: df.iloc[:4, 1] = NA

In [23]: df.iloc[:2, 2] = NA

In [24]: df
Out[24]: 
          0         1         2
0 -0.843340       NaN       NaN
1 -1.305941       NaN       NaN
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [25]: df.dropna()
Out[25]: 
          0         1         2
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [26]: df.dropna(thresh=2)
Out[26]: 
          0         1         2
2  1.026378       NaN  2.176567
3  0.048885       NaN  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

fillna fills in missing values. You can fill column by column, fill with the previous row's value (forward fill), fill with the mean, and so on.

In [27]: df.fillna(0)
Out[27]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [28]: df.fillna({1: 0.5, 2: 0})
Out[28]: 
          0         1         2
0 -0.843340  0.500000  0.000000
1 -1.305941  0.500000  0.000000
2  1.026378  0.500000  2.176567
3  0.048885  0.500000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [29]: _ = df.fillna(0, inplace=True)

In [30]: df
Out[30]: 
          0         1         2
0 -0.843340  0.000000  0.000000
1 -1.305941  0.000000  0.000000
2  1.026378  0.000000  2.176567
3  0.048885  0.000000  0.012649
4  0.591212 -0.739625  1.017533
5  0.633873 -0.124162 -0.823495
6 -1.537827  0.802565  0.359058

In [31]: df = pd.DataFrame(np.random.randn(6, 3))

In [32]: df.iloc[2:, 1] = NA

In [33]: df.iloc[4:, 2] = NA

In [34]: df
Out[34]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679       NaN  1.272047
3  0.293133       NaN  0.273399
4  0.235116       NaN       NaN
5  1.365186       NaN       NaN

In [35]: df.fillna(method='ffill')
Out[35]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116  0.977842  0.273399
5  1.365186  0.977842  0.273399

In [36]: df.fillna(method='ffill', limit=2)
Out[36]: 
          0         1         2
0 -0.081265 -0.820770 -0.746845
1  1.150648  0.977842  0.861825
2  1.823679  0.977842  1.272047
3  0.293133  0.977842  0.273399
4  0.235116       NaN  0.273399
5  1.365186       NaN  0.273399

In [37]: data = pd.Series([1., NA, 3.5, NA, 7])

In [38]: data.fillna(data.mean())
Out[38]: 
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
Argument  Description
value     Scalar value or dict-like object to use to fill missing values
method    Interpolation; by default 'ffill' if function called with no other arguments
axis      Axis to fill on; default axis=0
inplace   Modify the calling object without producing a copy
limit     For forward and backward filling, maximum number of consecutive periods to fill
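The value argument in the table also accepts a dict-like object or a Series aligned on columns, which makes per-column mean imputation a one-liner. A minimal sketch (the frame df below is hypothetical, not the one from the session above):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing value in each column
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})

# df.mean() is a Series of column means; fillna aligns it on the columns,
# so each column is filled with its own mean
filled = df.fillna(df.mean())
```

Here column a is filled with (1 + 3) / 2 = 2.0 and column b with (2 + 4) / 2 = 3.0.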

Data Transformation

Removing Duplicates

In [39]: data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
   ....: 'k2': [1, 1, 2, 3, 3, 4, 4]})

In [40]: data
Out[40]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   4

In [41]: data.duplicated()
Out[41]: 
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

In [42]: data.drop_duplicates()
Out[42]: 
    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

In [43]: data['v1'] = range(7)

In [44]: data.drop_duplicates(['k1'])
Out[44]: 
    k1  k2  v1
0  one   1   0
1  two   1   1

In [45]: data.drop_duplicates(['k1', 'k2'], keep='last')
Out[45]: 
    k1  k2  v1
0  one   1   0
1  two   1   1
2  one   2   2
3  two   3   3
4  one   3   4
6  two   4   6


Code for this article: https://github.com/xurongzhong/mobile_data/

Latest version of this article: http://t.cn/R8tJ9JH

Transforming Data Using a Function or Mapping

import numpy as np

import pandas as pd

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
                              'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'],
                     'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

data
Out[5]: 
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

meat_to_animal = {
    'bacon': 'pig',
    'pulled pork': 'pig',
    'pastrami': 'cow',
    'corned beef': 'cow',
    'honey ham': 'pig',
    'nova lox': 'salmon'
}

lowercased = data['food'].str.lower()

lowercased
Out[8]: 
0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

data['animal'] = lowercased.map(meat_to_animal)

data
Out[10]: 
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[11]: 
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

Replacing Values

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: data = pd.Series([1., -999., 2., -999., -1000., 3.])

In [5]: data
Out[5]: 
0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

In [6]: data.replace(-999, np.nan)
Out[6]: 
0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

In [7]: data.replace([-999, -1000], np.nan)
Out[7]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [8]: data.replace([-999, -1000], [np.nan, 0])
Out[8]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

In [9]: data.replace({-999: np.nan, -1000: 0})
Out[9]: 
0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

Renaming Axis Indexes and Columns

In [2]: import pandas as pd

In [3]: import numpy as np

In [10]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....: index=['Ohio', 'Colorado', 'New York'],
   ....: columns=['one', 'two', 'three', 'four'])

In [11]: data
Out[11]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [5]: data.replace(4, 40)
Out[5]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado   40    5      6     7
New York    8    9     10    11

In [12]: transform = lambda x: x[:4].upper()

In [13]: data.index.map(transform)
Out[13]: Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [14]: data
Out[14]: 
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

In [15]: data.index = data.index.map(transform)

In [16]: data
Out[16]: 
      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

In [17]: data.rename(index=str.title, columns=str.upper)
Out[17]: 
      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

In [18]: data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})
Out[18]: 
         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

In [19]: data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [20]: data
Out[20]: 
         one  two  three  four
INDIANA    0    1      2     3
COLO       4    5      6     7
NEW        8    9     10    11

Discretization and Binning

Details omitted for now.
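Although the original article skips this topic, the core of binning in pandas is pd.cut and pd.qcut. A minimal sketch with hypothetical data (the ages and bin edges below are illustrative, not from the article):

```python
import numpy as np
import pandas as pd

# Hypothetical ages to bin, and explicit bin edges
ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# pd.cut assigns each value to a half-open interval: (18, 25], (25, 35], ...
cats = pd.cut(ages, bins)

# Count observations per bin
counts = pd.Series(cats).value_counts()

# pd.qcut cuts at sample quantiles instead, giving roughly equal-sized bins
quartiles = pd.qcut(np.random.randn(1000), 4)
```

pd.cut is the tool when you know the bin edges in advance; pd.qcut is preferable when you want each bin to hold about the same number of observations.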

String Manipulation

In [7]: data = {'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
   ...: 'Rob': 'rob@gmail.com', 'Wes': np.nan}

In [8]: data = pd.Series(data)

In [9]: data
Out[9]: 
Dave     dave@google.com
Rob        rob@gmail.com
Steve    steve@gmail.com
Wes                  NaN
dtype: object

In [10]: data.isnull()
Out[10]: 
Dave     False
Rob      False
Steve    False
Wes       True
dtype: bool

In [11]: data.str.contains('gmail')
Out[11]: 
Dave     False
Rob       True
Steve     True
Wes        NaN
dtype: object

In [12]: pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

In [13]: data.str.findall(pattern, flags=re.IGNORECASE)
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-13-085c16e4dbfe> in <module>()
----> 1 data.str.findall(pattern, flags=re.IGNORECASE)

NameError: name 're' is not defined

In [14]: import re

In [15]: data.str.findall(pattern, flags=re.IGNORECASE)
Out[15]: 
Dave     [(dave, google, com)]
Rob        [(rob, gmail, com)]
Steve    [(steve, gmail, com)]
Wes                        NaN
dtype: object

In [16]: matches = data.str.match(pattern, flags=re.IGNORECASE)

In [17]: matches
Out[17]: 
Dave     True
Rob      True
Steve    True
Wes       NaN
dtype: object

In [18]: matches.str.get(1)
Out[18]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [19]: matches.str[0]
Out[19]: 
Dave    NaN
Rob     NaN
Steve   NaN
Wes     NaN
dtype: float64

In [20]: data.str[:5]
Out[20]: 
Dave     dave@
Rob      rob@g
Steve    steve
Wes        NaN
dtype: object
Method Description  
cat Concatenate strings element-wise with optional delimiter  
contains Return boolean array if each string contains pattern/regex  
count Count occurrences of pattern  
extract Use a regular expression with groups to extract one or more strings from a Series of strings; the result will be a DataFrame with one column per group  
endswith Equivalent to x.endswith(pattern) for each element  
startswith Equivalent to x.startswith(pattern) for each element  
findall Compute list of all occurrences of pattern/regex for each string  
get Index into each element (retrieve i-th element)  
isalnum Equivalent to built-in str.isalnum  
isalpha Equivalent to built-in str.isalpha  
isdecimal Equivalent to built-in str.isdecimal  
isdigit Equivalent to built-in str.isdigit  
islower Equivalent to built-in str.islower  
isnumeric Equivalent to built-in str.isnumeric  
isupper Equivalent to built-in str.isupper  
join Join strings in each element of the Series with passed separator  
len Compute length of each string  
lower, upper Convert cases; equivalent to x.lower() or x.upper() for each element  
match Use re.match with the passed regular expression on each element, returning matched groups as list  
pad Add whitespace to left, right, or both sides of strings  
center Equivalent to pad(side='both')  
repeat Duplicate values (e.g., s.str.repeat(3) is equivalent to x * 3 for each string)  
replace Replace occurrences of pattern/regex with some other string  
slice Slice each string in the Series  
split Split strings on delimiter or regular expression  
strip Trim whitespace from both sides, including newlines  
rstrip Trim whitespace on right side  
lstrip Trim whitespace on left side  
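Of the methods in the table, extract was not demonstrated in the session above; it is often the cleanest way to pull regex groups into columns. A minimal sketch reusing the email data from earlier:

```python
import re

import numpy as np
import pandas as pd

# Same email data as in the session above
data = pd.Series({'Dave': 'dave@google.com', 'Steve': 'steve@gmail.com',
                  'Rob': 'rob@gmail.com', 'Wes': np.nan})

pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'

# str.extract returns a DataFrame with one column per capture group;
# NaN entries (and non-matching strings) become all-NaN rows
parts = data.str.extract(pattern, flags=re.IGNORECASE)
```

For Dave, the three columns hold 'dave', 'google', and 'com'; the Wes row is all NaN.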

This article was published on 2018-02-24 10:31.
